I don’t think we are at a plateau. We may have fed a large amount of text into these models, but when you add up all other kinds of media (images, videos, sound, 3D models), there’s a vastly richer dataset about the world. Sora showed that these models can learn a lot about physics and cause and effect just from video feeds. Once all of this is combined into multimodal mega-models, then we may be closer to the plateau.