(no title)
itkovian_ | 9 months ago
As with a lot of the symbolic/embodied people, the issue is that they don’t have a deep understanding of how the big models work or are trained, so they come to weird conclusions. Things that aren’t wrong, exactly, but that make you go ‘ok.. but what are you trying to say?’.
E.g. ‘Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally.’ This seems to lack the understanding that a vision transformer is completely identical to a standard transformer except for the tokenization, which is just embedding a grid of patches and adding positional embeddings (see the sketch below). Transformers are so general that what he’s asking us to do is exactly what everyone is already doing. Everything is early fusion now too.
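To make that concrete, here’s a minimal sketch (not from the article; the class name PatchTokenizer and the hyperparameters are illustrative) showing that a ViT’s “tokenizer” is just patch embedding plus positional embeddings, after which the model is a bog-standard transformer encoder:

```python
# Minimal sketch: a "vision transformer" is a standard transformer encoder
# whose tokenizer is patch embedding + positional embeddings.
# Hyperparameters (patch=16, dim=256, 224x224 inputs) are illustrative.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=256, n_patches=(224 // 16) ** 2):
        super().__init__()
        # Embedding a grid of patches: one linear projection per non-overlapping patch.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        # Learned positional embeddings, one per patch position.
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))

    def forward(self, images):                 # (B, 3, 224, 224)
        x = self.proj(images)                  # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, dim): an ordinary token sequence
        return x + self.pos

# Everything after tokenization is a completely standard transformer encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=6,
)
tokens = PatchTokenizer()(torch.randn(2, 3, 224, 224))
out = encoder(tokens)                          # (2, 196, 256)
```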
“The overall promise of scale maximalism is that a Frankenstein AGI can be sewed together using general models of narrow domains.” No one is suggesting this.. everyone wants to do it end to end, and also thinks that’s the most likely thing to work. Some proposals, like LeCun’s JEPAs, do suggest inducing some structure in the architecture, but even there the driving force is to let gradients flow everywhere.
For a lot of the other conclusions, the statements are almost literally equivalent to ‘to build AGI, we first need to understand how to build AGI’. Zero actionable information content.
nemjack | 9 months ago
If I understand correctly, he would advocate for something like rendering text and processing it as if it were an image, alongside other natural images.
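If that reading is right, the idea would look roughly like the sketch below (my own illustration, not the author’s code; render_text and the 224x224 canvas are assumptions): rasterize the text into pixels so it can be patch-tokenized exactly like a natural image.

```python
# Hedged sketch of the "render text as an image" idea. Names and sizes are
# illustrative, not from the article. Requires Pillow (PIL), NumPy, and PyTorch.
import numpy as np
import torch
from PIL import Image, ImageDraw

def render_text(text, size=(224, 224)):
    """Rasterize a string onto a white canvas so it can be patch-tokenized
    the same way as any other image (e.g. by the PatchTokenizer sketched above)."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((8, 8), text, fill="black")   # default bitmap font
    arr = np.asarray(img, dtype=np.float32) / 255.0         # (H, W, 3) in [0, 1]
    return torch.from_numpy(arr).permute(2, 0, 1)            # (3, H, W)

pixels = render_text("The quick brown fox").unsqueeze(0)    # (1, 3, 224, 224)
# pixels can now go through the same patch embedding as any natural image.
```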
Also, I would counter that there is some actionable information, but it's pretty abstract. In terms of uniting modalities, he is bullish on tapping human intuition and structuralism, which should give people pointers to actual books for inspiration. In terms of modifying the learning regime, he's suggesting something like an agent-environment RL loop, not a generative model, as a blueprint (a toy version of that loop is sketched below).
There's definitely stuff to work with here. It's not totally mature, but not at all directionless.
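For the agent-environment point, here's a toy, self-contained sketch (everything in it, including ToyEnv and the random policy, is illustrative) of the observe-act-reward interaction loop he seems to have in mind, as opposed to a generative-model training loop:

```python
# Toy agent-environment interaction loop: observe -> act -> receive reward.
# Purely illustrative; a real setup would swap in a learned policy and a
# policy-update step driven by the reward signal.
import random

class ToyEnv:
    """Minimal environment: consumes an action, returns (observation, reward, done)."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state % 2 else 0.0
        self.state += 1
        done = self.state >= 10
        return self.state, reward, done

def policy(observation):
    # Stand-in for a learned agent; here just a random choice between two actions.
    return random.choice([0, 1])

env = ToyEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = policy(obs)                   # agent acts on the current observation
    obs, reward, done = env.step(action)   # environment responds with feedback
    total += reward                        # learning would update the policy from this
print(total)
```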
itkovian_ | 9 months ago
On the ‘we need an RL loop rather than a generative model’ point - I’d say this is the consensus position today!