top | item 44190909


itkovian_ | 9 months ago

I don’t want to bash the guy since he’s still in his PhD, but it’s written in such a confident tone for something that is so all over the place that I think it’s fair game.

Like a lot of the symbolic/embodied people, the issue is they don’t have a deep understanding of how the big models work or are trained, so they come to weird conclusions. Like things that aren’t wrong but make you go ‘ok.. but what are you trying to say’.

E.g. ‘Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally.’ This seems to lack the understanding that a vision transformer is completely identical to a standard transformer except for the tokenization, which is just embedding a grid of patches and adding positional embeddings. Transformers are so general that what he’s asking us to do is exactly what everyone is already doing. Everything is early fusion now too.
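To make the point concrete, here is a minimal NumPy sketch of what ViT-style “tokenization” actually is: slicing the image into patches and linearly projecting them. The patch size, embedding dimension, and random matrices below are illustrative placeholders, not values from any particular model; everything after this step is a standard transformer.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return p  # shape: (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))    # a stand-in image
tokens = patchify(img)                      # (196, 768): 14x14 patches of 16*16*3
W_embed = rng.standard_normal((768, 512))   # learned linear projection (random here)
pos = rng.standard_normal((196, 512))       # learned positional embeddings (random here)
x = tokens @ W_embed + pos                  # token sequence fed to a plain transformer
print(x.shape)  # (196, 512)
```

That single reshape-and-project is the entire modality-specific part; the sequence `x` is indistinguishable in kind from a sequence of text-token embeddings.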

“The overall promise of scale maximalism is that a Frankenstein AGI can be sewed together using general models of narrow domains.” No one is suggesting this — everyone wants to do it end to end, and also thinks that’s the most likely thing to work. Some proposals, like LeCun’s JEPAs, do induce some structure in the architecture, but even there the driving force is to allow gradients to flow everywhere.

For a lot of the other conclusions, the statements are literally almost equivalent to ‘to build agi, we need to first understand how to build agi’. Zero actionable information content.


nemjack | 9 months ago

I don't think you're quite right. The author is arguing that images and text should not be processed differently at any point. Current early fusion approaches are close, but they still treat modalities differently at the level of tokenization.

If I understand correctly he would advocate for something like rendering text and processing it as if it were an image, along with other natural images.

Also, I would counter and say that there is some actionable information, but it's pretty abstract. In terms of uniting modalities he is bullish on tapping human intuition and structuralism, which should give people pointers to actual books for inspiration. In terms of modifying the learning regime, he's suggesting something like an agent-environment RL loop, not a generative model, as a blueprint.
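For readers unfamiliar with the contrast being drawn: an agent-environment RL loop looks something like the toy sketch below. The environment, reward, and tabular Q-learning update are stand-ins of my own, not anything from the article — the point is only the shape of the loop (act, observe, get reward, update), versus fitting a generative model to a static dataset.

```python
import random

random.seed(0)

def environment_step(state, action):
    """Toy environment: reward the agent for taking a hidden target action."""
    reward = 1.0 if action == 1 else 0.0
    next_state = (state + action) % 4
    return next_state, reward

def policy(state, q, eps=0.1):
    """Epsilon-greedy over a tabular Q function."""
    if random.random() < eps:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: q[(state, a)])

q = {(s, a): 0.0 for s in range(4) for a in (0, 1)}
state = 0
for _ in range(500):
    action = policy(state, q)
    next_state, reward = environment_step(state, action)
    # Tabular Q-learning update: learning comes from interaction, not a dataset
    best_next = max(q[(next_state, a)] for a in (0, 1))
    q[(state, action)] += 0.1 * (reward + 0.9 * best_next - q[(state, action)])
    state = next_state
```

After enough interaction the greedy policy prefers the rewarded action in every state; the key structural difference from a generative model is that the data distribution itself depends on the agent's own behavior.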

There's definitely stuff to work with here. It's not totally mature, but not at all directionless.

itkovian_ | 9 months ago

Saying we should tokenize different modalities the same way would be analogous to saying that in order to be really smart, a human has to listen with their eyes. At some point there has to be SOME modality-specific preprocessing. The thing is, in all current SOTA architectures this modality-specific preprocessing is very, very shallow — almost trivially shallow. I feel this is the piece of information that may be missing for people with this view. In the multimodal models everything is moving to a shared representation very rapidly — that’s clearly already happening.

On the ‘we need to do an RL loop rather than a generative model’ point — I’d say this is the consensus position today!