(no title)
Wonderfall | 1 year ago
Additionally, I'm not dismissive of the non-linear nature of transformers, which I'm familiar with. The attention mechanism is far more complex than a linear relationship between the prediction and the past inputs, yes. But the end result remains sequential prediction. Ironically, diffusion models are kind of the opposite: sequential internally, but with parallel prediction at each step.
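To make the contrast concrete, here's a toy sketch (illustrative only, not real model code; all function names are hypothetical): an autoregressive loop emits one token at a time, each conditioned on the prefix so far, while a diffusion-style loop runs sequential refinement steps in which every position is updated in parallel.

```python
def autoregressive_generate(predict_next, length):
    """Sequential output: one new token per step, conditioned on the prefix."""
    tokens = []
    for _ in range(length):
        # the math inside predict_next may be highly parallel (attention),
        # but tokens still come out one after another
        tokens.append(predict_next(tokens))
    return tokens

def diffusion_generate(denoise_step, length, steps):
    """Sequential steps, but every position is refined in parallel each step."""
    seq = [0] * length           # start from "noise" (all zeros in this toy)
    for _ in range(steps):
        seq = denoise_step(seq)  # updates all positions at once
    return seq
```

With toy stand-ins (`predict_next` returning the prefix length, `denoise_step` incrementing every position), `autoregressive_generate(lambda p: len(p), 3)` yields `[0, 1, 2]` one token at a time, while `diffusion_generate(lambda s: [x + 1 for x in s], 3, 2)` yields `[2, 2, 2]` by refining the whole sequence twice.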
(Note: I've added a note on terminology, since the confusion arose from my use of "linearity", which was not referring to the attention mechanism itself. I've read so many papers that are perfectly fine with using "autoregressive" for this paradigm that I forgot some people coming from traditional statistics might be confused. Also, "based on the last word" was wrong; I meant "last words" or "previous words", obviously.)
All that being said, I don't think it's fair to say someone doesn't understand how transformers work solely because of a semantic interpretation. I appreciate the feedback, though!
nikhilsimha | 1 year ago
It could very well be that the internal mechanism of our thought has an auto-regressive reasoning component.
With the full system effectively "combining" short-term memory (what just happened) with "pruned" long-term memory (what relevant things I know from the past) and pushing that into a raw auto-regressive reasoning component.
It is also possible that another specialized auto-regressive reasoning component is driving the "prune" and "combine" operations. This whole system could be represented entirely within the larger network.
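One way to picture the system described above, as a toy sketch (all names and callables here are hypothetical stand-ins, not a claim about how brains or any real model work): prune long-term memory down to what's relevant, combine it with short-term context, and feed the result into a raw auto-regressive reasoning loop whose each step conditions on its own prior outputs.

```python
def reason(short_term, long_term, is_relevant, autoregress, steps):
    """Hypothetical sketch of the combine/prune/reason pipeline."""
    # "prune": keep only long-term memories relevant to the current context
    pruned = [m for m in long_term if is_relevant(m, short_term)]
    # "combine": merge short-term context with the pruned memories
    context = short_term + pruned
    # raw auto-regressive core: each step sees the context plus its own
    # previous outputs
    thoughts = []
    for _ in range(steps):
        thoughts.append(autoregress(context + thoughts))
    return thoughts
```

With toy callables, e.g. `reason(["a"], ["a", "b"], lambda m, st: m in st, lambda ctx: len(ctx), 3)`, the pruning keeps only `"a"`, and the loop produces `[2, 3, 4]` as each step conditions on one more of its own outputs.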
The argument that "intelligence cannot be auto-regressive" seems to me to be without basis.
> there is strong evidence that not all thinking is linguistic or sequential.
It is possible that a system wrapping a core auto-regressive reasoner could produce non-sequential thinking, even if you don't allow for weight updates.
Wonderfall | 1 year ago
I also mentioned that I'm supportive of architectures that will integrate autoregressive components. Totally agree with that.