(no title)
Wonderfall | 1 year ago
Additionally, I'm not dismissive of the non-linear nature of transformers, which I'm familiar with. The attention mechanism is far more complex than a linear relationship between the prediction and the past inputs, yes. But the end result remains sequential prediction. Ironically, diffusion models are kind of the opposite: sequential internally, but with parallel prediction at each step.
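To make the contrast concrete, here's a toy sketch (illustrative only, not real model code; all function names are hypothetical): an autoregressive loop emits one token at a time, each conditioned on the prefix so far, while a diffusion-style loop runs sequential refinement steps in which every position is updated in parallel.

```python
def autoregressive_generate(predict_next, length):
    """Sequential output: one new token per step, conditioned on the prefix."""
    tokens = []
    for _ in range(length):
        # the math inside predict_next may be highly parallel (attention),
        # but tokens still come out one after another
        tokens.append(predict_next(tokens))
    return tokens

def diffusion_generate(denoise_step, length, steps):
    """Sequential steps, but every position is refined in parallel each step."""
    seq = [0] * length           # start from "noise" (all zeros in this toy)
    for _ in range(steps):
        seq = denoise_step(seq)  # updates all positions at once
    return seq
```

With toy stand-ins (`predict_next` returning the prefix length, `denoise_step` incrementing every position), `autoregressive_generate(lambda p: len(p), 3)` yields `[0, 1, 2]` one token at a time, while `diffusion_generate(lambda s: [x + 1 for x in s], 3, 2)` yields `[2, 2, 2]` by refining the whole sequence twice.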
(Note: I've added a note on terminology, since the confusion arose from my use of "linearity", which was not referring to the attention mechanism itself. I've read so many papers that are perfectly fine with using "autoregressive" for this paradigm that I forgot some people coming from traditional statistics might be confused. Also, "based on the last word" was wrong; I meant "last words" or "previous words", obviously.)
All that being said, I don't think it's fair to say someone doesn't understand how transformers work solely because of a semantic interpretation. I appreciate the feedback, though!
nikhilsimha | 1 year ago
It could very well be that the internal mechanism of our thought has an auto-regressive reasoning component.
With the full system effectively "combining" short-term memory (what just happened) with "pruned" long-term memory (what relevant things I know from the past) and pushing that into a raw auto-regressive reasoning component.
It is also possible that another specialized auto-regressive reasoning component is driving the "prune" and "combine" operations. This whole system could be represented entirely within the larger network.
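One way to picture the system described above, as a toy sketch (all names and callables here are hypothetical stand-ins, not a claim about how brains or any real model work): prune long-term memory down to what's relevant, combine it with short-term context, and feed the result into a raw auto-regressive reasoning loop whose each step conditions on its own prior outputs.

```python
def reason(short_term, long_term, is_relevant, autoregress, steps):
    """Hypothetical sketch of the combine/prune/reason pipeline."""
    # "prune": keep only long-term memories relevant to the current context
    pruned = [m for m in long_term if is_relevant(m, short_term)]
    # "combine": merge short-term context with the pruned memories
    context = short_term + pruned
    # raw auto-regressive core: each step sees the context plus its own
    # previous outputs
    thoughts = []
    for _ in range(steps):
        thoughts.append(autoregress(context + thoughts))
    return thoughts
```

With toy callables, e.g. `reason(["a"], ["a", "b"], lambda m, st: m in st, lambda ctx: len(ctx), 3)`, the pruning keeps only `"a"`, and the loop produces `[2, 3, 4]` as each step conditions on one more of its own outputs.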
The argument that "intelligence cannot be auto-regressive" seems to me to be without basis.
> there is strong evidence that not all thinking is linguistic or sequential.
It is possible that a system wrapping a core auto-regressive reasoner could produce non-sequential thinking, even if you don't allow for weight updates.
Wonderfall | 1 year ago
I also mentioned that I'm supportive of architectures that will integrate autoregressive components. Totally agree with that.