Some thoughts on autoregressive models

79 points | Wonderfall | 1 year ago | wonderfall.dev

58 comments

fergal_reid|1 year ago

Similar arguments to LeCun.

People are going to keep saying this about autoregressive models, how small errors accumulate and can't be corrected, while we literally watch reasoning models say things like "oh that's not right, let me try a different approach".

To me, this is like people saying "well NAND gates clearly can't sort things so I don't see how a computer could".

Large transformers can clearly learn very complex behavior, and the limits of that are not obvious from their low level building blocks or training paradigms.

dartos|1 year ago

> while we literally watch reasoning models say things like "oh that's not right, let me try a different approach".

Not saying I disagree with your premise that errors can be corrected by using more and more tokens, but this argument is weird to me.

The model isn’t intentionally generating text. The kinds of “oh let me try a different approach” lines I see are often followed by the same approach just taken. I wouldn’t say most of the time, but often enough that I notice.

Just because a model generates text doesn’t mean that the text actually represents anything at all, let alone a reflection of an internal process.

PartiallyTyped|1 year ago

I'd argue that humans are by definition autoregressive "models", and we can change our minds mid-thought as we process logical arguments. The issue of small errors accumulating makes sense if there is no evaluation and recovery, but clearly, both evaluation and recovery are done.

Of course, this usually requires the human to have some sense of humility and admit their mistakes.

I wonder, what if we trained more models with data that self-heals or recovers mid sentence?

yorwba|1 year ago

As the number of self-corrections increases, so does the likelihood that the model will say "oh that's not right, let me try a different approach" after finding the correct solution. Then you can get into a second-guessing loop that never arrives at the correct answer.

If the self-check is more reliable than the solution-generating process, that's still an improvement, but as long as the model makes small errors when correcting itself, those errors will still accumulate. On the other hand, if you can have a reliable external system do the checking, you can actually guarantee correctness.
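This trade-off is easy to see in a toy simulation (all accuracy numbers below are hypothetical, purely for illustration): an imperfect self-check lifts accuracy well above the raw generator, an uninformative one changes nothing, and no amount of rechecking reaches certainty, because the checker sometimes rejects answers that were already correct.

```python
import random

def run(gen_acc, check_acc, max_rounds, trials=100_000, seed=0):
    """Toy model: a generator is correct with probability gen_acc; a
    self-check judges the current answer correctly with probability
    check_acc and triggers a regeneration when it judges it wrong.
    Returns the fraction of trials ending with a correct answer."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = rng.random() < gen_acc
        for _ in range(max_rounds):
            # Checker says "looks right" with prob check_acc if the answer
            # is correct, and with prob 1 - check_acc if it is wrong.
            if (rng.random() < check_acc) == correct:
                break  # accepted: stop self-correcting
            correct = rng.random() < gen_acc  # try a different approach
        wins += correct
    return wins / trials

print(run(0.6, 0.9, 5))   # well above the raw 0.6 generator accuracy
print(run(0.6, 0.5, 5))   # coin-flip checker: stays near 0.6
print(run(0.6, 0.9, 50))  # plateaus below 1.0: correct answers get rejected too
```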

energy123|1 year ago

Yann LeCun's prediction was empirically refuted. He says that the longer LLMs run, the less accurate they get. OpenAI showed the opposite is true.

Wonderfall|1 year ago

LeCun is for sure a source of inspiration, and I think he has a fair critique that still holds true despite what people think when they see reasoning models in action. But unlike him, I don't think autoregressive models are a doomed path or whatever. I just like to question things (and don't have absolute answers).

I-JEPA and V-JEPA have recently shown promising results as well.

Tostino|1 year ago

I think recurrent training approaches like those discussed in COCONUT and similar papers show promising potential. As these techniques mature, models could eventually leverage their recurrent architecture to perform tasks requiring precise sequential reasoning, like odd/even bit counting that current architectures struggle with.

aoeusnth1|1 year ago

I think the author is projecting significantly when he says the goal of AI researchers is to understand and replicate how humans think. If you start from that wrong assumption of course it looks silly for them to be doing anything other than neuroscience research, the author's field.

It's like saying the stockfish developers should stop researching mixed NN and search methods because they don't understand how humans play chess yet.

Wonderfall|1 year ago

This is mainly a misunderstanding due to the way I phrased it. This is what I think, and I know for a fact it's the case for other AI researchers too, having watched many conferences. "All of them" is not what I meant (I wrote "many other"), and we certainly need people to approach problems from different perspectives and backgrounds, since they will benefit from each other in the end. Not going to lie, I'm a bit disappointed to see these kinds of comments.

ziofill|1 year ago

Is anyone aware of a formalization of the idea that to get “symbols” out of fuzzy probability distributions one needs distributions whose value goes exactly to zero over some regions of the domain? I.e. Gaussian mixtures won’t cut it. And they will need very high Fourier frequencies.

I have the gut feeling that until a model allows for a small probability that 2x3 is 7, there will always be hallucinations. Probabilities need to be clamped to zero to emulate symbolic behaviour.

TeMPOraL|1 year ago

Symbolic behavior is artificial and not how humans think either. 0 is not a probability (neither is 1) - a value of 0 or 1 basically breaks calculations by dragging everything along to the limit, the same way infinity does, or 0 in the denominator (in fact, that's what 1 and 0 translate to if you switch to logprobs or other equivalent ways to calculate probabilities).

Consider: if you clamp the probability distribution of answers to 2x3, so that it's 0 everywhere else and 1 at 6, you're basically saying that it is fundamentally impossible for you to misunderstand the question, or make a mistake in the answer, or that you're dreaming, or hallucinating, or that you've momentarily forgotten that the question was preceded by "In base 4, what is ", or any number of other things that absolutely are possible, even if highly unlikely, in the real world.
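The "0 and 1 behave like infinities" point is easiest to see in log space, where probabilities of independent events add instead of multiply (a small illustration, not from the thread):

```python
import math

# Independent probabilities multiply; their log-probabilities add.
probs = [0.9, 0.8, 0.95]
logp = sum(math.log(p) for p in probs)
assert math.isclose(math.exp(logp), 0.9 * 0.8 * 0.95)

# A vanishingly small probability is just a very negative log-prob;
# later evidence can still move the running total:
print(logp + math.log(1e-300))  # large in magnitude, but finite

# A hard 0 maps to -infinity and absorbs everything that follows,
# the log-space analogue of multiplying by zero: no later term can
# ever recover it. (math.log(0.0) itself raises ValueError in Python.)
assert logp + float("-inf") == float("-inf")
```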

toxik|1 year ago

We do clamp probabilities to zero. Look into top-p sampling or nucleus sampling.
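A minimal sketch of what that clamping looks like (illustrative only; real implementations work on logits and also handle temperature and ties):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, zero out the rest, renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]        # token indices, most likely first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # how many tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# A distribution over 5 hypothetical tokens:
dist = [0.5, 0.3, 0.15, 0.04, 0.01]
print(top_p_filter(dist, p=0.9))  # tail tokens clamped to exactly 0
```

Note this truncation happens at sampling time; the model's learned distribution itself still assigns the tail nonzero probability.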

hansvm|1 year ago

> By design, AR models lack planning and reasoning capabilities. If you generate one word at a time, you don’t really have a general idea of where you’re heading.

I have one minor quibble here, which is that the limitation described isn't a criticism of AR models (whose outputs are only "backward-looking" for their inputs), but just of a subset of AR models in popular use. An AR model is fully capable of generating a large state space and doing many computations (even doing many fully-connected diffusion steps) before generating the first output token.

That quibble wouldn't be worth mentioning unless AR models had some sort of advantage, but they do, and it's incredibly important. AR factorization of the conditional probabilities allows you to additively consider the loss contribution from each output token -- you can blindly shove whatever data you want into the thing, add up all the errors, and backpropagate, all while guaranteeing that the distribution you're learning is the same distribution from your training data.
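The additive-loss property falls straight out of the chain-rule factorization p(x_1..x_n) = prod_t p(x_t | x_<t); a quick numeric check, with hypothetical per-token conditional probabilities:

```python
import math

# Hypothetical probabilities the model assigns each token of one
# training sequence, conditioned on the tokens before it:
cond_probs = [0.9, 0.6, 0.75, 0.8]

# Joint sequence probability via the chain rule...
joint = math.prod(cond_probs)

# ...so the sequence negative log-likelihood is exactly the sum of
# per-token losses, which is what lets AR training just add up the
# error at each position and backpropagate.
per_token_nll = [-math.log(p) for p in cond_probs]
assert math.isclose(sum(per_token_nll), -math.log(joint))
```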

If you're not careful, via some mechanism (like AR), the distribution you learn will have almost nothing to do with the distribution you're training on -- a common failure mode being a tendency to predict "average-looking" sub-tiles in a composite image and only predict images which can be composed of those smaller, average-looking sub-tiles. Imagine (as an example, with low enough model capacity) you had a model generating people and everyone was vaguely 5'10", ambiguously gendered, and a bit tan, contrasted with that same model trained using AR, where you'd expect the outputs to be bad in other ways if you had insufficient capacity but to at least have a mix of colors, heights, and genders. Increasing capacity can help, but why bother when something like AR solves it by definition?
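The "everyone comes out vaguely 5'10" and a bit tan" failure can be shown with a toy bimodal dataset (all numbers hypothetical): a squared-error regressor is optimal at the dataset mean, which sits between the modes and resembles nobody, while sampling from a distribution matched to the data stays near one mode or the other.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two "modes": hypothetical heights (cm) of two subpopulations.
data = np.concatenate([rng.normal(165, 3, 500), rng.normal(183, 3, 500)])

# A model trained to regress the output directly (L2 loss) is optimal
# at the dataset mean: an "average-looking" output nobody resembles.
mse_optimal = data.mean()
print(mse_optimal)  # between the modes, far from both

# A likelihood-trained sampler that matches the data distribution
# instead produces samples near one mode or the other.
samples = rng.choice(data, size=1000)
near_a_mode = np.mean((abs(samples - 165) < 9) | (abs(samples - 183) < 9))
print(near_a_mode)  # almost all samples sit near a real mode
```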

mxwsn|1 year ago

> But what is the original purpose of AI research? I will speak for myself here, but I know many other AI researchers will say the same: the ultimate goal is to understand how humans think. And we think the best (or the funniest) way to understand how humans think is to try to recreate it.

Eh. To riff on Dijkstra, this is like submarine engineers saying their ultimate goal is to understand how fish swim.

Wonderfall|1 year ago

I come from a medical science background, where I studied the brain from a "traditional" neuroscience perspective (biology, pathology, anatomy, psychology and whatnot). That the best way to understand it is actually to try to recreate it is honestly how I feel whenever I read about AI advancements where the clear goal is to achieve/surpass human intelligence, something we don't fully understand yet.

“What I cannot create, I do not understand.” someone clever once said.

CamperBob2|1 year ago

The author (and Chomsky) fail to understand that LLMs (as well as human brains) are not just autoregressive models, but nonlinear autoregressive models. Put a slightly different way, you can describe LLMs as autoregressive, but only by taking liberties with the classical definition of 'autoregressive.'

The human mind is not, like ChatGPT and its ilk, a lumbering statistical engine for pattern matching, gorging on hundreds of terabytes of data and extrapolating the most likely conversational response or most probable answer to a scientific question. On the contrary, the human mind is a surprisingly efficient and even elegant system that operates with small amounts of information; it seeks not to infer brute correlations among data points but to create explanations. – Noam Chomsky

It's as if Chomsky has either never heard of transformers, or doesn't understand what they do.

Before speaking a sentence, we have a general idea of what we’re going to say; we don’t really choose what to say next based on the last word. That kind of planning isn’t something that can be represented sequentially.

It's as if the author (and Chomsky) has never seen a CoT model in action.

Wonderfall|1 year ago

Author here and I welcome the feedback, but I don't really understand your point. My post is clearly not dismissive of efforts to make LLMs reason using CoT prompting techniques and post-training, and I think such efforts are even mentioned. The model remains autoregressive either way, and this reasoning is not some kind of magic that makes them behave differently - these improvements only make them perform (much) better on given tasks.

Additionally, I'm not dismissive of the non-linear nature of transformers which I'm familiar with. Attention mechanism is a lot more complex than a linear relationship between the prediction and the past inputs, yes. But the end result remains sequential prediction. Ironically, diffusion models are kind of the opposite: sequential internally, parallel prediction at each step.

(Note: I added a note on terminology, since the confusion arose from my use of "linearity", which was not referring to the attention mechanism itself. I've read so many papers that are perfectly fine with using "autoregressive" for this paradigm that I forgot some people coming from traditional statistics may be confused. Also, "based on the last word" was wrong; I meant "last words" or "previous words", obviously.)

All that being said, I don't think it's fair to say one doesn't understand how transformers work solely because of semantic interpretation. I appreciate the feedback though!

DeathArrow|1 year ago

>Most generative AI models nowadays are autoregressive.

There are also diffusion based models which don't rely on next token prediction.

Wonderfall|1 year ago

Yeah, this is mentioned in the article. (The LLaDa paper is even what triggered its writing!)

suddenlybananas|1 year ago

>Isn’t language by itself linear.

We've known that language is hierarchical, not linear, for hundreds of years at this point.

Wonderfall|1 year ago

I guess semantics matter. Language is primarily hierarchical, but its presentation is what's linear. And LLMs mainly learn and work from this presentation; the question, and one of the main points, is whether emergent patterns are enough evidence to show that there's hierarchical thinking.

aithrowawaycomm|1 year ago

> You can say LLMs are fundamentally dumb because of their inherent linearity. Are they? Isn’t language by itself linear (more precisely, the presentation of it)?

Any linearity (or at least partial ordering) of intelligence comes from time and causality, not language - in fact the linearity of language is a limitation human cognition struggles to fight against.

I think this is where "chimpanzees are intelligent" comes to the rescue - AI has a nasty habit of focusing too much on humans. It is vacuous to think that chimpanzee intelligence can be reduced to a linear sequence of oohs-and-aahs, although I suspect a transformer trained on thousands of hours of chimp vocalizations could keep a real chimp busy for a long time. Ape cognition is much deeper and more mysterious: imperfect "axioms" and "algorithms" about space, time, numbers, object-ness, identifying other intelligences, etc, seem to be somehow built-in, and all apes seem to share deep cognitive tools like self-reflection, estimating the cognitive complexity of a task, robust quantitative reasoning, and so on. Nor does it really make sense to hand-wave about "evolutionary training data" - there are stark micro- and macro-architectural differences between primate brains and squirrel brains. Not to mention that all species have the exact same amount of data - if it was just about millions of years, why are bees and octopi uniquely intelligent among invertebrates? Why aren't there any chimpanzee-level squirrels? Rather than twisting into knots about "high quality evolutionary data," it makes a lot more sense to point towards evolution pressuring the development of different brain architectures with stronger cognitive abilities. (Especially considering how rapidly modern human intelligence seems to have evolved - much more easily explained by sudden favorable mutations vs stumbling into an East African data treasure trove.)

Human intelligence uses these "algorithms" + the more modern tool of language to reason about the world. I believe any AI system which starts with language and sensory input[1], then hopes to get causality/etc via Big Data is doomed to failure: it might be an exceptionally useful text generator/processor but there will be infinite families of text-based problems that toddlers can solve but the AI cannot.

[1] I also think sight-without-touch is doomed to failure, especially with video generation, but that's a different discussion. And AIs can somewhat cheat "touch" if they train extensively on a good video game engine (I see RDR2 is used a lot).

eldenring|1 year ago

> The context window can be compared to working memory in humans: it’s fast, efficient but gets rapidly overloaded. Humans manage this limitation by offloading previously learned information into other memory forms, whereas LLMs can only mimic this process superficially at best.

This is just silly. Humans forget things all the time! If I want to remember something I write it down.

> The nature of hallucination is very different between AR models and humans, as one has a world model and the other doesn’t.

I stopped reading at this point. There's not much signal here, just basic facts about LLMs and then leaps to very bold statements.

Here is an interesting experiment I use to help people understand next token prediction. Think of a simple math problem in your head, maybe 3 digit by 2 digit multiplication. Then speak out every single thought you have while solving it.

raylad|1 year ago

Do you think in words when you do a 3 x 2 digit multiplication?

I do it all in images and I think many other people do too.

Wonderfall|1 year ago

> There's not much signal here, just basic facts about LLMs and then leaps to very bold statements.

The article wasn't supposed to be informative for people who already know how LLMs work. Like the title said, just wanted to write down some thoughts.

> This is just silly. Humans forget things all the time! If I want to remember something I write it down.

The opposite was never stated. Human memory is of course selective.

> Here is an interesting experiment I use to help people understand next token prediction. Think of a simple math problem in your head, maybe 3 digit by 2 digit multiplication. Then speak out every single thought you have while solving it.

Now a point I'm happy to discuss! The process of solving it is actually quite autoregressive-like, but this is also an example of a common pitfall with LLMs: they purely rely on pattern matching because they don't have the internal representation of what they really deal with (algebra). But we all know that.

The main question is whether LLMs taught to reason actually show that they have this kind of representation. They still work very differently, I'd say; even for tasks that seem trivial to humans, reasoning LLMs will make a lot of mistakes before arriving at a plausible-sounding result. Because the model was trained to reason, there's a higher chance now that the plausible-sounding result is actually correct. But this property is actually quite interesting once applied to complex tasks that would take too much time and be too overwhelming for humans, and that's where they shine as powerful tools.