
Why are LLMs general learners?

64 points | pgspaintbrush | 2 years ago | intuitiveai.substack.com

61 comments


Paul-Craft|2 years ago

This seems like just another way of saying that when you train an LLM on a text, its weights incorporate the tokens in that text, which is nothing really profound.

I think the real magic here comes from the fact that LLMs are a specialized sort of neural network, and that neural networks are universal approximators [0]. In other words, LLMs are general learners because they are neural networks.

This is also not particularly profound, except that there are mathematical proofs of the universal approximation theorem that give us insight into why it must be so.

---

[0]: https://en.wikipedia.org/wiki/Universal_approximation_theore...
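As a concrete (if toy) illustration of the theorem: a single hidden layer of ReLUs with hand-chosen weights can reproduce f(x) = x^2 on [0, 1] as a piecewise-linear interpolant. This is a sketch of existence, not of how such a network would be learned:

```python
# Hand-built one-hidden-layer ReLU net approximating f(x) = x^2 on [0, 1].
# The universal approximation theorem guarantees such a net exists; here we
# construct one directly as a piecewise-linear interpolant at a few knots.

def relu(z):
    return max(z, 0.0)

knots = [i / 10 for i in range(11)]      # 0.0, 0.1, ..., 1.0
targets = [k * k for k in knots]         # values of x^2 at the knots

# Slope of each linear segment; each hidden unit relu(x - k) contributes
# the *change* in slope at knot k.
slopes = [(targets[i + 1] - targets[i]) / (knots[i + 1] - knots[i])
          for i in range(len(knots) - 1)]
weights = [slopes[0]] + [slopes[i] - slopes[i - 1]
                         for i in range(1, len(slopes))]

def net(x):
    # One hidden layer: weighted sum of ReLUs, no bias needed since f(0) = 0.
    return sum(w * relu(x - k) for w, k in zip(weights, knots))

print(net(0.5))   # ~0.25, exact at a knot
print(net(0.55))  # ~0.305 between knots, vs. the true 0.3025
```

More knots shrink the between-knot error, which is the theorem's content: with enough hidden units the approximation gets arbitrarily good.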

tarvaina|2 years ago

The ingredients you need for training a useful machine learning model are expressivity, learnability, and generalization. Many methods are universal approximators but that only takes care of the first ingredient. Arguably the reason neural networks are so successful is that they can offer a good balance between the three.

Before transformers we built different neural network architectures for each domain. These architectures offered better inductive biases for their respective domains and thus traded off some of the expressivity for better learnability and generalization.

Nowadays the best architectures seem to be merging towards transformers. They appear to offer more generally useful inductive biases and thus a better trade-off between the three ingredients than the earlier architectures.

im3w1l|2 years ago

A lot of universal approximators are piss poor at general learning. It's taken a lot of hard work and clever people to get LLMs to where they are. It's not as simple as "neural network and done."

dgreensp|2 years ago

LLMs are not particularly good at arithmetic, counting syllables, or recognizing haikus, though, because (contrary to the thesis of the article) they don’t magically acquire whatever ability would “simplify” predicting the next token.

I don’t feel like the points made here align with any insight about the workings of LLMs. The fact that, as a human, I “wouldn’t know where to start” when asked to add two numbers without doing any addition doesn’t apply to computers (running predictive models). They would start with statistics over lots of similar examples in the training data. It’s still remarkable LLMs do so well on these problems, while at the same time doing somewhat poorly because they can’t do arithmetic!

pgspaintbrush|2 years ago

Author here. First off, thank you for reading and for your thoughts. I provided examples that I thought would be intuitive for humans to help folks understand that an understanding of the underlying phenomena is useful for next token prediction (I've added this as a note). Could you share what part of the article came across as suggesting that LLMs "magically" acquire whatever ability helps them to predict? I'd like to make that section clearer, so that doesn't come across.

Re: "LLMs are not particularly good at arithmetic". There are published results that show that LLMs using certain techniques reach close to 100% accuracy on 8-digit addition: https://arxiv.org/pdf/2206.07682.pdf. There are also recent results from OpenAI where their model obtained solid results on high school math competition problems, which are harder than arithmetic: https://openai.com/research/improving-mathematical-reasoning... I haven't looked into counting syllables or recognizing haikus but I bet that this is a result of tokenization and not an inability of the model to create a representation of the underlying phenomena.

iliane5|2 years ago

> LLMs are not particularly good at arithmetic, counting syllables, or recognizing haikus

I suspect most of this is due to tokenization making it difficult to generalize these concepts.

There are some weird edge cases though: for example, GPT-4 can almost always add two 40-digit numbers, but it is almost always wrong when adding a 40-digit number and a 35-digit number.
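A sketch of what that misalignment could look like, assuming (purely for illustration) a tokenizer that chunks digits left-to-right in groups of three:

```python
# Illustration of the tokenization hypothesis: if digits are chunked
# left-to-right in fixed groups, two numbers of different lengths get
# misaligned chunks, so "column-wise" addition patterns learned on
# same-length pairs stop lining up. (Real BPE vocabularies chunk digits
# in their own, messier ways; this is a simplified stand-in.)

def chunk(digits, size=3):
    return [digits[i:i + size] for i in range(0, len(digits), size)]

a = "1234567890"   # 10 digits
b = "34567890"     # 8 digits: same trailing digits as a
print(chunk(a))    # ['123', '456', '789', '0']
print(chunk(b))    # ['345', '678', '90'] -- same suffix, different chunks
```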

Terr_|2 years ago

> LLMs are not particularly good at arithmetic

I'm reminded of "Benny's Rules", where someone sat down with a "self-directed" 6th grader of high IQ who had been doing okay in math classes... but their success so far was actually based on painstakingly constructing somewhat-lexical rules about "math", mumbo-jumbo that had been just good enough to carry them through a lot of graded tests.

> Benny believed that the fraction 5/10 = 1.5 and 400/400 = 8.00, because he believed the rule was to add the numerator and denominator and then divide by the number represented by the highest place value. Benny was consistent and confident with this rule and it led him to believe things like 4/11 = 11/4 = 1.5.

> Benny converted decimals to fractions with the inverse of his fraction-to-decimal rule. If he needed to write 0.5 as a fraction, "it will be like this ... 3/2 or 2/3 or anything as long as it comes out with the answer 5, because you're adding them" (Erlwanger, 1973, p. 50).

[0] https://blog.mathed.net/2011/07/rysk-erlwangers-bennys-conce...

8note|2 years ago

I'm surprised they're bad at predicting haikus.

I assume it's because there's little documentation on the internet about how many syllables each word has?

ipnon|2 years ago

Transformers don’t predict next tokens, right? They predict sequences based on their self-attention to some preceding token sequence?

Ireallyapart|2 years ago

> LLMs are not particularly good at arithmetic, counting syllables, or recognizing haikus, though, because (contrary to the thesis of the article) they don’t magically acquire whatever ability would “simplify” predicting the next token.

LLMs understand it to a certain extent. It's more than "predicting" the next token. Describing what they do as merely "predicting the next token" is a naive and reductive description that covers up what we don't understand.

I mean you can describe a human brain as simply wetware, a jumble of signals and chemical reactions that twitch muscles and react to pressure waves in the air and light. But obviously there is a higher level description of the human brain that is missing from that description.

The same thing could be said about LLMs. I can tell you this: researchers completely understand token prediction, that much can be said. What we don't currently understand is the high-level description. Perhaps it's not something we can understand, as we've never been able to understand human consciousness at a high level either.

That's the thing with people. Nobody actually understands the high-level description of a fully trained LLM. People lambast others because they "think" they understand, when they only actually understand the low-level primitives. We understand assembly, but we don't understand the operating system written in assembly.

Take this for example:

     Me: 4320598340958340958340953095809348509348503480958340958304985038530495830 + 1
     chatGPT: 4320598340958340958340953095809348509348503480958340958304985038530495830 + 1 equals 4320598340958340958340953095809348509348503480958340958304985038530495831.
The chance of chatGPT having memorized, or even blindly pattern-matched, the next tokens here is too low to even consider. There are so many possible numbers here, including wrong ones that would have a "higher probability" of being close to the truth from a token/edit-distance standpoint. It's safe to say, from a scientific standpoint, that chatGPT in this scenario understands what it means to add 1.

Realize that this number is far too large to fit in a native integer type, so no simple machine-arithmetic lookup will do. chatGPT needs symbolic understanding to perform the feat it did above.

But there are, of course, things it gets wrong. And again, we don't truly understand what's going on here. Is it lying to us? Perhaps it can't differentiate between a merely generated statistical token and an actual math equation. It's hard to say. But from the example above, by probability, we know that some aspect of true understanding and ability exists.
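For scale, the number in that exchange doesn't even fit in a 64-bit integer, which is easy to check (a quick sketch in Python, whose bignums handle it natively):

```python
# The number from the exchange above is far bigger than any native machine
# integer, so the model can't be leaning on a memorized arithmetic table
# sized to ordinary hardware words.
n = 4320598340958340958340953095809348509348503480958340958304985038530495830
print(n > 2**64)   # True: nowhere near fitting in 64 bits
print(n + 1)       # ends in ...831, matching chatGPT's answer
```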

HALtheWise|2 years ago

I don't see enough discussion of the fact that LLMs are actually trained with two losses: text prediction and a regularization loss of some sort that effectively encourages the network to use "simple" internal structure. That means the training process isn't only trying to predict the next token, it's specifically trying to find the simplest explanation that predicts the next token.

Given that the history of science is mostly driven by trying to find the simplest explanation for observed phenomenon, thinking about regularization makes it much less surprising that LLMs end up learning how the world "actually works".
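A minimal sketch of a combined objective like the one described above, assuming plain L2 weight decay and made-up numbers:

```python
import math

# Sketch of a training objective with two parts: next-token cross-entropy
# plus an L2 penalty that nudges the network toward "simpler" (smaller)
# weights. All numbers here are invented for illustration.

def cross_entropy(probs, target_index):
    # Negative log-likelihood of the token that actually came next.
    return -math.log(probs[target_index])

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

probs = [0.1, 0.7, 0.2]      # model's distribution over a tiny 3-token vocab
weights = [0.5, -1.2, 0.3]   # a few of the model's parameters
lam = 0.01                   # regularization strength (hyperparameter)

loss = cross_entropy(probs, target_index=1) + l2_penalty(weights, lam)
print(round(loss, 4))        # prediction term dominates; penalty adds a little
```

In practice this usually enters through the optimizer (e.g. weight decay) rather than as an explicit second loss term, but the effect on the objective is the same shape.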

mxkopy|2 years ago

> Yet, they demonstrate a crucial point: a deeper understanding of reality simplifies next-token prediction tasks.

I'm not sure LLMs are trained to simplify anything. They have billions of parameters after all.

dTal|2 years ago

They "simplify" the training data, which they are vastly smaller than. LLMs are like compression algorithms. You could imagine feeding the training data back in, letting it guess the next token, and entropy coding the residual - this would result in an excellent compression ratio. This compression performance is a direct consequence of abstract features of the dataset that it has managed to encode - knowing that the capital of France is Paris allows you to make predictions about many sentences, not just "The capital of France is...".
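The arithmetic behind the compression view is simple: an entropy coder spends about -log2(p) bits on a token the model assigned probability p. A hypothetical model that knows the fact spends almost nothing on "Paris":

```python
import math

# Compression view of prediction: total bits needed is the sum of -log2(p)
# over the tokens that actually occurred, where p is the probability the
# model assigned to each. The probabilities below are invented for
# illustration.

def bits_to_encode(token_probs):
    return sum(-math.log2(p) for p in token_probs)

# Per-token probabilities a model might assign to the actual continuation
# "The capital of France is Paris" -- identical except on the final token.
knows_fact = [0.2, 0.3, 0.25, 0.4, 0.5, 0.98]   # near-certain about "Paris"
no_fact    = [0.2, 0.3, 0.25, 0.4, 0.5, 0.001]  # "Paris" is a surprise

print(bits_to_encode(knows_fact))   # roughly 10 bits cheaper
print(bits_to_encode(no_fact))
```

Every fact the model encodes pays for itself across all the sentences it helps predict, which is why better "world knowledge" and better compression go together.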

braindead_in|2 years ago

It's so mind-boggling to think that our everyday reality can be encoded as weights and biases in a giant matrix. Maybe we are just weights and biases.

freecodyx|2 years ago

The main thing about LLMs, in my opinion, is the tokenization part: words are already clustered and converted into numbers (vectors), which is already a big deal. We are using learned weights; the attention part feels like a brute-force approach to learning how those vectors are likely to be used together (with positional encoding added as extra information).

Statistics on large amounts of data just seems to work, after all.

sanxiyn|2 years ago

This is wrong; byte-level models work fine, even if not as well as word-level models. From comparisons of byte-level and word-level models, we know the tokenization part is responsible for only a minuscule share of performance.

courseofaction|2 years ago

Intuitively, I think this also hints at why LLMs get more prone to confusion when trained to be "safe" - the underlying representations for applying human morality in context are much more complex to learn than simpler but potentially psychopathic logic.

clarge1120|2 years ago

This sounds correct. Humans are highly fickle and contradictory when it comes to morality. Even the Golden Rule is hotly contested. LLMs lose touch with reality as they try to navigate humanity’s moral landscape. Our current solution is to align an LLM to a worldview.

The good news is that this will pit one LLM against others, and virtually eliminate any potential for a single powerful AI to emerge and do something harmful.

sandsnuggler|2 years ago

Why do people keep saying it's good at math, when we have no clue about the training data, and all they do is insert some examples in an unscientific way into a program we know nothing about, not even whether it's one system or several?

IIAOPSW|2 years ago

Because the language we are teaching it is sophisticated enough to embed a Turing Machine?

kypro|2 years ago

I'm not sure I'm personally convinced LLMs are bad at arithmetic, I think they might just approach it differently to us.

Something you'll find if you ever train a neural network to learn a mathematical function is that it will only ever approximate that function. It won't try to guess what the function is exactly like a human might do.

For example consider, f(1) = 2, f(2) = 4, f(3) = 6, f(4) = 8, f(5) = 10.

As a human you know how important precision is in maths, and you know humans generally like round numbers, so you naturally assume that f(x) = 2x.

Neural networks don't have these biases by default. They'll look for a function that gets close enough, maybe something like f(x) = 1.993929910302942223x.

From a neural network's perspective the loss between this answer and the actual answer is almost so trivial that it's basically irrelevant.

Then a human who likes round numbers comes along and asks the network: what's f(1,000)? To which the neural network replies, 1993.9.

Then the human goes away convinced the AI doesn't know maths, when in reality the AI basically does know maths; it just doesn't care as much about arithmetic precision as the human does. Because again, to the AI, 1993.9 is a perfectly acceptable answer.

So now for fun let me ask ChatGPT some arithmetic questions...

> ME

> what's 2343423 + 9988733?

> ChatGPT

> The sum of 2343423 and 9988733 is 12392156.

WRONG! It's actually 12332156. That's an entire digit out and almost 0.5% larger than the actual answer!

> ME

> what is 8379270 + 387299177?

> ChatGPT

> The sum of 8379270 and 387299177 is 395678447.

Er, okay, that was right. Bad example, let me try again.

> ME

> what is 2233322223333 + 387299177?

> ChatGPT

> The sum of 2233322223333 and 387299177 is 2233322610510.

WRONG! It's actually 2233709522510. That's 6 digits out and almost 0.02% smaller than the actual answer!

If you take a more open-minded view, I think it's fair to say ChatGPT basically does know arithmetic, but its reward function probably didn't prioritise arithmetic precision the way a decade of schooling does for us humans. For ChatGPT, having a few digits wrong in an arithmetic problem is probably less important than its reply containing that sum being slightly improperly worded.

I guess what I'm saying is that I'm not sure I quite agree with the author that LLMs don't do arithmetic at all. It's not that they're trying to guess the next word without arithmetic, but more that they're not doing arithmetic the same way we humans do it. Which may have been the point the author was making... I'm not really sure.
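The over- and under-shooting described above can be reproduced in miniature: fit y = 2x with a single weight and a deliberately small number of gradient steps, and the learned slope lands near, but not exactly on, 2. A toy sketch, not a claim about how ChatGPT does arithmetic:

```python
# Gradient descent on squared error for y = 2x with one weight, stopped
# early. The learned slope is "good enough" by the loss, but off by human
# standards, and extrapolation magnifies the error.

data = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
w = 0.0
lr = 0.01

for _ in range(20):  # deliberately few steps
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(w)         # close to 2, but not exactly 2
print(w * 1000)  # asked for f(1000), the tiny error becomes visible
```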

SkyPuncher|2 years ago

LLMs are bad at math because they don't actually understand the rules of math.

They can write code to do math, but without code they can only estimate how likely a series of numbers are to be seen together.

They're very likely to get things like 2+2=4 correct because that's probably common in their training data. They're unlikely to get two random numbers correct because they don't actually know what those numbers mean.

Tainnor|2 years ago

The exactness matters, though. Unless you'd like things like encryption to stop working.