oldsecondhand | 1 year ago
It would be cool to have logic based modeling jobs, even if the goal is just to feed the LLMs.
andoando | 1 year ago
People keep calling them "next token predictors", but clearly there is something more going on, and I would love for someone to give a simple explanation.
famouswaffles | 1 year ago
Next token prediction is the objective function. The model is asked to predict the next word, yes, but it's also allowed to compute the answer, and more importantly, the entire training process is supposed to be the model learning and figuring out what sort of computations aid prediction of the corpus it's trained on.
If your corpus is language A followed by the translation in Language B then there's little choice but for the model to learn computations that translate as loss goes down.
If your corpus is chess moves, then again, it's going to have to learn how to compute chess games to reduce loss.
You can see this with toy models trained on toy problems. Example: a tiny transformer trained on addition examples (x + y = z) learning an algorithm for addition.
https://cprimozic.net/blog/reverse-engineering-a-small-neura...
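This isn't the model from the link, but a minimal sketch of how such addition examples turn into next-token training data (assuming a hypothetical character-level tokenization):

```python
# Sketch: how an "x + y = z" example becomes next-token training pairs.
# A toy transformer would be trained to predict the second element of
# each pair given the first; the tokenization here is hypothetical.

def next_token_pairs(example: str):
    """Yield (context, next_char) pairs for one training string."""
    return [(example[:i], example[i]) for i in range(1, len(example))]

pairs = next_token_pairs("12+34=46")
# The model only gets credit for each next character, but to get the
# characters after '=' right it has to actually compute the sum.
for ctx, nxt in pairs:
    print(repr(ctx), "->", repr(nxt))
```

The point is that "predict the next character" on strings like these is only solvable, in general, by learning the underlying addition algorithm.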
"Pick the right word" is not a trivial exercise for the vast majority of text data.
And again, because people often make this mistake: an LLM's ultimate objective is NOT to produce "text that looks right" but "text that is right". Of course, "right" as determined by the training corpus, but basically any time it picks a wrong word is an opportunity for the model to learn, and learn it does.
drdeca | 1 year ago
I think this depends on what you mean by "something more going on".
Now, if someone says that it is "just" "next token prediction", in a dismissive way, I think that's an error.
But, while the RLHF'd ones aren't trained purely to match the observed distribution (they're trained with the RLHF objective instead), it is nonetheless true that the model produces a probability distribution over possible next tokens, conditioned on the previous tokens, and samples from it. (I suppose there are also things done as part of the sampling on top of these conditional probabilities, beyond just sampling according to the probabilities at a given temperature; I don't know how that part works in detail. But I think that's mostly a trick to get a little more quality, and not a major part of how the model behaves. It's not part of the NN itself in any case.)
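For the sampling step, here's a rough sketch of what temperature (plus an optional top-k filter, one common sampling trick) can look like; this is an illustration, not any particular system's implementation:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token index from raw logits.

    temperature < 1 sharpens the distribution toward the most likely
    token; top_k (if given) zeroes out everything outside the k
    highest-scoring tokens before sampling.
    """
    if top_k is not None:
        # keep only the top_k highest logits, mask the rest
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    scaled = [l / temperature for l in logits]
    # numerically stable softmax
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]
```

At very low temperature this reduces to always picking the argmax (greedy decoding); at temperature 1 it samples from the model's own conditional distribution.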
HarHarVeryFunny | 1 year ago
Starting from a point of outputting random gibberish, the only feedback these models are given during training is whether their next-word prediction was right or wrong (i.e. whether it matched the next word in the training sample they were being fed). So, calling these models "next word predictors" is technically correct from that point of view: this is their only "goal" and the only feedback they are given.
Of course, what these models can accomplish, reflecting what they have learnt, is way more impressive than what one might naively expect from such a modest goal.
The simple, usual, and rather inadequate explanation for this mismatch between training goal and capability is that in order to get really, REALLY good at "predict next word", you need to learn to understand the input extremely well. If the input is "1+2=" then the model needs to have learnt math to predict the next word and get it right. If the input is a fairy tale, then it needs to learn to recognize that, and learn how to write fairy tales.
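To make that feedback concrete, here's a toy sketch (hypothetical five-token vocabulary, made-up probabilities) of the only training signal the model gets for the "1+2=" example: the cross-entropy on the correct next token.

```python
import math

def next_token_loss(probs, target_index):
    """Cross-entropy: the training signal is just -log p(correct next token)."""
    return -math.log(probs[target_index])

# Hypothetical vocabulary and model outputs for the context "1+2=".
vocab = ["0", "1", "2", "3", "4"]
confident = [0.01, 0.01, 0.01, 0.96, 0.01]   # a model that computes the sum
uncertain = [0.2, 0.2, 0.2, 0.2, 0.2]        # a model guessing uniformly

target = vocab.index("3")
print(next_token_loss(confident, target))  # ~0.04
print(next_token_loss(uncertain, target))  # ~1.61
```

Gradient descent pushes the model toward the low-loss behavior, and for inputs like this the low-loss behavior is "actually do the arithmetic".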
This is how these LLMs' "predict next word" goal turns into a need for them to learn "everything about everything" in order to minimize their training error.
The question of course then becomes: how do they do it? We are training them on pretty much everything on the internet, so there's plenty to learn from, but we're only giving them some extremely limited feedback ("no, that's not the correct next word"), so what magic is inside them that lets them learn so well?!
Well, the magic is a "transformer", a specific (and surprisingly simple) neural network architecture, but this is pretty much where the explanation ends. It's relatively easy to describe what a transformer does, e.g. learning which parts of its input to pay attention to when predicting the next word, and doing this in a very flexible way using "keys" that it learns and can search for in the input, but it is extremely hard to explain how this mechanism lets it learn what it does. Interpreting what is really going on inside a transformer is an ongoing research area.
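The "keys" mechanism being described is scaled dot-product attention. A minimal sketch (single head, plain Python, no learned projections, which a real transformer would have):

```python
import math

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over a short sequence.

    Each position's query is scored against every key; the softmax of
    those scores gives the weights that decide which input positions to
    'pay attention to', and the output is the weighted mix of values.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # softmax over scores (numerically stable)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted mix of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

When a query lines up strongly with one key, the output at that position is essentially that key's value vector; the "flexible search" is just this dot-product matching, repeated across layers.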
I think maybe the best that can be said is that the transformer's designers stumbled upon an extremely powerful and general type of sequence processor (I'm not sure they predicted ahead of time how powerful it would be), one that appears to be very well matched to how we ourselves generate and recognize language. Maybe there is some insight to be gained there about how our own brains work.
ijk | 1 year ago
It's not a priority for the current big model architecture, but there's a bunch of stuff we could be doing with network architecture.