
317070|5 days ago

As an expert in the field: this is exactly right.

LLMs are trained to do whole-book prediction: at training time we feed in whole books at a time. It's only when sampling that we generate one or a few tokens at a time.
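(A minimal sketch of the distinction above, using a toy linear next-token predictor as a stand-in for a real transformer; all names and sizes here are illustrative, not any particular model's API. Training scores predictions for every position of the sequence in one parallel pass, while sampling queries the model one token at a time.)

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16
# Toy "model": embedding matrix E and output projection W
# (a stand-in for a transformer; no attention, just the shapes).
E = rng.normal(size=(vocab, d)) * 0.1
W = rng.normal(size=(d, vocab)) * 0.1

def logits_for(tokens):
    # One row of logits per input position, computed in a single pass.
    return E[tokens] @ W                      # shape: (len(tokens), vocab)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# --- Training view: the whole "book" at once (teacher forcing). ---
book = rng.integers(0, vocab, size=200)       # a toy token sequence
probs = softmax(logits_for(book[:-1]))        # next-token predictions for ALL positions in parallel
loss = -np.log(probs[np.arange(len(book) - 1), book[1:]]).mean()

# --- Sampling view: one token at a time. ---
seq = [int(book[0])]
for _ in range(10):
    p = softmax(logits_for(np.array(seq)))[-1]  # only the last position's distribution is used
    seq.append(int(rng.choice(vocab, p=p)))
```

The training loss touches all 199 positions in one matrix multiply; the sampling loop needs a full forward pass per generated token.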


justinator|5 days ago

where do you get these books?

honking intensifies

WHERE DO YOU GET THESE BOOKS?!

tasuki|5 days ago

The local library.

benterix|5 days ago

We do things, but it doesn't feel right

fc417fc802|5 days ago

Can anyone even say what a book really is at the end of the day? It's such an abstract concept. /s

TuringTest|5 days ago

Isn't that the same as compressing the whole book, in a special differential format that compares how the text looks from any given point before and after?

317070|5 days ago

There are many ways to describe how the model works in simpler terms. Next-word prediction is a useful characterisation of how you do inference with the model. Maximizing mutual information, compression, gradient descent, ... are all useful characterisations of the training process.

But as stated above, next-token prediction is a misleading frame for the training process. While sampling does indeed happen one token at a time, much more is going on in the latent space, where the model carries its internal stream of information.
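(The compression characterisation above can be made concrete. By Shannon's source-coding argument, an ideal arithmetic coder driven by a model spends -log2(p) bits on a token the model assigned probability p, so the model's total cross-entropy on a text is exactly its compressed size. The probabilities below are made up for illustration, not from any real model.)

```python
import math

# Hypothetical probabilities a model might assign to the actual next
# token at four successive positions in a text.
token_probs = [0.9, 0.5, 0.25, 0.8]

# Ideal code length per token is -log2(p) bits, so the sum of the
# model's per-token cross-entropies IS the compressed size in bits.
bits = sum(-math.log2(p) for p in token_probs)

# A better model puts higher probability on the tokens that actually
# occur, which directly means a smaller compressed file.
```

Here `bits` comes out to about 3.47, versus the roughly `4 * log2(vocab_size)` bits a uniform model would need for the same four tokens.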

margalabargala|4 days ago

Everything is the same as everything else. It's all just hydrogen and time mixed together.