top | item 47163289


Isn't that the same as compressing the whole book, in a special differential format that compares how the text looks from any given point before and after?


317070|5 days ago

There are many ways to describe how the model works in simpler terms. Next-word prediction is a useful characterization of how you do inference with the model. Maximizing mutual information, compression, gradient descent, and others are all useful characterizations of the training process.

But as stated above, next-token prediction is a misleading frame for the training process. Sampling does indeed happen one token at a time, but because of how the model is trained, much more is going on in the latent space, where the model carries its internal stream of information.
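
To make the compression framing concrete, here is a minimal sketch, assuming a toy character-level bigram model as a stand-in for a real language model. The helper names (`train_bigram`, `next_char_prob`, `code_length_bits`) and the sample text are made up for illustration. It only shows that summing -log2 P(next | previous) over a text gives the ideal size, in bits, that an arithmetic coder driven by that predictor would need, so a better next-symbol predictor means a smaller encoding. It says nothing about what happens in the latent space of an actual LLM.

```python
import math
from collections import Counter, defaultdict


def train_bigram(text):
    """Count character bigrams; a toy stand-in for a trained next-token model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def next_char_prob(counts, vocab, prev, nxt):
    """Add-one smoothed P(nxt | prev), so every symbol gets nonzero probability."""
    row = counts[prev]
    return (row[nxt] + 1) / (sum(row.values()) + len(vocab))


def code_length_bits(counts, vocab, text):
    """Sum of -log2 P(next | prev): ideal coding cost of the text after its first character."""
    return sum(
        -math.log2(next_char_prob(counts, vocab, prev, nxt))
        for prev, nxt in zip(text, text[1:])
    )


# Hypothetical sample text, repeated so the bigram counts dominate the smoothing.
text = "the cat sat on the mat. the cat sat on the hat. " * 20
vocab = sorted(set(text))
model = train_bigram(text)

model_bits = code_length_bits(model, vocab, text)
# Baseline: a predictor that knows nothing and assigns equal probability to every symbol.
uniform_bits = (len(text) - 1) * math.log2(len(vocab))

print(f"bigram predictor:  {model_bits:.0f} bits")
print(f"uniform predictor: {uniform_bits:.0f} bits")
```

The training loss of a language model is the same kind of quantity (average negative log-probability of the next token), which is why "compressing the text" and "minimizing next-token loss" describe the same objective from two angles.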

margalabargala|5 days ago

Everything is the same as everything else. It's all just hydrogen and time mixed together.