minimaltom | 1 month ago
1. Pre-training nowadays does not use just the next word/token as a training signal: some setups also predict the next N tokens (multi-token prediction), which appears to teach the model more generalized semantics and also bias it toward 'thinking ahead' behaviors (give me some rope here, I don't remember precisely how it should be articulated).
2. Regularizers during training, namely weight decay and (to a lesser extent) diversity. These do way more heavy lifting than their simplicity gives them credit for; they are the difference between memorizing entire paragraphs from a book and taking away only the core concepts.
3. Expert performance at non-knowledge tasks is mostly driven by RL and/or SFT over 'high quality' transcripts. The former, at least in terms of learning signal, cannot be described as 'predicting the next word'.
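To make point 1 concrete, here is a toy numpy sketch of a multi-token prediction loss: one set of logits per future offset, with cross-entropy averaged over the N offsets (so N=1 recovers plain next-token loss). The shapes and the `multi_token_loss` name are mine for illustration, not any particular lab's recipe.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_loss(logits, targets):
    """Cross-entropy averaged over N future-token heads.

    logits:  (N, vocab) -- head i holds logits for the token i+1 steps ahead
    targets: (N,)       -- the actual next N token ids
    """
    probs = softmax(logits)
    n = len(targets)
    # mean negative log-likelihood across the N offsets
    return -np.mean(np.log(probs[np.arange(n), targets]))

# toy check: 3 prediction heads over a 5-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([1, 4, 2])
loss = multi_token_loss(logits, targets)
```

With N=1 this reduces to the ordinary next-token cross-entropy; the extra heads are what supply the 'look ahead' pressure during training.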
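For point 2, weight decay in its decoupled (AdamW-style) form is just a multiplicative shrink toward zero applied alongside the gradient step, so weights that the data does not keep pushing on drift back toward zero. A minimal sketch (`sgd_step_decoupled` and the hyperparameters are my own placeholders):

```python
import numpy as np

def sgd_step_decoupled(w, grad, lr=0.1, weight_decay=0.01):
    # decoupled weight decay: shrink the weights toward zero
    # separately from the gradient step (AdamW-style)
    w = w * (1.0 - lr * weight_decay)
    return w - lr * grad

w0 = np.array([2.0, -3.0])
w = w0.copy()
grad = np.zeros_like(w)  # no data signal at all
for _ in range(100):
    w = sgd_step_decoupled(w, grad)
# with zero gradient, decay alone drives |w| down each step
```

This is the mechanism behind the 'memorize vs. abstract' trade-off: parameters only stay large if the gradient signal from the data continually re-earns them.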
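On point 3, the difference in learning signal can be shown side by side: next-token cross-entropy needs a per-position correct token, while a REINFORCE-style RL loss only receives a scalar reward for the whole sampled sequence. A toy sketch, with function names mine and no claim that any given lab's RL objective looks exactly like this:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def next_token_loss(logits, target_ids):
    # supervised signal: a known correct token at every position
    p = softmax(logits)[np.arange(len(target_ids)), target_ids]
    return -np.log(p).mean()

def reinforce_loss(logits, sampled_ids, reward):
    # RL signal: the model's own sampled tokens, reinforced or
    # suppressed in proportion to one scalar reward for the sequence
    logp = np.log(softmax(logits)[np.arange(len(sampled_ids)), sampled_ids])
    return -reward * logp.sum()

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 6))   # 4 positions, vocab of 6
sampled = np.array([0, 3, 5, 2])   # a completion sampled from the model
good = reinforce_loss(logits, sampled, reward=1.0)
bad = reinforce_loss(logits, sampled, reward=-1.0)
```

Flipping the reward's sign flips the direction of the update, which is exactly why this signal isn't 'predict the next word': the same tokens can be pushed up or down depending on how the sequence scored.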