minimaltom | 1 month ago
1. Pre-training nowadays does not use just the next word/token as a training signal: some setups also predict the next N tokens (multi-token prediction), which appears to teach the model more generalized semantics and also bias it toward 'thinking ahead' behaviors (give me some rope here, I don't remember precisely how it should be articulated).
2. Regularizers during training, namely weight decay and (to a lesser extent) diversity. These do way more heavy lifting than their simplicity gives them credit for; they are the difference between memorizing entire paragraphs from a book and taking away only the core concepts.
3. Expert performance at non-knowledge tasks is mostly driven by RL and/or SFT over 'high quality' transcripts. The former, at least in terms of learning signal, cannot be described as 'predicting the next word'.
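To make point 1 concrete, here is a toy numpy sketch of a multi-token prediction loss: one set of logits per future offset, with cross-entropy averaged over the N offsets (so N=1 recovers plain next-token loss). The shapes and the `multi_token_loss` name are mine for illustration, not any particular lab's recipe.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_loss(logits, targets):
    """Cross-entropy averaged over N future-token heads.

    logits:  (N, vocab) -- head i holds logits for the token i+1 steps ahead
    targets: (N,)       -- the actual next N token ids
    """
    probs = softmax(logits)
    n = len(targets)
    # mean negative log-likelihood across the N offsets
    return -np.mean(np.log(probs[np.arange(n), targets]))

# toy check: 3 prediction heads over a 5-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([1, 4, 2])
loss = multi_token_loss(logits, targets)
```

With N=1 this reduces to the ordinary next-token cross-entropy; the extra heads are what supply the 'look ahead' pressure during training.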
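For point 2, weight decay in its decoupled (AdamW-style) form is just a multiplicative shrink toward zero applied alongside the gradient step, so weights that the data does not keep pushing on drift back toward zero. A minimal sketch (`sgd_step_decoupled` and the hyperparameters are my own placeholders):

```python
import numpy as np

def sgd_step_decoupled(w, grad, lr=0.1, weight_decay=0.01):
    # decoupled weight decay: shrink the weights toward zero
    # separately from the gradient step (AdamW-style)
    w = w * (1.0 - lr * weight_decay)
    return w - lr * grad

w0 = np.array([2.0, -3.0])
w = w0.copy()
grad = np.zeros_like(w)  # no data signal at all
for _ in range(100):
    w = sgd_step_decoupled(w, grad)
# with zero gradient, decay alone drives |w| down each step
```

This is the mechanism behind the 'memorize vs. abstract' trade-off: parameters only stay large if the gradient signal from the data continually re-earns them.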
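On point 3, the difference in learning signal can be shown side by side: next-token cross-entropy needs a per-position correct token, while a REINFORCE-style RL loss only receives a scalar reward for the whole sampled sequence. A toy sketch, with function names mine and no claim that any given lab's RL objective looks exactly like this:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def next_token_loss(logits, target_ids):
    # supervised signal: a known correct token at every position
    p = softmax(logits)[np.arange(len(target_ids)), target_ids]
    return -np.log(p).mean()

def reinforce_loss(logits, sampled_ids, reward):
    # RL signal: the model's own sampled tokens, reinforced or
    # suppressed in proportion to one scalar reward for the sequence
    logp = np.log(softmax(logits)[np.arange(len(sampled_ids)), sampled_ids])
    return -reward * logp.sum()

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 6))   # 4 positions, vocab of 6
sampled = np.array([0, 3, 5, 2])   # a completion sampled from the model
good = reinforce_loss(logits, sampled, reward=1.0)
bad = reinforce_loss(logits, sampled, reward=-1.0)
```

Flipping the reward's sign flips the direction of the update, which is exactly why this signal isn't 'predict the next word': the same tokens can be pushed up or down depending on how the sequence scored.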