(no title)
ferris-booler | 8 hours ago
An enormous amount of research+eng work (most of the work of frontier labs) is being poured into making that 'correct' modifier happen, rather than just predicting the next token from 'the internet' (naive original training corpus). This work takes the form of improved training data (e.g. expert annotations), human-feedback finetuning (e.g. RLHF), and most recently reinforcement learning (e.g. RLVR, meaning RL with verifiable rewards), where the model is trained to find the correct answer to a problem without 'token-level guidance'. RL for LLMs is a very hot research area and very tricky to solve correctly.
No comments yet.