top | item 42999573


quantumspandex | 1 year ago

Andrej's video is great, but the explanation of the RL part is a bit vague to me. How exactly do we train on the right answers? Do we collect the reasoning traces and train on them like supervised learning, or do we compute some scores and use them as a loss function? Isn't the reward then very sparse? What if LLMs can't generate any right answers because the problems are too hard?

Also, how can the training of LLMs be parallelized when parameter updates are sequential? Sure, we can train on several samples simultaneously, but the parameter updates are all computed with respect to the parameters from the previous step.
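(For what it's worth, the usual answer to the parallelization point is data parallelism: the update steps are indeed sequential, but within one step the gradients for many samples are computed independently, in parallel, against the same current parameters, then averaged into a single update. A minimal sketch with a toy linear model, all values illustrative:)

```python
# Sketch of data-parallel training: sequential updates across steps,
# but the per-sample gradients inside one step are independent and
# could be computed on separate devices, then averaged.
import numpy as np

rng = np.random.default_rng(0)

# Toy model y = w * x with squared-error loss (illustrative, not an LLM).
w = 0.0
true_w = 3.0
lr = 0.1

def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2 * (w * x - y) * x

for step in range(100):
    # A "parallel" batch: every sample sees the SAME current w,
    # so the per-sample gradients can be computed simultaneously...
    xs = rng.normal(size=32)
    ys = true_w * xs
    grads = [grad(w, x, y) for x, y in zip(xs, ys)]
    # ...followed by one sequential update with the averaged gradient.
    w -= lr * float(np.mean(grads))

print(round(w, 3))  # converges toward true_w = 3.0
```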


fenomas | 1 year ago

As I understood that part, in RL for LLMs you take questions for which the model already sometimes emits correct answers, then repeatedly run inference while reinforcing the outputs the model produced during correct responses, which lets it evolve its own ways of reaching the right answer more reliably.

(Hence the analogy to training AlphaGo, wherein you take a model that sometimes wins games, and then play a bunch of games while reinforcing the cases where it won, so that it evolves its own ways of winning more often.)
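(That "reinforce the correct responses" loop can be sketched in a few lines. Below, the "model" is just a softmax over candidate answers, the grader is a string comparison, and the update is REINFORCE with a 0/1 reward, so only correct rollouts move the parameters. Everything here is an illustrative toy, not anyone's actual training setup:)

```python
# Toy REINFORCE loop: sample an answer, grade it, and nudge probability
# mass toward answers that were graded correct. With a 0/1 reward,
# all-wrong rollouts contribute nothing -- which is the sparsity the
# parent comment worries about.
import numpy as np

rng = np.random.default_rng(0)

answers = ["12", "15", "18", "21"]   # candidate completions (toy)
correct = "18"                        # graded by a checker, no human trace needed
logits = np.zeros(len(answers))       # "model" starts out uniform
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(200):
    p = softmax(logits)
    i = rng.choice(len(answers), p=p)           # one rollout
    reward = 1.0 if answers[i] == correct else 0.0
    # REINFORCE: grad of log p_i w.r.t. the logits is (one_hot_i - p).
    one_hot = np.eye(len(answers))[i]
    logits += lr * reward * (one_hot - p)

print(answers[int(np.argmax(logits))])  # prints 18
```

(Note the loop only ever learns anything because the initial model already samples the correct answer sometimes, which is exactly why RL here needs a capable base model and well-chosen problem difficulty.)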

quantumspandex | 1 year ago

AlphaGo seems more like an automated process to me, because you can start from nothing except the algorithm and the rules. Since a Go game has only two outcomes most of the time, and the model can play against itself, it is guaranteed to learn something during self-play.

In the LLM case you have to start from an already capable model to do RL. Also, I feel like the problem-selection part is important, to make sure the problems aren't too hard. So there's still a lot of human labor involved.