espadrine|2 years ago

I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Still, what the code does isn't what is described in the paper that the page links to.

nextaccountic|2 years ago

> I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Isn't this just because reinforcement learning and supervised learning are both optimization problems?

patelajay285|2 years ago

DPO is as close to RL as RLHF is. The latter also uses the LLM as a reward model.
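For context on why DPO reads like supervised learning: the DPO paper (Rafailov et al.) reduces preference optimization to a per-pair logistic loss on log-probability ratios between the policy and a frozen reference model. A minimal sketch (the function name and the toy log-probabilities below are illustrative, not from the paper's code):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss falls below log(2).
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.5)
```

The whole objective is a binary classification loss on (chosen, rejected) pairs, which is why the RL/SL line gets blurry here: the "reward" only appears implicitly, as the scaled log-ratio of policy to reference probabilities.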