top | item 39335871


patelajay285 | 2 years ago

Yep, DPO is not technically "RL"; it implicitly uses the LLM itself as the reward model, and training with DPO is far more stable for that reason.
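A minimal sketch of the DPO objective for one preference pair, showing what "the LLM itself as the reward model" means: the implicit reward of a completion is beta * (log pi(y|x) - log pi_ref(y|x)), computed from the policy being trained and a frozen reference copy. The scalar log-probabilities, the beta value, and the function names here are illustrative assumptions, not the paper's code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    Inputs are per-sequence log-probabilities under the policy being
    trained and under a frozen reference model. beta=0.1 is an
    arbitrary illustrative choice.
    """
    # Implicit rewards: the policy's log-prob shift away from the reference
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Bradley-Terry-style logistic loss: push the chosen completion's
    # implicit reward above the rejected one's
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# Toy numbers: the policy already slightly prefers the chosen answer
loss = dpo_loss(logp_chosen=-4.0, logp_rejected=-6.0,
                ref_logp_chosen=-5.0, ref_logp_rejected=-5.0)
```

No separate reward network is ever trained: the gradient flows through the same log-probabilities used for generation, which is why the update stays a plain supervised-style gradient step rather than an RL rollout.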


espadrine | 2 years ago

DPO is as close to RL as RLHF. The latter also uses the LLM as a reward model.

I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Still, what the code does isn't what is described in the paper that the page links to.

nextaccountic | 2 years ago

> I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Isn't this just because reinforcement learning and supervised learning are both optimization problems?

patelajay285 | 2 years ago

I tend to agree, @espadrine; it's semantics for the most part.