
thinkzilla | 6 months ago

While the post uses DPO to illustrate RL and RLHF, in fact DPO is an alternative to RLHF that does not use RL. See the abstract of the DPO paper https://arxiv.org/abs/2305.18290, and Figure 1 in the paper: "DPO optimizes for human preferences while avoiding reinforcement learning".
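For the curious, the point in that Figure 1 caption is that DPO's core is just a supervised loss over preference pairs, with no reward model and no RL loop. A minimal sketch of the per-example loss from Eq. 7 of the paper (variable names are mine, not from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023, Eq. 7).

    Inputs are total log-probabilities of the chosen/rejected responses
    under the policy being trained and under a frozen reference policy.
    """
    # Implicit reward: beta times the log-ratio against the reference model.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Bradley-Terry form: -log sigmoid of the chosen-minus-rejected margin.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the trained policy favors the chosen response more strongly than the reference does, the margin is positive and the loss is small; gradient descent on this directly shifts probability mass toward preferred responses.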

The confusion is understandable. The definition of RL in the Sutton/Barto book extends over two chapters iirc, and after reading it I did not see how it differed from other learning methods. Studying some of the academic papers cleared things up.


krackers | 6 months ago

I think there was a quote from Karpathy saying that RLHF isn't actually "true" RL. As an armchair observer, even after trying to understand it, RLHF always seemed roundabout to me. You don't have some open-ended environment; you already have a fixed set of preferences. Instead of directly optimizing the model against those preferences as DPO does, RLHF goes out of its way to train value/reward networks encoding them and then optimizes against those. I assumed it was done this way for performance, stability, or some other math-heavy reason; it was good to see that my suspicion was not off-base.
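To make the indirection concrete: the usual RLHF pipeline first fits a separate reward model on the same preference pairs using a pairwise (Bradley-Terry) loss, and only then runs PPO against that learned reward. A rough sketch of just the reward-model objective (function and variable names are mine):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss for fitting an RLHF reward model: push the scalar
    reward of the human-preferred response above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

The loss drops as the preferred response's reward rises above the other, so the fitted model ends up encoding the fixed preference data, and the RL step then optimizes the policy against this proxy rather than against the preferences themselves.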