Another problem with the title: the article is about DPO, which doesn’t do reinforcement learning, so it's not RLHF. I guess RLHF has more name recognition than DPO.
This was discussed in another comment: DPO is pretty much strictly better than RLHF + PPO, and far more stable in training. Yes, DPO is not technically "RL", but that's mostly semantics. DataDreamer does support PPO training if you want it, but PPO is so unstable that it's a less popular choice now.
In the DPO paper linked from the OP page, DPO is described as "a simple RL-free algorithm for training language models from preferences." So as you say, "not technically RL."
Given that, shouldn't the first sentence on the linked page end with "...in a process known as DPO (...)"? Ditto for the title.
It sounds like you're saying that the terms RL and RLHF should subsume DPO because they both solve the same problem, with similar results. But they're different techniques, and there are established terms for both of them.
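For anyone wondering what "RL-free" means concretely, here is a minimal sketch of the per-pair DPO loss as given in the DPO paper. It is a plain classification-style loss on log-probabilities, with no sampling, reward model, or policy-gradient step. The function and argument names are hypothetical, and the inputs are assumed to be summed sequence log-probabilities from the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    All inputs are assumed to be summed log-probabilities of the full
    response under the policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch on the
    # sign of margin so exp() never sees a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.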
> DPO is pretty much strictly better than RLHF + PPO
Out of genuine curiosity, do you have any pointers or evidence to support this? I know that some of the industry-leading research labs haven't switched over to DPO yet, even though DPO is significantly faster than RLHF. It might just be organizational inertia, but I don't know. I would be very happy if simpler alternatives like DPO were as good as RLHF or better, but I haven't seen that proof yet.
janalsncm|2 years ago
patelajay285|2 years ago
antonvs|2 years ago
vvrm|2 years ago