lukasego | 11 months ago

No, DPO avoids a reinforcement learning training loop. For the current iteration on verifiable domains, our method is GRPO. Let me elaborate: DPO is for preference learning - each sample in the dataset contains two pieces, a preferred and a non-preferred response (what the model should avoid generating), and DPO optimizes the model toward the preferred one. That makes DPO an effective method for teaching a model sentiment or preference; we call a generalization of this "alignment mode", and it's on our roadmap. On the current GRPO implementation side, the dataset needs on Augento are simpler: just the prompt, plus some captured context if you like - it's then the reward function that scores the model's generations. With GRPO, training is currently done on verifiable domains: not preference pairs, but a single output that gets judged by a deterministic reward function or by a reward model (the user decides which, by defining the reward function).
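
A minimal illustrative sketch (not Augento's actual API - the names and fields here are made up) of how the two dataset shapes differ: a DPO sample carries a preferred/rejected pair, while a GRPO-style sample is just a prompt plus optional context, with scoring delegated to a reward function you define:

    # DPO-style sample: the model is optimized to prefer "chosen" over "rejected".
    dpo_sample = {
        "prompt": "Summarize this ticket in one sentence.",
        "chosen": "User reports login failures after the 2.3 update.",
        "rejected": "The user wrote a long message about many things.",
    }

    # GRPO-style sample: just the prompt (plus captured context if you like);
    # judging the generation is delegated to a reward function.
    grpo_sample = {"prompt": "What is 17 * 24?", "context": {"expected": "408"}}

    def reward(prompt: str, completion: str, context: dict) -> float:
        # Deterministic check for a verifiable domain: exact match on the answer.
        return 1.0 if context["expected"] in completion else 0.0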

(EDIT: Would you use DPO? Do you have experience with it, or a need for it?)

lukasego | 11 months ago

To add, there is an important distinction to be made between RLHF (Reinforcement Learning from Human Feedback) and RL. DPO is a simpler and more efficient way to do RLHF. In its current iteration, Augento does RL (using the term coined by OpenAI: reinforcement fine-tuning), which improves model performance on domains where a verification function exists for the answer and can be used for scoring, rather than requiring a preferred answer the way DPO does. But as said, such a preference mode is on the roadmap.
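
Again just a sketch under my own assumptions, not the product's internals: in GRPO-style reinforcement fine-tuning, several completions are sampled per prompt, each one is scored by the verification function, and advantages are computed relative to the group, so no preferred answer ever has to be stored in the dataset:

    from statistics import mean, pstdev

    def group_advantages(rewards: list[float]) -> list[float]:
        # GRPO-style: each sampled completion's reward is normalized
        # against the other completions generated for the same prompt.
        mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard against zero spread
        return [(r - mu) / sigma for r in rewards]

    # e.g. 4 completions for one prompt, scored 1.0 by the verifier if correct:
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]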