No - DPO avoids a reinforcement learning (RL) training loop altogether. For the current iteration on verifiable domains, our method is GRPO.
Let me elaborate: DPO is for preference learning. Each sample in the dataset contains two pieces: a preferred response and a non-preferred response (what the model should avoid generating). DPO optimizes the model toward the preferred of the two, which makes it an effective method for teaching a model sentiment or preference. We call a generalization of this "alignment mode" - it's on our roadmap.
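For concreteness, here's a minimal sketch of what one DPO preference sample and the standard DPO loss (Rafailov et al., 2023) look like - the field names and log-prob values are illustrative, not Augento's actual schema:

```python
import torch
import torch.nn.functional as F

# One preference sample: a prompt plus a preferred ("chosen") and
# a non-preferred ("rejected") response. Field names are illustrative.
sample = {
    "prompt": "Summarize the attached report in one paragraph.",
    "chosen": "The report finds that ...",      # preferred response
    "rejected": "lol idk, reports are boring",  # response to avoid
}

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss, given the summed token log-probs of each
    response under the trained policy and a frozen reference model."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Minimizing -logsigmoid maximizes the margin by which the
    # policy prefers "chosen" over "rejected".
    return -F.logsigmoid(logits).mean()

# Made-up log-prob values, just to show the call:
loss = dpo_loss(torch.tensor(-12.3), torch.tensor(-15.1),
                torch.tensor(-13.0), torch.tensor(-14.8))
```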
On the current GRPO implementation, the dataset needs on Augento are simpler: just the prompt, plus some captured context if you like. It's then the reward function that scores the model's generations - see the sketch below.
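A minimal sketch of that setup: a prompt-only dataset entry and a deterministic reward function scoring a group of sampled generations, GRPO-style. The schema, the reward criterion, and the group-normalization step are illustrative of standard GRPO, not necessarily Augento's exact implementation:

```python
import re
import statistics

# A GRPO-style dataset entry: just the prompt, plus optional captured
# context. No preferred/rejected pair needed. (Illustrative schema.)
entry = {
    "prompt": "Solve: what is 17 * 24? Answer with just the number.",
    "context": {"source": "math-eval", "difficulty": "easy"},
}

def reward(prompt: str, generation: str) -> float:
    """Deterministic reward for a verifiable domain: 1.0 if the
    generation contains the correct answer (408), else 0.0."""
    return 1.0 if re.search(r"\b408\b", generation) else 0.0

# GRPO samples a group of generations per prompt and scores each one;
# the advantages are the group-normalized rewards.
generations = ["408", "The answer is 408.", "about 400", "408?"]
rewards = [reward(entry["prompt"], g) for g in generations]
mean, std = statistics.mean(rewards), statistics.pstdev(rewards)
advantages = [(r - mean) / (std + 1e-6) for r in rewards]
```

The reward function is the only thing the user has to define; swapping it for a learned reward model changes how generations are scored, not the dataset format.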
Currently, with GRPO, training is done on verifiable domains: not preference pairs, but a single piece of output judged either by a deterministic reward function or by a reward model - the user decides which, through how they define the reward function. (EDIT: Would you use DPO? Do you have experience with it, or needs it would serve?)