(no title)
lennxa | 1 year ago
how practical do you think grpo is? (for most people)
here are my thoughts:
- grpo starts off slow, with super small loss (likely because the rewards on all observations are the same)
- as you mentioned, some sft on reasoning data ought to help speed things up
- unless you're a lab with a gazillion gpus, wouldn't you be better off taking your non-reasoning dataset and converting it into a high quality reasoning dataset using frontier models (maybe deepseek)? could grpo be cheaper or give better accuracy?
- maybe you do tons of sft, and once you've reached the frontier models' perf on your task, grpo could help with more exploration
would be great to hear your thoughts
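to make the first point concrete, here is a minimal sketch of why identical rewards give almost no learning signal. it assumes the commonly described grpo advantage, where each sampled completion's reward is normalized by the mean and standard deviation of its group. the function name, the `eps` value, and the toy rewards are made up for illustration, not taken from any particular library.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group (assumed GRPO-style form).

    eps is a hypothetical small constant that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Early in training: every sampled completion fails, so all rewards match.
same = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Later: one completion in the group succeeds.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

print(same)   # every advantage is zero, so the policy term contributes nothing
print(mixed)  # the successful completion gets a positive advantage, the rest negative
```

if the advantages are all zero, the policy-gradient part of the loss vanishes for that group, which would fit the tiny loss early on. a bit of sft first makes it more likely that at least one sample per group earns a different reward.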
danielhanchen | 1 year ago