If the base models already have the “reasoning” capability, as the authors claim, then it’s not surprising that they reached SOTA with a relatively negligible amount of compute for RL fine-tuning.
I love this sort of “anti-hype” research. We need more of it.