t55 | 9 months ago
> How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?
It's because in those cases, RLVR merely elicits reasoning strategies the model already picked up during pre-training.
This paper, which uses Reasoning Gym, shows that you need to train for much longer than the papers you mentioned did to actually uncover novel reasoning strategies: https://arxiv.org/abs/2505.24864