top | item 44157594

t55 | 9 months ago

> prolonged RL training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling

Does this mean that previous RL papers claiming the opposite were possibly bottlenecked by small datasets?

yorwba | 9 months ago

No. They do not point to any specific examples of novel reasoning strategies that were uncovered, nor is their sampling that extensive (at most 256 samples, vs. the 2048 used in https://limit-of-rlvr.github.io/ ).

grad62304977 | 8 months ago

It seems unreasonable to say that in Figure 5, for example, more sampling (of a reasonable amount) would push the base model to 100%.

t55 | 9 months ago

So you think it's fake news? Another example of a paper making strong claims without much evidence?