Difficult one. GSM8K and MATH evals (both reported in the Reasoning Gym paper) are common in smaller-model RL papers for a reason: smaller models can get decent scores on them, unlike on fresher and harder benchmarks.
Part of the aim of RG, though, is to serve as a difficulty-adjustable, non-repeating eval, so if people think it's a good benchmark, perhaps it will allow this status quo to shift!
olliestanley|9 months ago
i5heu|9 months ago
Prejudice is a form of overfitting, IMHO.
t55|9 months ago