
sadboots | 9 months ago

for the love of god, please stop overfitting on gsm8k

Difficult one. GSM8K and MATH evals (both reported in the Reasoning Gym paper) are common in smaller-model RL papers for a reason: smaller models can get decent scores on them, unlike on fresher & harder benchmarks.

Part of the aim of RG is to be used as a difficulty-adjustable & non-repeating eval, though, so if people think it's a good benchmark, perhaps it will allow this status quo to shift!
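
To make "difficulty-adjustable & non-repeating" concrete, here's a rough sketch of the general idea: questions are generated from a seed and a difficulty knob instead of being read from a fixed file, so there is no static test set to memorize. To be clear, this is not the Reasoning Gym API; `make_task` and `score` are made-up names for illustration, and the arithmetic task is just a toy stand-in.

```python
import random


def make_task(seed: int, difficulty: int) -> dict:
    """Build one toy arithmetic question (hypothetical helper, not RG code).

    difficulty controls both the number of operands (difficulty + 1)
    and their size (up to 10 ** difficulty).
    """
    rng = random.Random(seed)  # same seed + difficulty -> same question
    n_terms = difficulty + 1
    upper = 10 ** difficulty
    terms = [rng.randint(1, upper) for _ in range(n_terms)]
    ops = [rng.choice("+-") for _ in range(n_terms - 1)]

    expr = str(terms[0])
    answer = terms[0]
    for op, term in zip(ops, terms[1:]):
        expr += f" {op} {term}"
        answer = answer + term if op == "+" else answer - term
    return {"question": f"What is {expr}?", "answer": str(answer)}


def score(model_fn, seeds, difficulty: int) -> float:
    """Exact-match accuracy of model_fn (question -> answer string)."""
    tasks = [make_task(seed, difficulty) for seed in seeds]
    correct = sum(model_fn(t["question"]) == t["answer"] for t in tasks)
    return correct / len(tasks)
```

Drawing a fresh range of seeds gives a fresh eval set each run, and raising `difficulty` makes it harder without anyone having to write new questions by hand.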

i5heu|9 months ago

It looks like your neural network is overfitted on seeing overfitting where there is none.

Prejudice is a form of overfitting, IMHO.

t55|9 months ago

agreed, the RG evals feel like a breath of fresh air