top | item 45416669

crthpl | 5 months ago

The reason they get a perfect score on AIME is that every AIME question had a lot of thought put into it, and care was taken to ensure each one actually has a clear, achievable answer. SWE-bench, and many other AI benchmarks, have lots of eval noise: tasks with no clear right answer, so scoring above a certain percentage means you are benchmaxxing.

mbesto | 5 months ago

> SWE-bench, and many other AI benchmarks, have lots of eval noise

SWE-bench has lots of known limitations, even with measures in place to reduce solution leakage and overfitting.

> where there is no clear right answer

This is both a feature and a bug. If there is no clear answer, then how do you determine whether an LLM has progressed? It can't simply be judged on producing "more right answers" with each release.

mrshu | 5 months ago

Do you think a messier math benchmark (in terms of how it is defined) might be harder for these models to score well on?