item 46919573

tedsanders | 23 days ago

If you don't believe me, that's fair enough. Some pieces of evidence that might update you or others:

- a member of the team who worked with this eval has left OpenAI and now works at a competitor; if we cheated, he would have every incentive to whistleblow

- cheating on evals is fairly easy to catch and risks destroying employee morale, customer trust, and investor appetite; even if you're evil, the cost-benefit doesn't really pencil out to cheat on a niche math eval

- Epoch made a private held-out set (albeit at a different difficulty level); OpenAI's performance on that set doesn't suggest any cheating or overfitting

- Gemini and Claude have since achieved similar scores, suggesting that a ~40% score is not, by itself, evidence of improper access to the problem set

- The vast majority of evals are open-source (e.g., SWE-bench Pro Public), so OpenAI, along with everyone else, has access to their problems and the opportunity to cheat; FrontierMath isn't even unique in that respect
