I'd also be highly wary of the method they used because of statements like this:
>we note that the vast majority of its answers simply stated the final answer without additional justification
While the reasoning steps are obviously important for judging human participants' answers, none of the current big-game providers disclose their actual reasoning tokens. So unless they got direct internal access to these models from the big companies (which seems highly unlikely), this might be yet another failed study (of which we have seen several in recent months, even by serious parties).
The last time someone claimed a medal in an olympiad like this, it turned out they had manually translated the problem into Lean and then run a brute-force search algorithm to find a proof. For 60 hours. On a supercomputer.
Meanwhile high schoolers get a piece of paper and 4.5 hours.
raincole|7 months ago
This OP claims the publicly available models all failed to get Bronze.
An OpenAI tweet claims there is an unreleased model that can get Gold.
sigmoid10|7 months ago
dmitrygr|7 months ago
bgwalter|7 months ago
We'll never know how many GPUs and other assistance (like custom code paths) this model got.
untitled2|7 months ago
JohnKemeny|7 months ago
changoplatanero|7 months ago
kenjackson|7 months ago