Extasia785 | 1 year ago
Did you read the post? OpenAI clearly states that the results are cherry-picked. A random query will have far worse results. To get comparable results you need to ask the same query dozens of times and then have enough expertise to pick the best answer, which might be quite hard for a problem you have little idea about.
Combine this with the fact that the blog post is a sales pitch showing the very best test results out of probably many more benchmarks we will never see, and it seems obvious that human experts are still several orders of magnitude ahead.
njndtu | 1 year ago
What is interesting is the following paragraph in the post: "With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy." So they didn't allow sampling from other contest solutions here? If so, that's quite interesting: the model is effectively able, imo, to brute force questions, provided you have some form of validator that can tell it when to halt.
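The brute-force loop described above is basically rejection sampling against a verifier. Here's a toy sketch of that idea; the `sample` and `validator` callables are hypothetical stand-ins (in the real setting they would be "draw a solution from the model" and "run the judge's tests"), and the 10,000 cap mirrors the submission budget quoted from the post:

```python
import random

def brute_force_solve(validator, sample, max_submissions=10_000, seed=0):
    """Repeatedly sample candidate solutions and return the first one
    the validator accepts, or None if the submission budget runs out."""
    rng = random.Random(seed)
    for attempt in range(1, max_submissions + 1):
        candidate = sample(rng)
        if validator(candidate):
            return candidate, attempt  # halted by the validator
    return None, max_submissions  # budget exhausted, no accepted solution

# Toy stand-ins: "solutions" are integers, and the validator
# accepts only multiples of 1000 (a ~0.1% acceptance rate).
found, tries = brute_force_solve(
    validator=lambda c: c % 1000 == 0,
    sample=lambda rng: rng.randrange(1_000_000),
)
```

The point is that the loop needs no selection strategy at all: the validator does all the work, which is exactly why the relaxed-submission score is so much higher.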
I came across one of the IOI questions this year that I had trouble solving (I am pretty noob tho), which made me curious about how these reported results held up. The question in question: https://github.com/ioi-2024/tasks/blob/main/day2/hieroglyphs... Apparently, the model was able to get it partially correct. https://x.com/markchen90/status/1834358725676572777