> For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

Did you read the post? OpenAI clearly states that the results are cherry-picked. A single random query will give far worse results. To get equal results you need to ask the same query dozens of times and then have enough expertise to pick the best answer, which might be quite hard for a problem that you have little idea about.
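For concreteness, here is a rough sketch of what "sample many, submit 50" selection could look like. This is only a guess at the shape of it, not OpenAI's actual pipeline: `Candidate`, `pass_rate`, `select_submissions`, and the `learned_score` callback are all hypothetical names, and the ranking order (public tests first, then generated tests, then the learned score) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Test = Tuple[int, int]  # (input, expected output); toy stand-in for a real test case


@dataclass
class Candidate:
    """Hypothetical wrapper: a sampled program, modeled here as a plain function."""
    name: str
    solve: Callable[[int], int]


def pass_rate(candidate: Candidate, tests: Sequence[Test]) -> float:
    """Fraction of tests the candidate passes; a crash counts as a failure."""
    if not tests:
        return 0.0
    passed = 0
    for inp, expected in tests:
        try:
            if candidate.solve(inp) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)


def select_submissions(
    candidates: Sequence[Candidate],
    public_tests: Sequence[Test],
    generated_tests: Sequence[Test],
    learned_score: Callable[[Candidate], float],
    k: int = 50,
) -> List[Candidate]:
    """Rank candidates and keep the top k (assumed ordering, see lead-in)."""
    def key(c: Candidate) -> Tuple[float, float, float]:
        return (
            pass_rate(c, public_tests),
            pass_rate(c, generated_tests),
            learned_score(c),
        )
    return sorted(candidates, key=key, reverse=True)[:k]
```

The point of the sketch is that the picking is automated: no human expert chooses among the samples, but the selection step still needs test cases and a scoring function to lean on.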

Combine this with the fact that this blog post is a sales pitch showing the very best test results out of probably many more benchmarks we will never see, and it seems obvious that human experts are still several orders of magnitude ahead.

njndtu|1 year ago

When I read that line I was very confused too lol. I interpreted it as them basically taking other contestants' submissions and letting the model see these "solutions" as part of its context, and then having the model generate its own "solution" to be used for the benchmark? I fail to see how this is "solving" an IOI-level question.

What is interesting is the following paragraph in the post: "With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy." So they didn't allow sampling from other contest solutions here? If that is the case, it's quite interesting, since the model is effectively, imo, able to brute-force questions, provided you have some form of validator able to tell it when to halt.
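To make the "brute force plus a validator" idea concrete, here is a toy sketch. It assumes a hypothetical `judge` callback that returns a score for each submission and a `full_score` of 100; it also ignores how IOI actually aggregates subtask scores, so treat it as an illustration of the loop, not of the real grading.

```python
from typing import Callable, Iterable, Tuple, TypeVar

C = TypeVar("C")


def best_score_with_budget(
    candidates: Iterable[C],
    judge: Callable[[C], float],  # hypothetical grader: submission -> score
    budget: int,
    full_score: float = 100.0,
) -> Tuple[float, int]:
    """Submit candidates in order until the budget runs out or one gets full marks.

    Returns (best score seen, number of submissions used).
    """
    best = 0.0
    used = 0
    for cand in candidates:
        if used >= budget:
            break
        used += 1
        best = max(best, judge(cand))
        if best >= full_score:
            break  # the "validator" tells us to halt
    return best, used
```

With a budget of 50 the order of candidates matters a lot, which is where a selection strategy earns its points; with a budget of 10,000 the loop can afford to try nearly everything.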

I came across one of the IOI questions this year that I had trouble solving (I am pretty noob tho), which made me curious about how the reported results play out on it. The question is https://github.com/ioi-2024/tasks/blob/main/day2/hieroglyphs... Apparently, the model was able to get it partially correct: https://x.com/markchen90/status/1834358725676572777