
OpenAI claims gold-medal performance at IMO 2025

479 points | Davidzheng | 8 months ago | twitter.com | reply

698 comments

[+] famouswaffles|8 months ago|reply
From Noam Brown

https://x.com/polynoamial/status/1946478258968531288

"When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is."

and

"This was a small team effort led by @alexwei_ . He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community."

[+] gsf_emergency_2|8 months ago|reply
"frontier" seems to be "zapad" for OpenAI

They have a parallel effort to corner Ramanujan called https://epoch.ai/frontiermath/tier-4

(& Problem 6, combinatorics, the one class of problems that hasn't yet fallen to AI?)

The hope for humanity is that, of the big names associated with FrontierMath (starkly opposite to OpenAI proper), Daniel is the one youngish non-ex-Soviet guy :)

[+] yahoozoo|8 months ago|reply
That brand new technique? Training on the test data. /s
[+] rafael859|8 months ago|reply
Interesting that the proofs seem to use a limited vocabulary: https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...

Why waste time say lot word when few word do trick :)

Also worth pointing out that Alex Wei is himself a gold medalist at IOI.

[+] throw310822|8 months ago|reply
Interesting observation. On one hand, these resemble more the notes that an actual participant would write while solving the problem. Also, fewer words = less noise, more focus. But also, specifically for LLMs that output one token at a time and have a limited token context, I wonder if limiting itself to semantically meaningful tokens can create longer stretches of semantically coherent thought?
[+] beyonddream|8 months ago|reply
He is talking about IMO (math olympiad) while he got gold at IOI (informatics olympiad) :)
[+] NitpickLawyer|8 months ago|reply
> Also worth pointing out that Alex Wei is himself a gold medalist at IOI.

Terence Tao also called it in a recent podcast, predicting that the top LLMs would get gold this year.

[+] torginus|8 months ago|reply
In transformers, generating each token takes the same amount of time, regardless of how much meaning it carries. By cutting the filler out of the text, you get a huge speedup.
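A minimal sketch of that point: if decode time is roughly constant per token, output time scales with token count, so the speedup from terser text is just the token ratio. (Whitespace splitting stands in for a real tokenizer here; the example strings are mine, purely illustrative.)

```python
# Assumption: autoregressive decoding costs ~constant time per token,
# so total output time is proportional to token count.

verbose = ("We now proceed to observe that, as a consequence of the "
           "preceding discussion, it follows that n must be even.")
terse = "Thus n is even."

def token_count(text: str) -> int:
    # Crude proxy for a tokenizer: split on whitespace.
    return len(text.split())

def estimated_speedup(long: str, short: str) -> float:
    # With constant time per token, speedup is just the token ratio.
    return token_count(long) / token_count(short)

print(token_count(verbose), token_count(terse))        # 20 4
print(f"~{estimated_speedup(verbose, terse):.1f}x fewer decode steps")
```

A real tokenizer (BPE etc.) would give different counts, but the ratio is what matters for the argument.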
[+] johnpaulkiser|8 months ago|reply
Are you saying "see the world?" or "seaworld"?
[+] lukebechtel|8 months ago|reply
whoah, very very interesting / telling.
[+] tlb|8 months ago|reply
I encourage anyone who thinks these are easy high-school problems to try to solve some. They're published (including this year's) at https://www.imo-official.org/problems.aspx. They make my head spin.
[+] mauriziocalo|8 months ago|reply
Related — these videos give a sense of how someone might actually go about thinking through and solving these kinds of problems:

- A 3Blue1Brown video on a particularly nice and unexpectedly difficult IMO problem (2011 IMO, Q2): https://www.youtube.com/watch?v=M64HUIJFTZM

-- And another similar one (though technically Putnam, not IMO): https://www.youtube.com/watch?v=OkmNXy7er84

- Timothy Gowers (Fields Medalist and IMO perfect scorer) solving this year’s IMO problems in “real time”:

-- Q1: https://www.youtube.com/watch?v=1G1nySyVs2w

-- Q4: https://www.youtube.com/watch?v=O-vp4zGzwIs

[+] bko|8 months ago|reply
I like watching youtube videos solving these problems. They're deceptively simple. I remember reading one:

x+y=1

xy=1

The incredible thing is the explanation uses almost all reasoning steps that I am familiar with from basic algebra, like factoring, quadratic formula, etc. But it just comes together so beautifully. It gives you the impression that if you thought about it long enough, surely you would have come up with the answer, which is obviously wrong, at least in my case.

https://www.youtube.com/watch?v=csS4BjQuhCc
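The system above can be checked numerically: by Vieta's formulas, x and y are the roots of t^2 - t + 1 = 0, which are complex sixth roots of unity, so powers of x and y cycle. (A quick sketch of my own, not taken from the video.)

```python
# x + y = 1 and xy = 1 means x, y are roots of t^2 - t + 1 = 0,
# i.e. t = (1 ± i*sqrt(3))/2 -- primitive 6th roots of unity.
import cmath

d = cmath.sqrt(1 - 4)          # discriminant sqrt(b^2 - 4ac) = i*sqrt(3)
x = (1 + d) / 2
y = (1 - d) / 2

assert abs(x + y - 1) < 1e-12  # x + y = 1
assert abs(x * y - 1) < 1e-12  # xy = 1
assert abs(x**6 - 1) < 1e-12   # x is a 6th root of unity

# x^n + y^n = 2*cos(n*pi/3) is always real and cycles with period 6:
for n in range(1, 7):
    print(n, round((x**n + y**n).real))  # 1, -1, -2, -1, 1, 2
```

The period-6 cycling is why huge exponents of x and y reduce to tiny computations.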

[+] xpressvideoz|8 months ago|reply
I didn't know there were localized versions of the IMO problems. But now that I think of it, having versions in multiple languages is a must to remove the language barrier for the competitors. I guess having that many language versions (I see ~50 languages?) may make keeping the problems secure considerably harder?
[+] koakuma-chan|8 months ago|reply
How do those compare to leetcode hard problems?
[+] kappi|8 months ago|reply
[flagged]
[+] hislaziness|8 months ago|reply
[+] dlubarov|8 months ago|reply
It's a good point - IMO is about performance under some specific resource constraints, and those constraints don't make sense for AIs. But I wonder how far we are from an AI solving a well-studied unsolved math problem. That would be more of a decisive "quantum supremacy" type milestone.
[+] xpressvideoz|8 months ago|reply
> there will be a proposal at some point to actually have an AI math Olympiad where at the same time as the human contestants get the actual Olympiad problems, AI’s will also be given the same problems, the same time period and the outputs will have to be graded by the same judges, which means that it’ll have be written in natural language rather than formal language.[1]

Last month, Tao himself said that we can compare humans and AIs at the IMO. He even said such an AI didn't exist yet and that AIs wouldn't beat the IMO in 2025. And now that AIs can compete with humans at the IMO under the same conditions Tao mentioned, suddenly it becomes an apples-to-oranges comparison?

[1] https://lexfridman.com/terence-tao-transcript/

[+] darkoob12|8 months ago|reply
He is basically asking OpenAI to publish their methodology so we can understand the real state of AI in solving math problems.
[+] dylanbyte|8 months ago|reply
These are high school level only in the sense of assumed background knowledge, they are extremely difficult.

Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.

This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.

The answers are not in the training data.

This is not a model specialized to IMO problems.

[+] Davidzheng|8 months ago|reply
Are you sure this is not specialized to IMO? I do see the twitter thread saying it's "general reasoning" but I'd imagine they RL'd on olympiad math questions? If not I really hope someone from OpenAI says that bc it would be pretty astounding.
[+] torginus|8 months ago|reply
From my vague remembrance of doing data science years ago, it's very hard not to leak the test set.

Basically how you do RL is that you make a set of training examples of input-output pairs, and set aside a smaller validation set, which you never train on, to check if your model's doing well.

What you do is tweak the architecture and the training set until the model does well on the validation set. By doing so, you inadvertently leak info about the validation set into the model. Perhaps you choose an architecture which happens to do well on the validation set. Perhaps you train more on examples like the ones being validated.

Even without the explicit intent to cheat, this contamination is very hard to avoid: had you chosen a different validation set, you'd have ended up with a different model.
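The mechanism is easy to demonstrate: even when the validation set is never trained on, repeatedly selecting the best candidate by validation score leaks information about it. In the sketch below (my own toy illustration, not anyone's actual pipeline) every "model" guesses at random, yet the selected one looks well above chance.

```python
# Every candidate "model" here is pure noise (true skill = 50%), but
# picking the best of many candidates by validation accuracy overfits
# the validation set itself -- selection is a form of leakage.
import random

random.seed(0)
VAL_SIZE = 50          # held-out binary labels
N_CANDIDATES = 200     # architecture/hyperparameter variants we "try"

val_labels = [random.randint(0, 1) for _ in range(VAL_SIZE)]

def random_model_accuracy() -> float:
    # A model that guesses randomly on the validation set.
    preds = [random.randint(0, 1) for _ in range(VAL_SIZE)]
    return sum(p == y for p, y in zip(preds, val_labels)) / VAL_SIZE

# Select the best candidate by validation score.
best = max(random_model_accuracy() for _ in range(N_CANDIDATES))
print(f"best 'validation accuracy' among noise models: {best:.0%}")
```

With 200 tries over 50 labels, the winner typically scores well above 50% despite having zero real skill, which is exactly why tuning against a fixed validation set quietly contaminates it.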

[+] YeGoblynQueenne|8 months ago|reply
>> This is not a model specialized to IMO problems.

How do you know?

[+] aprilthird2021|8 months ago|reply
> The answers are not in the training data.

> This is not a model specialized to IMO problems.

Any proof?

[+] gniv|8 months ago|reply
From that thread: "The model solved P1 through P5; it did not produce a solution for P6."

It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team, got only 21/42 points on it. On most other teams, nobody solved it.

[+] gus_massa|8 months ago|reply
In the IMO, the idea is that on the first day you get P1, P2 and P3, and on the second day you get P4, P5 and P6. Ordered by difficulty, they are usually P1, P4, P2, P5, P3, P6. So, usually P1 is "easy" and P6 is very hard. At least that is the intended order, but sometimes reality disagrees.

Edit: Fixed P4 -> P3. Thanks.

[+] demirbey05|8 months ago|reply
I think someone from the Canadian team solved it, but among all contestants it's very few
[+] bwfan123|8 months ago|reply
To me, this is a tell of human involvement in the model solution.

There is no reason why machines would do badly on exactly the problem humans do badly on as well, unless humans were prodding the machine towards a solution.

Also, there is no reason why machines could not produce a partial or wrong answer to problem 6, which suggests survivorship bias to me, i.e., that only correct solutions were cherry-picked.

[+] modeless|8 months ago|reply
Noam Brown:

> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.

I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.

https://x.com/polynoamial/status/1946478249187377206

[+] johnecheck|8 months ago|reply
Wow. That's an impressive result, but how did they do it?

Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.

[+] demirbey05|8 months ago|reply
Progress is astounding. A report was recently published evaluating LLMs on IMO 2025; o3 high didn't even get bronze.

https://matharena.ai/imo/

Waiting for Terry Tao's thoughts, but these kinds of things are a good use of AI. We need to make science progress faster rather than disrupting our economy before we're ready.

[+] nmca|8 months ago|reply
It’s interesting that this is a competition elite enough that several posters on a programming website don’t seem to understand what it is.

My very rough napkin math suggests that against the US reference class, IMO gold is literally a one-in-a-million talent (very roughly 20 people who make camp could get gold, out of very roughly twenty million relevant high schoolers).

[+] meroes|8 months ago|reply
In the RLHF sphere you could tell some AI company/companies were targeting this because of how many IMO RLHF'ers they were hiring specifically. Given that, I don't think it's easy to say how much "progress" this really is.
[+] z7|8 months ago|reply
Some previous predictions:

In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.

He thought there was an 8% chance of this happening.

Eliezer Yudkowsky said "at least 16%".

Source:

https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...

[+] quirino|8 months ago|reply
I think equally impressive is the performance of the OpenAI team at the "AtCoder World Tour Finals 2025" a couple of days ago. There were 12 human participants and only one did better than OpenAI.

Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.

[+] ksec|8 months ago|reply
I am neither an optimist nor a pessimist about AI; I would likely be called both by the opposing parties. But the fact that AI / LLMs are still rapidly improving is impressive in itself and worth celebrating. Is it perfect, AGI, ASI? No. Is it useless? Absolutely not.

I am just happy the prize is so big for AI that there is enough money involved to push for all the hardware advancement. Foundry, packaging, interconnect, networking, etc.: all the hardware research and tech improvements previously thought too expensive are now in the "shut up and take my money" scenario.

[+] mehulashah|8 months ago|reply
The AI scaling that went on for the last five years is going to be very different from the scaling that will happen in the next ten years. These models have latent capabilities that we are racing to unearth. IMO is but one example.

There's so much to do at inference time. This result could not have been achieved without the substrate of general models. It's not like Go or protein folding: you need the collective public global knowledge of society to build on. And yes, there's enough left for ten years of exploration.

More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.

[+] gitfan86|8 months ago|reply
This is such an interesting time because the percentage of people making predictions about AGI happening in the future is going to drop off, and the number of people completely ignoring the term AGI will increase.
[+] reactordev|8 months ago|reply
The Final boss was:

   Which is greater, 9.11 or 9.9?

/s

I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics so this aligns.

[+] amelius|8 months ago|reply
If someone told me this, say, 10 or 20 years ago, I would have assumed it was worthy of a Nobel/Turing prize ...