"When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is."
and
"This was a small team effort led by
@alexwei_
. He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at
@OpenAI
and the wider AI community."
(& Problem 6, combinatorics: the one class of problems that hasn't yet fallen to AI?)
The hope for humanity is that, of the big names associated with FrontierMath (starkly opposite to oAI proper), Daniel is the one youngish non-ex-Soviet guy :)
Interesting observation. On one hand, these resemble more closely the notes an actual participant would write while solving the problem. Also, fewer words = less noise, more focus. But specifically for LLMs, which output one token at a time and have a limited token context, I wonder whether limiting itself to semantically meaningful tokens can create longer stretches of semantically coherent thought?
In transformers, generating each token takes the same amount of time, regardless of how much meaning it carries. By cutting the filler out of the text, you get a huge speedup.
I encourage anyone who thinks these are easy high-school problems to try to solve some. They're published (including this year's) at https://www.imo-official.org/problems.aspx. They make my head spin.
I like watching youtube videos solving these problems. They're deceptively simple. I remember reading one:
x+y=1
xy=1
The incredible thing is the explanation uses almost all reasoning steps that I am familiar with from basic algebra, like factoring, quadratic formula, etc. But it just comes together so beautifully. It gives you the impression that if you thought about it long enough, surely you would have come up with the answer, which is obviously wrong, at least in my case.
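For what it's worth, my guess at the mechanism (an assumption on my part, not necessarily the video's exact route): x and y are the roots of a quadratic, so any power sum x^n + y^n satisfies a two-term recurrence.

```latex
% From x+y=1 and xy=1, x and y are the roots of
%   t^2 - (x+y)\,t + xy \;=\; t^2 - t + 1 \;=\; 0.
% Newton's identity then gives, for p_n = x^n + y^n:
p_n = (x+y)\,p_{n-1} - xy\,p_{n-2} = p_{n-1} - p_{n-2},
\qquad p_0 = 2,\ p_1 = 1
\implies p_2 = -1,\quad p_3 = -2,\quad p_4 = -1,\quad p_5 = 1,\ \dots
```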
I didn't know there were localized versions of the IMO problems. But now that I think of it, having versions in multiple languages is a must to remove the language barrier for competitors. I guess having that many language versions (I see ~50 languages?) may make keeping the problems secure considerably harder?
It's a good point - IMO is about performance under some specific resource constraints, and those constraints don't make sense for AIs. But I wonder how far we are from an AI solving a well-studied unsolved math problem. That would be more of a decisive "quantum supremacy" type milestone.
> there will be a proposal at some point to actually have an AI math Olympiad where at the same time as the human contestants get the actual Olympiad problems, AI's will also be given the same problems, the same time period and the outputs will have to be graded by the same judges, which means that it'll have to be written in natural language rather than formal language.[1]
Last month, Tao himself said that we can compare humans and AIs at the IMO. He even said such an AI didn't exist yet and that AIs wouldn't beat the IMO in 2025. And now that AIs can compete with humans at the IMO under the same conditions that Tao mentioned, suddenly it becomes an apples-to-oranges comparison?
Are you sure this is not specialized to the IMO? I do see the twitter thread saying it's "general reasoning", but I'd imagine they RL'd on olympiad math questions? If not, I really hope someone from OpenAI says so, because it would be pretty astounding.
From my vague remembrance of doing data science years ago, it's very hard to avoid leakage between your training and evaluation sets.
Basically, how you do RL is that you make a set of training examples of input-output pairs, and set aside a smaller validation set, which you never train on, to check whether your model is doing well.
What you do is tweak the architecture and the training set until the model does well on the validation set. By doing so, you inadvertently leak information about the validation set into the model. Perhaps you choose an architecture which happens to do well on the validation set. Perhaps you train more on examples like the ones being validated.
Even without explicit intent to cheat, this contamination is very hard to avoid: if you had chosen a different validation set, you'd have ended up with a different model.
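A toy sketch of the selection loop described above (all numbers and names here are made up for illustration): every candidate "model" below is pure noise, yet picking the best of 100 by validation score makes the validation estimate optimistically biased relative to a fresh split.

```python
import random

def evaluate(model_id: int, split_id: int) -> float:
    """Toy 'accuracy' of a model on a data split: pure noise around 50%."""
    rng = random.Random(model_id * 1000 + split_id)
    return 0.5 + rng.uniform(-0.05, 0.05)

# "Tune": try 100 architecture/hyperparameter tweaks and keep the one
# that scores best on the validation split (split_id=1).
best = max(range(100), key=lambda m: evaluate(m, split_id=1))

val_score = evaluate(best, split_id=1)   # the score we optimized against
test_score = evaluate(best, split_id=2)  # a fresh, never-tuned-on split

# val_score is optimistically biased: information about the validation
# split leaked into model selection, even though we never "trained" on it.
print(f"validation: {val_score:.3f}, test: {test_score:.3f}")
```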
Frankly, it looks to me like it's using an AlphaProof-style system, going between natural language and Lean etc. Of course OpenAI will not tell us any of this.
From that thread: "The model solved P1 through P5; it did not produce a solution for P6."
It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team, got only 21/42 points on it. In most other teams, nobody solved it.
In the IMO, the idea is that on the first day you get P1, P2 and P3, and on the second day you get P4, P5 and P6. Ordered by difficulty, they are usually P1, P4, P2, P5, P3, P6. So P1 is usually "easy" and P6 is very hard. At least that is the intended order, but sometimes reality disagrees.
To me, this is a tell of human involvement in the model's solution.
There is no reason why machines would do badly on exactly the problem that humans do badly on as well, without humans prodding the machine towards a solution.
Also, there is no reason why machines could not produce a partial or wrong answer to Problem 6, which looks like survivorship bias to me, i.e., only correct solutions were cherry-picked.
> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.
> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.
> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.
I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.
Wow. That's an impressive result, but how did they do it?
Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10,000 times in parallel and cherry-picked the best one, this is a lot less exciting.
If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard-to-verify tasks'.
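For what it's worth, the "run many in parallel and cherry-pick" scheme is usually called best-of-n sampling. A minimal sketch, where generate() and score() are hypothetical stand-ins for a model and a verifier (not anything OpenAI has described):

```python
import random

def generate(problem: str, rng: random.Random) -> int:
    # Stand-in for sampling one candidate solution from a model.
    return rng.randint(0, 100)

def score(candidate: int) -> int:
    # Stand-in for a verifier/grader that ranks candidates.
    return candidate

def best_of_n(problem: str, n: int = 10_000, seed: int = 0) -> int:
    # Sample n candidates independently and keep the highest-scoring one.
    rng = random.Random(seed)
    return max((generate(problem, rng) for _ in range(n)), key=score)

print(best_of_n("IMO 2025 P1"))
```

The catch the parent alludes to: the more candidates you draw, the more the final score reflects the selection process rather than any single run.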
Waiting for Terry Tao's thoughts, but these kinds of things are a good use of AI. We need to make science progress faster rather than disrupting our economy before we're ready.
It’s interesting that this is a competition elite enough that several posters on a programming website don’t seem to understand what it is.
My very rough napkin math suggests that against the US reference class, IMO gold is literally a one-in-a-million talent (very roughly 20 people who make camp could get gold, out of very roughly twenty million relevant high schoolers).
In the RLHF sphere you could tell some AI companies were targeting this, because of how many IMO RLHF'ers they were hiring specifically. I don't think it's really easy to say how much "progress" this is, given that.
I think equally impressive is the performance of the OpenAI team at the "AtCoder World Tour Finals 2025" a couple of days ago. There were 12 human participants and only one did better than OpenAI.
I am neither an optimist nor a pessimist about AI. I would likely be called both by the opposing parties. But the fact that AI / LLMs are still rapidly improving is impressive in itself and worth celebrating. Is it perfect, AGI, ASI? No. Is it useless? Absolutely not.
I am just happy the prize for AI is so big that there is enough money involved to push for all the hardware advancement. Foundry, packaging, interconnect, networking, etc.: all the hardware research and tech improvements previously thought too expensive are now in the "shut up and take my money" scenario.
The AI scaling that went on for the last five years is going to be very different from the scaling that will happen in the next ten years. These models have latent capabilities that we are racing to unearth. IMO is but one example.
There's so much to do at inference time. This result could not have been achieved without the substrate of general models. It's not like Go or protein folding: you need the collective public global knowledge of society to build on. And yes, there's enough left for ten years of exploration.
More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.
This is such an interesting time, because the percentage of people making predictions about AGI happening in the future is going to drop off, while the number of people completely ignoring the term AGI will increase.
I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics so this aligns.
famouswaffles|8 months ago
https://x.com/polynoamial/status/1946478258968531288
gsf_emergency_2|8 months ago
They have a parallel effort to corner Ramanujan called https://epoch.ai/frontiermath/tier-4
unknown|8 months ago
[deleted]
yahoozoo|8 months ago
rafael859|8 months ago
Why waste time say lot word when few word do trick :)
Also worth pointing out that Alex Wei is himself a gold medalist at IOI.
throw310822|8 months ago
beyonddream|8 months ago
NitpickLawyer|8 months ago
In a recent podcast, Terence Tao also called it: that the top LLMs would get gold this year.
torginus|8 months ago
johnpaulkiser|8 months ago
lukebechtel|8 months ago
tlb|8 months ago
mauriziocalo|8 months ago
- A 3Blue1Brown video on a particularly nice and unexpectedly difficult IMO problem (2011 IMO, Q2): https://www.youtube.com/watch?v=M64HUIJFTZM
-- And another similar one (though technically Putnam, not IMO): https://www.youtube.com/watch?v=OkmNXy7er84
- Timothy Gowers (Fields Medalist and IMO perfect scorer) solving this year’s IMO problems in “real time”:
-- Q1: https://www.youtube.com/watch?v=1G1nySyVs2w
-- Q4: https://www.youtube.com/watch?v=O-vp4zGzwIs
bko|8 months ago
https://www.youtube.com/watch?v=csS4BjQuhCc
xpressvideoz|8 months ago
koakuma-chan|8 months ago
selfselfgo|8 months ago
[deleted]
kappi|8 months ago
hislaziness|8 months ago
mananaysiempre|8 months ago
dlubarov|8 months ago
xpressvideoz|8 months ago
[1] https://lexfridman.com/terence-tao-transcript/
darkoob12|8 months ago
unknown|8 months ago
[deleted]
dylanbyte|8 months ago
Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.
This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.
The answers are not in the training data.
This is not a model specialized to IMO problems.
Davidzheng|8 months ago
torginus|8 months ago
YeGoblynQueenne|8 months ago
How do you know?
aprilthird2021|8 months ago
> This is not a model specialized to IMO problems.
Any proof?
AIPedant|8 months ago
E.g here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig
demirbey05|8 months ago
[deleted]
ktallett|8 months ago
[deleted]
gniv|8 months ago
gus_massa|8 months ago
Edit: Fixed P4 -> P3. Thanks.
demirbey05|8 months ago
bwfan123|8 months ago
demirbey05|8 months ago
https://x.com/natolambert/status/1946569475396120653
OAI announced early; we will probably hear an announcement from Google soon.
modeless|8 months ago
https://x.com/polynoamial/status/1946478249187377206
johnecheck|8 months ago
demirbey05|8 months ago
https://matharena.ai/imo/
nmca|8 months ago
meroes|8 months ago
z7|8 months ago
In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.
He thought there was an 8% chance of this happening.
Eliezer Yudkowsky said "at least 16%".
Source:
https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...
quirino|8 months ago
Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.
ksec|8 months ago
Philpax|8 months ago
https://xcancel.com/OpenAI/status/1946594928945148246
https://xcancel.com/OpenAI/status/1946594933470900631
mehulashah|8 months ago
gitfan86|8 months ago
reactordev|8 months ago
amelius|8 months ago