I used to work at a drug discovery startup. A simple model generating directly from latent space 'discovered' some novel interactions that none of our medicinal chemists had noticed: it started biasing toward a distribution of molecules that was totally unexpected to us.
Our chemists were split: some argued it was an artifact; others dug deeper and offered reasoning for why the generations were sound. Keep in mind, this was a non-reasoning, very early-stage model with simple feedback mechanisms for structure and molecular properties.
In the wet lab, the model turned out to be right. That was five years ago. My point is that the moment that arrived for our chemists will soon arrive for theoreticians.
A lot of interesting possibilities lie in latent space. For those unfamiliar, the latent space is the underlying set of variables that drives everything else you observe.
For instance, you can put a thousand temperature sensors in a room, which gives you 1,000 temperature readouts. But those readouts are correlated, and if you project them down to latent space (using PCA or PLS if the relationships are linear, or projection onto manifolds if nonlinear), you end up with maybe 4 new latent variables (usually linear combinations of the original variables) that describe all the sensor readings; it's a kind of compression. All you have to do then is control those 4 variables, not 1,000.
In chemical space, there are thousands of possible combinations of process conditions and mixtures that produce certain characteristics, but when you project them down to latent variables, there are usually fewer than 10 that determine the properties you want. So if you want to create a new chemical, all you have to do is target those few variables. You want a new product with particular characteristics? Figure out how to drive fewer than 10 variables (not thousands) to their targets, and you have a new product.
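To make this concrete, here is a minimal sketch of both steps, using made-up data and scikit-learn's PCA; the 4 components, the mixing, and the latent targets are all hypothetical, and a real process model would use PLS or a nonlinear method with proper validation.

```python
# Minimal sketch (made-up data): compress 1,000 correlated "sensor" readings
# down to 4 latent variables, then map chosen latent targets back out.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 4))            # 4 underlying driving variables
mixing = rng.normal(size=(4, 1000))            # each sensor is a mix of those factors
readings = factors @ mixing + 0.05 * rng.normal(size=(500, 1000))

pca = PCA(n_components=4)
scores = pca.fit_transform(readings)           # 500 samples x 4 latent coordinates
print(pca.explained_variance_ratio_.sum())     # close to 1: 4 latents capture nearly everything

# "Design" step: choose target latent values, recover a full-dimensional setpoint.
target = np.array([[1.2, -0.5, 0.0, 0.3]])     # hypothetical latent targets
setpoint = pca.inverse_transform(target)       # shape (1, 1000)
```

The point is only that the control problem shrinks from 1,000 knobs to a handful.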
Interesting! Depending on your definition, "automated invention" has been a thing since at least the 1990s. An early success was the evolved antenna [1].
[1] https://en.wikipedia.org/wiki/Evolved_antenna
https://www.economist.com/science-and-technology/2025/07/02/...
My understanding is that iterating on possible sequences (of codons, base pairs, etc.) is exactly what LLMs, these feedback-looped predictor machines, are especially good at. The newest models, those that "reason about" (check) their own output, are even better at it.
Warning: the comment below comes from someone with no formal science degree who just enjoys reading articles on the topic.
Similarly for physicists: there's a very confusing/unconventional antenna called the "evolved antenna" that was flown on a NASA spacecraft. Its design came out of genetic programming. The science of why the antenna's bends at different points increase its gain is still not well understood today.
This all boils down to empirical reasoning, which underlies the vast majority of science (or science adjacent fields like software engineering, social sciences etc).
The question, I guess, is: do LLMs, "AI", and ML give us better hypotheses or tests to run in support of empirical, evidence-based scientific breakthroughs? The answer is yes.
Will these be substantial and meaningful, or create significant improvements over today's approaches? I can't wait to find out!
Hallucinations or inhuman intuition? An obvious mistake made by a flawed machine that doesn't know the limits of its knowledge? Or a subtle pattern, a hundred scattered dots that were never connected by a human mind?
You never quite know.
Right now, it's mostly the former. I fully expect the latter to become more and more common as the performance of AI systems improves.
Ok, but I have to point out something important here. Presumably, the model you're talking about was trained on chemical/drug inputs. So it models a space of chemical interactions, which means its insights could be plausible.
GPT-5 (and other LLMs) are by definition language models, and though they will happily spew tokens about whatever you ask, they don't necessarily have the training data to properly encode the latent space of (e.g.) drug interactions. Confusing these two concepts could be deadly.
A few things to consider:
1. This is one example. How many other attempts did the person make that failed to be useful, accurate, or coherent? The author is an OpenAI employee, IIUC, which raises the question. Sora's demos were amazing until you tried it and realized it took 50 attempts to get a usable clip.
2. The author noted that humans had updated their own research in April 2025 with an improved solution. For cases where we detect signs of superior behavior, we need to start publishing the thought process (reasoning steps, inference cycles, tools used, etc.). Otherwise it's impossible to know whether this used a specialty model, had access to the more recent paper, or otherwise got lucky. Without detailed evidence it's becoming harder to separate legitimate findings from marketing posts (not suggesting this specific case was a pure marketing post).
3. Points 1 and 2 would help with reproducibility, which is important for scientific rigor. If we give Claude the same tools and inputs, will it perform just as well? That would help the community understand whether GPT-5 is novel, or whether the novelty is in how the user is prompting it.
I don't mean to be cynical, but I don't think these points matter as much as you think, at least not in practice. The hardest part of a proof is working out the intermediate steps; joining them up is often trivial, even for a student. So even if it works out a few good steps or finds an effective theorem to apply, and does so only one prompt in a hundred, the time savings can be significant.
I should know: I've been using LLM thinking models to help brainstorm ideas for stickier proofs. They've been more successful at discovering esoteric entry points than I would like to admit.
> This is one example. How many other attempts did the person make that failed to be useful, accurate, or coherent? The author is an OpenAI employee, IIUC, which raises the question. Sora's demos were amazing until you tried it and realized it took 50 attempts to get a usable clip.
If you could combine this with automated theorem proving, it wouldn't matter if it was right only 1 time out of 1,000.
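To sketch what that pipeline could look like (the two helpers below are hypothetical stand-ins, not real APIs):

```python
# Generate-and-verify: a low per-attempt hit rate is fine if every candidate is
# checked mechanically, so only verified proofs ever reach a human.
import random
from typing import Optional

def sample_proof_attempt(statement: str) -> str:
    # Stand-in for an LLM call that drafts a candidate proof.
    return f"candidate proof #{random.randrange(10**6)} for: {statement}"

def formally_verify(statement: str, candidate: str) -> bool:
    # Stand-in for a formal checker (e.g. Lean); here it "accepts" ~0.1% of drafts.
    return random.random() < 0.001

def search_for_proof(statement: str, budget: int = 5000) -> Optional[str]:
    for _ in range(budget):
        candidate = sample_proof_attempt(statement)
        if formally_verify(statement, candidate):
            return candidate           # correctness comes from the checker, not the model
    return None                        # no verified proof within the budget

print(search_for_proof("example statement"))
```

Even at a 0.1% hit rate per attempt, the cost is compute rather than a mathematician's time.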
I don’t get why so many people are resistant to the concept that AI can prove new mathematical theorems.
The entire field of math is fractal-like. There is plenty of low-hanging fruit everywhere. Much of it is rote and not life-changing. A big part of doing "interesting" math is picking what to work on.
A more important test is to give an AI access to the entire history of math and have it _decide_ what to work on, and then judge it for both picking an interesting problem and finding a novel solution.
1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data
2. People glamorize math and feel like advancements in it would "be AGI"
They don't realize that having it generate "new math" is not much harder than having it generate "new programs." Instead of writing something in Python, it's writing something in Lean.
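For anyone who hasn't seen it, "math as code" in Lean really does look like a program: a statement plus a term that the checker accepts or rejects. A toy, entirely standard example (nothing novel):

```lean
-- The Lean kernel checks this mechanically, the way a compiler checks a program.
theorem mul_comm_example (a b : Nat) : a * b = b * a :=
  Nat.mul_comm a b
```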
As others have said, computers already help prove theorems, like the four color theorem. It's not that shocking that LLMs can prove a relative handful of obscure theorems. An "alpha-theorem" type of system (neural-net-directed "brute force" search) will probably also be able to prove some theorems. There is no evidence today that those systems, let alone LLM-type systems, will produce a massive breakthrough in math.
If LLMs were already a breakthrough in proving theorems, even obscure minor ones, there would be a massive increase in published papers, given publish-or-perish academic incentives.
I'm absolutely confident that AI/LLMs can solve things, but you have to sift through a lot of crap to get there. Even further, AI/LLMs tend to solve novel problems in very unconventional ways. It can be very hard to know whether an attempt is doomed or just one step away from magic.
That's not the issue. The issue has always been one of knowledge and epistemology.
This is why the computer-assisted proof of the four-color theorem was such a talking point in math/CS circles: how do you "really" know what was proven? This is slightly different from, say, an advisor who trains their students: you can often sketch out a proof, even though the details require quite a bit of work.
I think a simple way to take the emotion out of this is to ask whether a computer can beat humans at math. The answer to that is pretty much "duh". Symbolic solvers and numerical methods outperform humans by a wide margin and let us reach fundamentally new frontiers in mathematics.
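A trivial illustration of that point: a computer algebra system (SymPy here) dispatches in milliseconds calculations that once took serious human effort.

```python
# Symbolic results a CAS produces instantly but a human only with real work.
import sympy as sp

x = sp.symbols('x')
print(sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo)))  # sqrt(pi)
print(sp.factorint(2**67 - 1))  # {193707721: 1, 761838257287: 1} -- Cole's famous 1903 factorization
```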
But it's a separate question whether this is a good example of that. I think there is a certain dishonesty in the tagline: "I asked a computer to improve on the state of the art and it did!", with a buried footnote that the benchmark wasn't actually state of the art, and that an improved solution was already known (albeit structured a bit differently).
When you're solving already-solved problems, it's hard to avoid bias, even just in how you ask the question and otherwise nudge the model. I see it a lot in my field: researchers publish revolutionary results that, upon closer inspection, work only for their known-outcome test cases and not much else.
Another piece of info we're not getting: why this particular, seemingly obscure problem? Is there something special about it, or is it data dredging (i.e., we tried 1,000 papers and this is the only one where it worked)?
A monkey hammering gibberish on a keyboard can prove new math given sufficient time. That's a low bar to set. The question is if the signal-to-noise ratio is high enough for it to be worthwhile.
I like the idea of letting AI try to formulate new math problems that are interesting, i.e. worthy of research. I guess we are still a number of iterations away from AI getting there, though.
There are more programmers resistant to the concept of AI because of pride.
Programmers take pride in their ability to program, and having their own abilities reduced to an algorithm reproducible by an LLM is both an attack on their pride and an attack on their livelihood.
It's the same reason artists say AI art is utter crap when, in a blind test, they usually can't tell the difference.
Interesting if true, but this isn't the first time we've heard of something like this.
Quanta published an article about a physics lab that asked ChatGPT to help come up with a way to perform an experiment, and ChatGPT "magically" came up with an answer worth pursuing. But what actually happened was that ChatGPT was referencing papers from less famous labs/researchers that had basically gone unread.
It's amazing that ChatGPT can do something like that, but `referencing data` != `deriving theorems`, and the person posting this shouldn't just claim "ChatGPT derived a better bound" in a proof; they should first do a really thorough check of whether this information could simply have ended up in the training data.
> what actually happened was that ChatGPT was referencing papers from less famous labs/researchers that had basically gone unread
Which is actually huge. Reviewing and surfacing all the relevant research out there that we are just not aware of would likely have at least as much impact as some truly novel thing that it can come up with.
If you think of this as a search, retrieval, and "application" problem over the space of convex-optimization proof techniques, it's not a particularly striking result to a mathematician, partly because the space of results/techniques, and crucially of applications of those results and proof techniques, is very rich (it's an active field with many follow-up papers).
On the other hand, I have a collection of unpublished results in less active fields that I've tested every frontier model on (publicly accessible and otherwise), and each time the models have failed to solve them. Some of these are simply reformulations of results in the literature that the models are unable to find or connect, which is what leads me to frame this as a search problem in which the space is not densely populated enough (in terms of activity in these subfields).
Any mathematicians who have actually called it "new interesting mathematics", or just an OpenAI employee?
The paper in question is an arXiv preprint whose first author seems to be an undergraduate. The theorem in it that GPT improves upon is perfectly nice; there are thousands of mathematicians who could have proved it had they been so inclined. AI has already solved much harder math problems than this.
I'm not sure why this is surprising or newsworthy; it has been this way ever since o3. I guess few people noticed.
There are a few masters-level publishable research problems that I have tried with LLMs in thinking mode, and they produced a nearly complete proof before we had a chance to publish. Like the problem stated here, these won't set the world on fire, but they do chip away at more meaningful things.
The model often doesn't produce a completely correct proof (it's a matter of luck whether it nails a perfect one), but it very often does enough that even a less competent student can fill in the blanks and fix the errors. After all, the hardest part of a proof is knowing which tools to employ, especially when those tools can be esoteric.
"Now the only reason why I won't post this as an arxiv note, is that the humans actually beat gpt-5 to the punch :-). Namely the arxiv paper has a v2 arxiv.org/pdf/2503.10138v2 with an additional author and they closed the gap completely, showing that 1.75/L is the tight bound."
I really don't know what to make of this. Is the conclusion that a model could still do this even without the paper containing the exact information on how to do it?
Hypothesis: if you had ~$1M to burn, I think we should try setting up an AI agent to explore and try to invent new mathematics. It turns out agents can get an IMO gold with the production Gemini 2.5 Pro model alone (reference: https://arxiv.org/abs/2507.15855). I therefore suspect a swarm of agents burning through tokens like there's no tomorrow could invent new math.
Alternative: if the Gemini Deep Think or GPT-5 Pro people are listening, I think they should give free access to their models, with appropriate scaffolding (i.e., an agentic workflow), to, say, ~100 researchers to see whether any of them can prove new math with the technology.
"Claim: GPT-5-pro can prove new interesting mathematics"
s/prove/produce/g
I'm inclined to regard an LLM as modelling a collection of fuzzy production rules occurring in a hierarchy of semi-formal systems; the LLM attempts to produce typographically correct theorems, while the proving happens at the level of semantics. Meaning requires a mind to erect an isomorphic mapping, which the LLM is not capable of.
In other words, for the LLM the math is just symbols on a page, arranged according to typographic rules of which it has an imperfect model. On this view, nothing about what is happening with generative AI is particularly surprising or novel.
I wanted to know how to set the environment variables for CGI in IIS.
GPT-5's thinking produced a totally unrelated picture, and then it gave the wrong answer.
pojzon|6 months ago
Wouldn't that mean the fall of US pharmaceutical conglomerates, based on current laws about copyright and AI-generated content?
foobarqux|6 months ago
High chance, given that this is the same guy who came up with the SVG unicorn ("Sparks of AGI"), which raises the same question even more obviously.
trueismywork|6 months ago
https://mathstodon.xyz/@tao/114881418225852441
https://mashable.com/article/openai-claims-gold-medal-perfor...
Note that no one expressed skepticism when Google claimed they had achieved a gold medal, but no one is willing to believe OpenAI.
leeoniya|6 months ago
now let's invalidate probably 70% of all patents
marcuschong|6 months ago
https://x.com/ErnestRyu/status/1958408925864403068?t=QmTqOcx...
shaldengeki|6 months ago
https://xcancel.com/SebastienBubeck/status/19581986678373298...
stevenhuang|6 months ago
Bad at arithmetic, promising at math: https://www.lesswrong.com/posts/qy5dF7bQcFjSKaW58/bad-at-ari...