As a former problem designer for the IMO and similar contests, I deeply enjoyed reading this paper. At the same time, I'd like to point out that it was clear Geometry had to be the first topic to fall to AI (i.e., to smart indexing of knowledge and inference methods).
Among math olympiad topics, Geometry problems are often the most "mechanical." Once you can express the problem in terms of coordinates (think XY or complex plane), you're looking at a finite set of steps a computer can use to find a solution. Of course, time limits and human error get in the way of this being practical during the IMO itself. Back in the day, I'd use WolframAlpha to verify geometry proofs for problems I designed (+conjectures) using this approach.
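To make the coordinate-bash idea concrete, here is a minimal sketch in Python (my toy example, not from the paper): it checks exactly, with rational arithmetic, that a triangle's medians are concurrent. A real bash would use symbolic coordinates so the check covers all cases, not just one instance.

    from fractions import Fraction as F

    def line(p, q):
        # Line through p and q as (a, b, c) with a*x + b*y + c = 0.
        (x1, y1), (x2, y2) = p, q
        return (y2 - y1, x1 - x2, x2 * y1 - x1 * y2)

    def concurrent(l1, l2, l3):
        # Three lines meet in a point iff the 3x3 coefficient determinant vanishes.
        (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) = l1, l2, l3
        return (a1 * (b2 * c3 - b3 * c2)
                - b1 * (a2 * c3 - a3 * c2)
                + c1 * (a2 * b3 - a3 * b2)) == 0

    mid = lambda p, q: ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    A, B, C = (F(0), F(0)), (F(7), F(1)), (F(2), F(5))
    print(concurrent(line(A, mid(B, C)), line(B, mid(A, C)), line(C, mid(A, B))))
    # True, with no floating-point doubt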
Algebra (especially inequalities) can be similar – pull off some intense calculations and you often have your answer.
Where I'm really excited to see intelligent systems make progress is with Number Theory and Combinatorics problems. The search spaces are far more complex and they often require proving that something is impossible. These are the problems that would be difficult to solve with brute force computation.
As a consumer of those problems, first of all, thank you! Even decades on from high school I occasionally enjoy working on them.
But agree that geometry was obviously going to be first. From what I’ve gathered here, it’s not “brute forcing” in terms of relying on algebraic geometry, vectors, or complex number solutions, but it is brute force in that it’s exhaustively looking for “interesting” constructions.
Geometry was always my worst subject, but even so I felt like if given the right construction the problem was much easier. Unfortunately I never developed the intuition to hit on those constructions quickly. It seems this AI doesn’t either, but it can churn through them much faster — There are only so many perpendiculars and parallels and bisectors you can construct which you can more or less mechanically evaluate (map out all angles and ratios, try power of a point, etc.)
While this is incredibly impressive, it seems like Deep Blue:Kasparov::AlphaGeo:Terry Tao in the “engine vs AI” sense.
I agree Algebra is likely next. Like geometry, more often than not you “just” need a clever substitution or 3, of which there are only so many options to choose from.
Some combinatorics problems I think would also be amenable to this search strategy (e.g., finding things to count 2 ways), but that seems like a bridge further away, and only gets you a subset of problems.
Number theory would be my guess for the final frontier before a perfect 42.
I like your positive attitude to this. Do you feel any sense of loss over the possibility your skill (being extremely good in math) might soon be overtaken by machines? Or do you see no possibility of that happening anytime soon?
How did you get into such a position? Is there an application process of some sort?
After verifying solvability, what is the process for selecting the specific problems that will show up in the finished set? Is it by vote or some other evaluation?
I expected three-number inequalities to give in first, as there is less of a question about what counts as a proof. I didn't know that the geometry problem was already solved in 2000 (http://www.mmrc.iss.ac.cn/~xgao/paper/jar-gdbase.pdf).
Someone really should make an adventure game out of synthetic geometry, as it allows for a proof-writing language simpler than Lean's and allows for nice visuals.
If I read their paper right, this is legit work (much more legit than DeepMind's AI math paper last month falsely advertised as solving an open math research problem) but it's still pretty striking how far away the structure of it is from the usual idea of automated reasoning/intelligence.
A transformer is trained on millions of elementary geometry theorems and used as brute search for a proof, which because of the elementary geometry context has both a necessarily elementary structure and can be easily symbolically judged as true or false. When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work. [edit: as corrected by Imnimo, I got this backwards - the brute search is just pure brute search, the transformer is used to predict which extra geometric construction to add]
Also (not mentioned in the blog post) the actual problem statements had to be modified/adapted, e.g. the actual problem statement "Let AH1, BH2 and CH3 be the altitudes of a triangle ABC. The incircle W of triangle ABC touches the sides BC, CA and AB at T1, T2 and T3, respectively. Consider the symmetric images of the lines H1H2, H2H3, and H3H1 with respect to the lines T1T2, T2T3, and T3T1. Prove that these images form a triangle whose vertices lie on W." had to be changed to "Let ABC be a triangle. Define point I such that AI is the bisector of angle BAC and CI is the bisector of angle ACB. Define point T1 as the foot of I on line BC. Define T2 as the foot of I on line AC. Define point T3 as the foot of I on line AB. Define point H1 as the foot of A on line BC. Define point H2 as the foot of B on line AC. Define point H3 as the foot of C on line AB. Define point X1 as the intersection of circles (T1,H1) and (T2,H1). Define point X2 as the intersection of circles (T1,H2) and (T2,H2). Define point Y2 as the intersection of circles (T2,H2) and (T3,H2). Define point Y3 as the intersection of circles (T2,H3) and (T3,H3). Define point Z as the intersection of lines X1X2 and Y2Y3. Prove that T1I=IZ."
>A transformer is trained on millions of elementary geometry theorems and used as brute search for a proof, which because of the elementary geometry context has both a necessarily elementary structure and can be easily symbolically judged as true or false. When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work.
I don't think this is quite right. The brute force search is performed by a symbolic solver, not the transformer. When it runs out of new deductions, the transformer is asked to suggest possible extra constructions (not randomly added).
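In toy form, that loop looks something like this (everything here is my stand-in: the rules and facts are invented, symbolic_closure is a stand-in for the DDAR engine, and lm_propose for the transformer's suggestion):

    RULES = [({"AB = AC", "M is midpoint of BC"}, "AM perp BC"),
             ({"AM perp BC", "M is midpoint of BC"}, "angle ABM = angle ACM")]

    def symbolic_closure(facts):
        # Brute-force forward chaining until no rule adds anything new.
        changed = True
        while changed:
            changed = False
            for pre, post in RULES:
                if pre <= facts and post not in facts:
                    facts.add(post)
                    changed = True
        return facts

    def lm_propose(facts, goal):
        return "M is midpoint of BC"  # stand-in for the model's suggestion

    def solve(premises, goal, max_rounds=16):
        facts = set(premises)
        for _ in range(max_rounds):
            facts = symbolic_closure(facts)
            if goal in facts:
                return True
            facts.add(lm_propose(facts, goal))  # stuck: add a construction
        return False

    print(solve({"AB = AC"}, "angle ABM = angle ACM"))  # True, but only after the construction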
> it's still pretty striking how far away the structure of it is from the usual idea of automated reasoning/intelligence.
How so? Reasoning is fundamentally a search problem.
The process you described is exactly the process humans use: i.e. make a guess about what's useful, try to work out details mechanically. If you get stuck, make another guess, etc. So the process is like searching through a tree.
People figured out this process back in 1955 (and made a working prototype which can prove theorems): https://en.wikipedia.org/wiki/Logic_Theorist but it all hinges on using good 'heuristics'. Neural networks are relevant here as they can extract heuristics from data.
What do you think is "the usual idea of automated reasoning"? Some magic device which can solve any problem using a single linear pass?
> When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work
That's exactly how I was taught geometry in school and I hated it with all my guts. Only after making it into the math department of the university did I learn to do it properly and to enjoy it.
The problem is that using an LLM in the role of drawing auxiliary lines is too inefficient. It is hard to imagine people deploying a large number of machines to solve a single IMO problem. This field must be in the early stages of development, and much work remains unfinished. A reasonable view is that the search part should be handled by a small neural network, while the reasoning part is not difficult and does not require much improvement. Now is the time to use self-play to improve performance: treat the conclusion to be proved in a plane geometry problem as one point in the diagram and the conditions as another point, then have two players try to move towards each other as much as possible and share data. The contribution each player makes in this process is the analogue of wins and losses in Go, and performance can thus be improved through self-play.
Though this particular model doesn't sound generalizable, the neuro-symbolic approach seems very promising to me:
- linking the (increasingly powerful) "system 1" tools that are most of current ML with more structured "system 2" tools, like logical proof generation, which can plan and check the veracity / value of output.
- System 2 chugs along 'till it gets stuck, then system 1 jumps in to provide an intuitive guess on what part of state space to check next.
- Here they leveraged the ability to generate proofs by computer to create a data set of 100m proofs, enabling scalable self-supervised learning. Seems to me the symbolic domains are well formed to allow such generation of data, which while low value in each instance, might allow valuable pre-training in aggregate.
Putting these elements together is an approach that could get us quite far.
The key milestone will be moving away from the need to use specific formal / symbolic domains, and to generate a pretrained system that can generalize the skills learned from those domains.
> The key milestone will be moving away from the need to use specific formal / symbolic domains, and to generate a pretrained system that can generalize the skills learned from those domains.
You do not need to solve everything at once. This approach has the potential to revolutionize both math and programming by moving formal verification from being a niche tool into a regular part of every practitioner's toolbox.
It also completely solves (within the domain where it applies) one of the most fundamental problems of AI that the current round is calling "hallucinations"; however, that solution only works because we have a non-AI system to prove correctness.
At a high level, this approach is not really that new. Biochem has been using AI to help find candidate molecules, which are then verified by physical experimentation.
Combinatorial game AI has been using the AI as an input to old-fashioned Monte Carlo searches.
I appreciate that the authors released code and weights with their paper! This is the first high-profile DeepMind paper I can recall that has runnable inference code + checkpoints released. (Though I'm happy to be corrected by earlier examples I've missed)
I don't yet see a public copy of the training set / example training code, but still this is a good step towards providing something other researchers can build on -- which is after all the whole point of academic papers!
Yep, I'm missing the datasets as well. They have generated 100M synthetic examples ... Were these examples generated with AlphaGeometry? Where is the filtering code and the initial input used to generate these synthetics?
Am I wrong that they are using a T5 model? At least they are using the SentencePiece T5 vocabulary.
How many GPU hours did they spend training this model? Which training parameters were used?
Don't get me wrong, I find this system fascinating; it is what applied engineering should look like. But I'd like to know more about the training details and the initial data they used, as well as the methods of synthetic data generation.
I'm very curious how often the LM produces a helpful construction. Surely it must be doing better than random chance, but is it throwing out thousands of constructions before it finds a good one, or is it able to generate useful proposals at a rate similar to human experts?
They say in the paper, "Because the language model decoding
process returns k different sequences describing k alternative auxiliary
constructions, we perform a beam search over these k options, using
the score of each beam as its value function. This set-up is highly parallelizable across beams, allowing substantial speed-up when there are
parallel computational resources. In our experiments, we use a beam
size of k = 512, the maximum number of iterations is 16 and the branching factor for each node, that is, the decoding batch size, is 32."
But I don't totally understand how 512 and 16 translate into total number of constructions proposed. They also note that ablating beam size and max iterations seems to only somewhat degrade performance. Does this imply that the model is actually pretty good at putting helpful constructions near the top, and only for the hardest problems does it need to produce thousands?
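But let's try -- TL;DR 262,144, but don't take it literally:
- The output of a decoding function is a token, ~3/4 of a word. Let's just say 1 word.
- Tokens considered per output token = beam_size * branching_factor * max_iterations = 512 * 32 * 16 = 262,144.
- Let's take their sample solution and get a word count. https://storage.googleapis.com/deepmind-media/DeepMind.com/B...
- Total tokens for solution = 2289
- Total # of tokens considered = 600,047,616 = 262,144 * 2289
- Hack: "number of solutions considered" = total tokens considered / total tokens in solution = 262,144 (same # as the number of tokens we viewed at each output step, which makes sense)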
Interesting that the transformer used is tiny. From the paper:
"We use the Meliad library for transformer training with its base settings. The transformer has 12 layers, embedding dimension of 1,024, eight heads of attention and an inter-attention dense layer of dimension 4,096 with ReLU activation. Overall, the transformer has 151 million parameters, excluding embedding layers at its input and output heads. Our customized tokenizer is trained with ‘word’ mode using SentencePiece and has a vocabulary size of 757. We limit the maximum context length to 1,024 tokens and use T5-style relative position embedding. Sequence packing is also used because more than 90% of our sequences are under 200 in length."
It's not tiny; this is a quite normal size outside the field of LLMs, e.g. for normal-sized language models, translation models, or acoustic models. Some people would even call this large.
This suggests there may be some more low hanging fruit in the hard sciences for transformers to bite at, so long as they can be properly formulated. This was not a problem of scaling it seems.
The real TIL (to me) is that the previous state-of-the-art could solve 10 of these! I'd heard there was a decision algorithm for plane geometry problems but I didn't know it was a practical one. Some searching turned up http://www.mmrc.iss.ac.cn/~xgao/paper/book-area.pdf as a reference.
Yes, and even the non-neural-network, symbolic plus linear algebra component of AlphaGeometry is able to outperform the previous state of the art. So a decent amount of work here went into the components that aren't neural networks at all.
This is nice, but I suspect that a barycentric coordinate bash with the formulas in Evan Chen's book could also do 30% of the IMO (seeing that most of it is about triangles) on a modern laptop.
I was all ready to be skeptical, since for most works of this kind the outputs are 'not like human proofs', but then I saw Evan Chen's quote that it is indeed a clean, human-readable proof. Evan Chen is a prominent member of the Olympiad math community and the author of a famous Olympiad geometry book[1], so this time I will have to concede that machines have indeed conquered part of the IMO problems.
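[1]: https://web.evanchen.cc/geombook.html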
But then again, there's an error in the proof of IMO P3, in Fig. 1(f) and Step 26 of the full proof in the supplementary material[1], which states that ∠GMD = ∠GO2D. This is incorrect; it should be ∠GMD + ∠GO2D = π. I am trying to follow its logic but I cannot parse Step 25. Did it hallucinate this step?
It has the right idea of O2 being on the nine point circle though.
edit: I retract my statement. It looks like it is using directed angles[2] which then the statement becomes correct.
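[1]: https://storage.googleapis.com/deepmind-media/DeepMind.com/B...
[2]: https://web.evanchen.cc/handouts/Directed-Angles/Directed-An...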
>To train AlphaGeometry's language model, the researchers had to create their own training data to compensate for the scarcity of existing geometric data. They generated nearly half a billion random geometric diagrams and fed them to the symbolic engine. This engine analyzed each diagram and produced statements about their properties. These statements were organized into 100 million synthetic proofs to train the language model.
With all the bickering about copyright, could something similar be used for coding llms? Would kill the ip issues, at least for coding
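Related: https://arxiv.org/abs/2207.14502 ("Language Models Can Teach Themselves to Program Better")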
For language, the symbolic engine itself would likely be trained on copyrighted input, unlike the geometry engine, since math & math facts are not covered by copyright.
If you couple random text generation and such an engine for language, you'd be laundering your training data using the extra step (and the quality will likely be worse due to multiplicative errors).
What statements about properties of randomly generated code snippets would be useful for coding LLMs? You would need to generate text explaining what each snippet does, but that would require an existing coding LLM, so any IP concerns would persist.
There's a worked example near the article's start proving that an isosceles triangle has equal angles: the so-called pons asinorum. The caption concludes with "In this example, just one construct is required."
That strikes me as odd, because I'd expect the symbolic engine to require zero such constructs to find a proof. Although dropping a bisector is the standard grade-school proof strategy, there's a well-known proof that defines no additional entities: since AB = AC, we have (by side-side-side) that △ABC ≅ △ACB, and therefore that ∠ABC = ∠ACB.
It's funny because what that proof is most famous for is having been rediscovered by an automated theorem prover in the 1950s-1960s. (Wikipedia states that this is a myth, and that it was only rediscovered by Marvin Minsky pretending to be an automated theorem prover... which is true; but then Herbert Gelernter later wrote such a theorem prover! [1])
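[1] https://www.qedcat.com/function/27.3.pdf#page=18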
As soon as ChatGPT got released, I tried to make it solve IMO-style problems. It failed.
If this is legit, this is huge. Finding geometric proofs means proving things, and proving propositions is what an intelligence does. In my opinion, this is the closest we have gotten so far to AGI. Really excited to see what the future holds.
Proving propositions has been computer work for several decades.
This is not a pure LLM. Search algorithms are obviously incredibly powerful at solving search problems. The LLM contribution is an "intuitive" way to optimize the search.
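> When producing full natural-language proofs on IMO-AG-30, however, GPT-4 has a success rate of 0% ...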
> As soon as ChatGPT got released, I tried to make it solve IMO-style problems. It failed.
Have you tried the same questions with ChatGPT 4? It is a transformational change (no pun intended) over the earlier releases, and over all open-source models.
Just today, I needed to interpret some awkwardly-timestamped log data. I asked it a few questions along the lines of "What time was it 10,000 seconds before xx:yy:zz?" It didn't give me the answer, but it wrote and executed a Python program that did.
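Presumably it produced something like this (my reconstruction with a made-up timestamp, not ChatGPT's actual output):

    from datetime import datetime, timedelta

    # What time was it 10,000 seconds before 14:25:37?
    t = datetime.strptime("14:25:37", "%H:%M:%S") - timedelta(seconds=10_000)
    print(t.strftime("%H:%M:%S"))  # 11:38:57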
It looks like there's some interesting work connecting ML with symbolic reasoning (or searching). I'm closer to a layman in this area, but IIUC the latter is known to be rife with yet-to-be-understood heuristics for pruning the solution space, and ML models are pretty good in exactly this area.
I'm not in a position to suggest what needs to happen to further push the boundary, but my impression is that the current significant blocker is that we don't really have a way to construct a self-feedback loop that consistently iterates on and improves the model from its own output. If this can be done properly, we may see something incredible in a few years.
I didn't look further into their method, but it seems to me that symbolic reasoning is only important insofar as it makes the solution verifiable. That still is a glorious capability, but a very narrow one.
Google DeepMind keeps publishing these things while having missed the AI-for-the-people train, so I'm getting more and more the feeling that they have such a huge arsenal of AI tech piling up in their secret rooms, as if they're already simulating entire AI-based societies.
As if they're stuck in a loop of "let's improve this a bit more until it is even better", while AGI is already a solved problem for them.
Or they're just an uncoordinated, chaotic bunch of AI teams without a leader who unifies them. By leader I don't mean Demis Hassabis, but rather someone like Sundar Pichai.
I don’t get the criticism. It seems like basic research and the kind of thing that would lead the way to “AGI” (combining LLM-style prediction with logical reasoning). Unless you’re talking about what’s the point of publishing Nature papers; then it’s probably that the people involved want some concrete recognition of their work up to this point. And I suppose tech/investor press, until they get something useful out of it.
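Not even one thing they did was without an associated Nature paper, and the paper was always first.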
What they did here is one of the first steps towards an AI that generates nontrivial and correct mathematical proofs. (An alternative benchmark would be IMO-level inequalities. Other topics such as algebra, combinatorics and number theory are probably no easier than the rest of mathematics, thus less useful as stepping stones.)
And an AI that generates nontrivial and correct mathematical proofs would, in turn, be a good first step towards an AI that can "think logically" in the common-sense meaning of the word (i.e., not just mess around with words and sentences but actually have some logical theory behind what it is saying). It might be a dead end, but it might not be. Even if it is, it will be a boon to mathematics.
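Didn't they already have scaled down simulations of this?
https://www.nature.com/articles/s41586-023-06747-5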
My understanding is that they encoded the domain in several dozen mechanical rules, described in Extended Data Table 1, and then did a transformer-guided brute-force search for solutions.
Very interesting. I was skeptical they could get an LLM to do this alone, and reading the description, they don't. It's still AI I would say, maybe even more so, since it alternates between a generative machine learning model that comes up with hypothesis points and an optimizer that solves for those points and simplifies the problem. So it really is a mathematical reasoning system. Also good to note that it is tuned specifically to these types of geometry problems.
What deductive system are they using to verify the proofs? I'm asking because the conventions of olympiad geometry are slightly different from those of the rest of mathematics (you can make informal "general position" arguments; you can pooh-pooh issues of sign and betweenness; you can apply theorems to unstated limiting cases; etc.), and it is far from obvious to me how this logic can be formalized without causing contradictions.
They built their own symbolic geometry engine, called DDAR. The IMO judges will typically let you get away with much less rigor than a symbolic engine like this produces, so I don't think they are taking advantage of any of these conventions.
This sounds like the famous system 1 and system 2 thinking, with the prover doing the systematic system 2 thinking and the LLM doing the intuitive system 1 thinking.
Oh hey! I ghost-wrote an earlier article that references the same idea (https://blog.google/technology/ai/bard-improved-reasoning-go...). I wonder if they came to it independently or if they read mine and it planted some seeds? Either way, super cool work.
summary: generate 500M synthetic examples where you start with random constructions, deduce (not sure until saturation or fixed depth) and use each conclusion along with its dependency graph and original premises as a training example. The 500M reduces to 100M after dedupe and they used 7.5M CPU hours. 9M/100M used auxiliary/exogenous constructions (which I think of as like if you're doing this on program search and the program requires some constant pi that isn't derivable from the inputs). They encode the examples for a transformer and finetune on the auxiliary construction subset to eke out 2 more problems. Then beam search using the model to generate the next term and the logic engine to expand the closure until the conclusion is in the set.
They make a big deal about these auxiliary constructions but I'm not sure the results back that up. I also don't understand what "without pretraining" means; they say without pretraining it gets 21 problems, without fine-tuning 23 problems, and altogether 25. So are they getting 21/25 without their synthetic data at all??
I want to look at how the dependency graph is encoded in the training examples and whether you might expect different results from different encodings. Ideally I'd think you'd want that part of the model/embedding to be invariant across all of the possible (big number!) encodings.
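For what it's worth, my mental model of the dependency-graph step is a traceback from each conclusion to its minimal supporting facts, something like this (names and toy facts are mine, not the paper's):

    def traceback(conclusion, parents, premises):
        # parents maps each derived fact to the facts its deduction used
        needed, stack = set(), [conclusion]
        while stack:
            f = stack.pop()
            if f in needed:
                continue
            needed.add(f)
            if f not in premises:
                stack.extend(parents[f])
        return needed

    parents = {"angle ABM = angle ACM": {"AM perp BC", "M is midpoint of BC"},
               "AM perp BC": {"AB = AC", "M is midpoint of BC"}}
    print(traceback("angle ABM = angle ACM", parents,
                    {"AB = AC", "M is midpoint of BC"}))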
The human comparisons are really weak in Fig. 2 and unfortunately are probably what the media will mainly focus on. I do find the results interesting and wonder how well this translates into program synthesis, for example. The paper mentioned in these comments, which I have yet to read, "Language Models Can Teach Themselves to Program Better", sounds related. I'd be very interested to see more methods that are very different from LLM heuristic beam search, even though it does seem compelling and they get to publish lots of papers with it.
>> I do find the results interesting and wonder how well this translates into program synthesis for example.
In broad terms it's the same generate-and-test approach used for their AlphaCode system but with a theorem prover as a test step. In AlphaCode they use clustering and various heuristics to select programs, here they use the prover.
Eventually, I expect, they'll figure out you can use an automated theorem prover to check the correctness of (logic) programs.
Maybe it's mentioned somewhere but I missed it, could someone provide a list of the five problems that AlphaGeometry failed to solve? The paper mentioned IMO 2008p6 and IMO 2019p2. What are the other three?
I wonder how well it would do with html page layout. It is nice to see advancements in geometric reasoning!
To give you a basis for comparison, I've found that (paid) ChatGPT even quite recently was not able to reason about simple html and css geometric constructs, such as what div is inside what container: with just two divs or containers on the page, it still applied styling to the wrong one and I had to move the line myself to the one it meant to style, even after I described to ChatGPT what the page currently looked like.
I just tried it again, it did succeed with this prompt:
"Create an HTML page that you divide in half vertically, you will set the background of divs to different colors. After dividing the page in half vertically, on the left half of the page put three different divs each taking up a third of the vertical space, color them red, orange, yellow. On the right half of the page create two different divs each taking half of the vertical space, color them green, blue. Within each div put a paragraph in black letters with the color and location of the div it is located in, such as "Left hand side, top div. This background is red." and so on for top, middle, and bottom on the left and top and bottom on the right, with the colors red, orange, yellow, green and blue respectively as their backgrounds. Reply with the complete html and css style page in single html file that I can copy and paste into a single file, when loaded it should show the layout I have described." When I try this right now I get a successsful page.
It previously failed at similar layout questions.
If ChatGPT had this "Olympiad-level" geometric reasoning it would be amazing for producing html pages that include complex layout. It currently doesn't always succeed at this.
Layout is the area that I've found ChatGPT is weakest in for generating code, which I chalk up to the fact that it doesn't have a vision subsystem, so it can't really picture what it's doing.
When a human solves these problems, you might state a "step" in natural language. Like, "apply a polar transformation to this diagram". But that same single step, if you translate into a low-level DSL, could be dozens of different steps.
For an example, check out this formalization of a solution to a non-geometry IMO problem:
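https://gist.github.com/mlyean/b4cc46cf6b0705c1226511a2b404d...
Roughly the same solution is written here in human language:
https://artofproblemsolving.com/wiki/index.php/1972_IMO_Prob...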
It's not precisely clear what counts as a "step" but to me this seems like roughly 50 steps in the low-level formalized proof, and 5 steps in the human-language proof.
So, my conclusion is just that I wouldn't necessarily say that a 150-step automated proof was inelegant. It can also be that the language used to express these proofs fills in more of the details.
It's like the difference between counting lines of code in compiled code and in a high-level language; counting the steps is really dependent on the level at which you express it.
Would something like this help AI/computers see the world better for robotics and self-driving, etc.? I am not a programmer or in this field, but I don't know why, intuitively, reading the headline alone made me feel the world of robotics and self-driving, i.e. where AI interacts with the physical world, is about to change.
What happens to science when the most talented kids won't be able to compete in ~2 years tops? Would our civilization reach a plateau or start a downward trajectory as there will be no incentive to torture oneself to become the best in the world? Will it all fade away like chess once computers started beating grandmasters?
I've accepted that the human form, with its ~3 pounds of not especially optimized brainpower, is basically going to be relegated to the same status as demoscene hardware for anything that matters after this century.
That's cool by me, though. This bit of demoscene hardware experiences qualia, and that combination is weird and cool enough to make me want to push myself in new and weird directions. That's what play is in a way.
The incentive to compete in the IMO is that it's fun to do math contests, it's fun to win math contests, and (if you think that far ahead as a high schooler) it looks good on your resume. None of that incentive will go away if the computers get better at math contests.
Unless you can figure out AI that is available everywhere, all the time (even in the most remote jungle with no power), there will always be value in making humans smarter.
It was interesting. One of the reviewers noted that one of the generated proofs didn’t seem that elegant in his opinion.
Given enough compute, I wonder how much this would be improved by having it find as many solutions as possible within the allotted time and proposing the simplest one.
The issue is that the computer-generated proofs operate at a very low level, step by step, like writing a program in assembly language without the use of coherent structure.
The human proof style instead chunks the parts of a solution into human-meaningful "lemmas" (helper theorems) and builds bodies of theory into well-defined and widely used abstractions like complex numbers or derivatives, with a corpus of accepted results.
Some human proofs of theorems also have a bit of this flavor of inscrutable lists of highly technical steps, especially the first time something is proven. Over time, the most important theorems are often recast in terms of a more suitable grounding theory, in which they can be proven with a few obvious statements or sometimes a clever trick or two.
The most surprising observation for me is that the well-known Wu's method for proving geometric theorems solved 10 problems! AlphaGeometry solved 25 problems.
“What the I.M.O. is testing is very different from what creative mathematics looks like for the vast majority of mathematicians,” he said.
---
Not to pick on this guy, but this is ridiculous goal post shifting. It's just astounding what people will hand-wave away as not requiring intelligence.
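An analogy with software is that math competitions are like (very hard) leetcode.
There was an article posted on HN recently that is related: https://benexdict.io/p/math-team
It isn't untrue, but both are considered metrics by the community. (Whether rightfully or not seems irrelevant to ML.)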
I guess it kind of depends on what a goal post should represent. If it represents an incremental goal then obviously you are correct.
I think the ultimate goal though is clearly that AI systems will be able to generate new high-value knowledge and proofs. I think realistically the final goal post has always been at that point.
Also if you read about the structure of AlphaGeometry, I think it's very hard to maintain that it "requires intelligence." As AI stuff goes, it's pretty comprehensible and plain.
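> Note that the performance of GPT-4 performance on IMO problems can also be contaminated by public solutions in its training data.
Doesn't anybody proofread in Nature?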
Not a subject matter expert, so forgive me if the question is unintentionally obtuse, but it seems like a reasonable statement. They seem to be inferring that problems with existing public solutions wouldn't be a good indicator of performance in solving novel problems, likely a more important evaluative measure than how fast it can replicate an answer it's already seen. Since you couldn't know if that solution was in its training data, you couldn't know if you were doing that or doing a more organic series of problem-solving steps, therefore contaminating it. What's the problem with saying that?
This is fascinating and inspiring progress! What's crazy to me is how many people are looking to put stakes in the ground in front of this, claiming breathlessly, before anyone can speak, that this isn't real AI progress or this isn't actually doing it how humans supposedly do, and implying that therefore it's not a big deal. I'm sorry to break it to you folks, this is just another signpost moment. One more event to mark our place in history as we sail on. And when we look back on it, there isn't going to be all the disclaimers people are trying to throw down now. We had ImageNet and Deep Blue, we had AlphaZero, AlphaGo, GPT-4, AlphaFold, and now we have AlphaGeometry. As DJ Khaled likes to point out, this is.. Another one.
This reads like another "edge lord" or "karma farming" comment. Seems to be more and more frequent here lately.
The point of the forum is that people who are interested in these things can discuss them together openly. People are having discussions about the way they interpret the paper and what they think it means. You seem to be telling those people that's pointless and we should just accept this as a breakthrough because you told us so. Which I think goes against the spirit of this forum and the ability for people to look at the world with an objective lens.
If this work is as good as you say it is, and it very well might be, it will stand on its own no matter what people say about it. Time will tell if this was meaningful or not.
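It’s our maths. No one else owns it?
Funny game the AI field is playing now. One-upmanship seems to be the aim of the game.
“Our” doesn’t always imply ownership/title. I do not own my dad.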