AlphaGeometry: An Olympiad-level AI system for geometry

545 points | FlawedReformer | 2 years ago | deepmind.google

170 comments

szmerdi|2 years ago

As a former problem designer for IMO and similar contests, I deeply enjoyed reading this paper. At the same time, I'd like to point out that it was clear geometry had to be the first topic to fall to AI (i.e., smart knowledge and inference-method indexing). Among math olympiad topics, geometry problems are often the most "mechanical." Once you can express the problem in terms of coordinates (think XY or the complex plane), you're looking at a finite set of steps a computer can use to find a solution. Of course, time limits and human error get in the way of this being practical during the IMO itself. Back in the day, I'd use WolframAlpha to verify geometry proofs for problems I designed (+conjectures) using this approach.
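The "coordinate bash" described above can be made concrete with a few lines of exact rational arithmetic. This is only an illustrative toy (it checks one numeric instance rather than proving the general statement, and none of these helper names come from AlphaGeometry): it verifies that the medians of a triangle concur at the centroid.

```python
from fractions import Fraction as F

def line(p, q):
    """Coefficients (a, b, c) of the line ax + by = c through points p and q."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    return a, b, a * x1 + b * y1

def intersect(l1, l2):
    """Exact intersection point of two non-parallel lines."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

# Arbitrary triangle with exact rational coordinates.
A, B, C = (F(0), F(0)), (F(4), F(0)), (F(1), F(3))

# Two medians must meet at the centroid ((Ax+Bx+Cx)/3, (Ay+By+Cy)/3).
m1 = line(A, midpoint(B, C))
m2 = line(B, midpoint(A, C))
g = intersect(m1, m2)
assert g == ((A[0] + B[0] + C[0]) / 3, (A[1] + B[1] + C[1]) / 3)
print("medians concur at", g)
```

A real "bash" would carry symbolic coordinates through the same machinery, which is exactly why such proofs are mechanical (and tedious) for a computer.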

Algebra (especially inequalities) can be similar – pull off some intense calculations and you often have your answer.

Where I'm really excited to see intelligent systems make progress is with Number Theory and Combinatorics problems. The search spaces are far more complex and they often require proving that something is impossible. These are the problems that would be difficult to solve with brute force computation.

fovc|2 years ago

As a consumer of those problems, first of all, thank you! Even decades on from high school I occasionally enjoy working on them.

But agree that geometry was obviously going to be first. From what I’ve gathered here, it’s not “brute forcing” in terms of relying on algebraic geometry, vectors, or complex number solutions, but it is brute force in that it’s exhaustively looking for “interesting” constructions.

Geometry was always my worst subject, but even so I felt like if given the right construction the problem was much easier. Unfortunately I never developed the intuition to hit on those constructions quickly. It seems this AI doesn’t either, but it can churn through them much faster — There are only so many perpendiculars and parallels and bisectors you can construct which you can more or less mechanically evaluate (map out all angles and ratios, try power of a point, etc.)

While this is incredibly impressive, it seems like Deep Blue : Kasparov :: AlphaGeo : Terry Tao in the “engine vs AI” sense.

I agree Algebra is likely next. Like geometry, more often than not you “just” need a clever substitution or 3, of which there are only so many options to choose from.

Some combinatorics problems I think would also be amenable to this search strategy (e.g., finding things to count 2 ways), but that seems like a bridge further away, and only gets you a subset of problems.

Number theory would be my guess for the final frontier before a perfect 42.

weatherlite|2 years ago

I like your positive attitude to this. Do you feel any sense of loss over the possibility your skill (being extremely good in math) might soon be overtaken by machines? Or do you see no possibility of that happening anytime soon?

Dracophoenix|2 years ago

How did you get into such a position? Is there an application process of some sort?

After verifying solvability, what is the process for selecting the specific problems that will show up in the finished set? Is it by vote or some other evaluation?

generationP|2 years ago

I expected three-number inequalities to give in first, as there is less of a question about what counts as a proof. I didn't know that the latter problem was already solved in 2000 ( http://www.mmrc.iss.ac.cn/~xgao/paper/jar-gdbase.pdf ).

Someone really should make an adventure game out of synthetic geometry, as it allows for a proof-writing language simpler than Lean's and allows for nice visuals.

nybsjytm|2 years ago

If I read their paper right, this is legit work (much more legit than DeepMind's AI math paper last month falsely advertised as solving an open math research problem) but it's still pretty striking how far away the structure of it is from the usual idea of automated reasoning/intelligence.

A transformer is trained on millions of elementary geometry theorems and used as brute search for a proof, which because of the elementary geometry context has both a necessarily elementary structure and can be easily symbolically judged as true or false. When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work. [edit: as corrected by Imnimo, I got this backwards - the brute search is just pure brute search, the transformer is used to predict which extra geometric construction to add]

Also (not mentioned in the blog post) the actual problem statements had to be modified/adapted, e.g. the actual problem statement "Let AH1, BH2 and CH3 be the altitudes of a triangle ABC. The incircle W of triangle ABC touches the sides BC, CA and AB at T1, T2 and T3, respectively. Consider the symmetric images of the lines H1H2, H2H3, and H3H1 with respect to the lines T1T2, T2T3, and T3T1. Prove that these images form a triangle whose vertices lie on W." had to be changed to "Let ABC be a triangle. Define point I such that AI is the bisector of angle BAC and CI is the bisector of angle ACB. Define point T1 as the foot of I on line BC. Define T2 as the foot of I on line AC. Define point T3 as the foot of I on line AB. Define point H1 as the foot of A on line BC. Define point H2 as the foot of B on line AC. Define point H3 as the foot of C on line AB. Define point X1 as the intersection of circles (T1,H1) and (T2,H1). Define point X2 as the intersection of circles (T1,H2) and (T2,H2). Define point Y2 as the intersection of circles (T2,H2) and (T3,H2). Define point Y3 as the intersection of circles (T2,H3) and (T3,H3). Define point Z as the intersection of lines X1X2 and Y2Y3. Prove that T1I=IZ."

Imnimo|2 years ago

>A transformer is trained on millions of elementary geometry theorems and used as brute search for a proof, which because of the elementary geometry context has both a necessarily elementary structure and can be easily symbolically judged as true or false. When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work.

I don't think this is quite right. The brute force search is performed by a symbolic solver, not the transformer. When it runs out of new deductions, the transformer is asked to suggest possible extra constructions (not randomly added).
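The corrected loop described in this subthread can be sketched in a few lines. All names here are illustrative, not AlphaGeometry's actual API: a symbolic deduction step is iterated to saturation, and only when the goal is still out of reach is the model asked to suggest an auxiliary construction.

```python
def prove(premises, goal, deduce_step, suggest_construction, max_rounds=16):
    """Sketch of the solver/LM interplay: the symbolic engine closes the
    premises under its deduction rules; when the goal is still missing,
    the language model proposes one extra construction and we retry."""
    facts = set(premises)
    for _ in range(max_rounds):
        # Symbolic part: deduce until saturation (no new facts appear).
        while True:
            new = deduce_step(facts) - facts
            if not new:
                break
            facts |= new
        if goal in facts:
            return True
        # Neural part: the LM suggests an auxiliary construction (e.g. a midpoint).
        facts.add(suggest_construction(facts, goal))
    return False

# Toy run: rule "a -> b" and "b & aux -> goal"; the stand-in "LM" supplies "aux".
rules = {frozenset({"a"}): "b", frozenset({"b", "aux"}): "goal"}
step = lambda fs: {concl for pre, concl in rules.items() if pre <= fs}
assert prove({"a"}, "goal", step, lambda fs, g: "aux")
```

The division of labor is the point: the search itself is exhaustive and exact, while the learned component only decides what raw material to add when the search stalls.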

killerstorm|2 years ago

> it's still pretty striking how far away the structure of it is from the usual idea of automated reasoning/intelligence.

How so? Reasoning is fundamentally a search problem.

The process you described is exactly the process humans use: i.e. make a guess about what's useful, try to work out details mechanically. If get stuck, make another guess, etc. So the process is like searching through a tree.

People figured out this process back in 1955 (and made a working prototype which can prove theorems): https://en.wikipedia.org/wiki/Logic_Theorist but it all hinges on using good 'heuristics'. Neural networks are relevant here as they can extract heuristics from data.

What do you think is "the usual idea of automated reasoning"? Some magic device which can solve any problem using a single linear pass?

alexey-salmin|2 years ago

> When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work

That's exactly how I was taught geometry in school and I hated it with all my guts. Only after making it into the math department of the university did I learn to do it properly and to enjoy it.

hxypqr|2 years ago

The problem is that using an LLM in the role of drawing auxiliary lines is too inefficient. It is hard to imagine people deploying a large number of machines to solve a single IMO problem. This field must be in an early stage of development, and much work remains unfinished. A reasonable view is that the search part should be replaced by a small neural network; the reasoning part should not be difficult and does not require much improvement. Now is the time to use self-play to improve performance: treat the conclusion to be proved in a plane-geometry problem as one point in the diagram and the conditions as another point, then have two players try to move towards each other as much as possible and share data. The contribution each player makes in this process can serve as an analogue of computing wins and losses in Go, and thus improve performance through self-play.

NooneAtAll3|2 years ago

how did you edit your comment after it got visible to others?

marojejian|2 years ago

Though this particular model doesn't sound generalizable, the neuro-symbolic approach seems very promising to me:

- Linking the (increasingly powerful) "system 1" tools that make up most of current ML with more structured "system 2" tools, like logical proof generation, which can plan and check the veracity / value of output.

- System 2 chugs along 'til it gets stuck, then system 1 jumps in to provide an intuitive guess at what part of state space to check next.

- Here they leveraged the ability to generate proofs by computer to create a data set of 100M proofs, enabling scalable self-supervised learning. It seems to me the symbolic domains are well formed to allow such generation of data, which, while low value in each instance, might allow valuable pre-training in aggregate.

Putting these elements together is an approach that could get us quite far.

The key milestone will be moving away from the need to use specific formal / symbolic domains, and to generate a pretrained system that can generalize the skills learned from those domains.

gizmo686|2 years ago

> The key milestone will be moving away from the need to use specific formal / symbolic domains, and to generate a pretrained system that can generalize the skills learned from those domains.

You do not need to solve everything at once. This approach has the potential to revolutionize both math and programming by moving formal verification from being a niche tool to a regular part of every practitioner's toolbox.

It also completely solves (within the domain it applies to) one of the most fundamental problems of AI, the one the current round is calling "hallucinations"; however, that solution only works because we have a non-AI system to prove correctness.

At a high level, this approach is not really that new. Biochem has been using AI to help find candidate molecules, which are then verified by physical experimentation.

Combinatorial game AI has been using the AI as an input to old-fashioned Monte Carlo searches.

uptownfunk|2 years ago

This I suspect is the closest chance to reaching some flavor of AGI

bnprks|2 years ago

I appreciate that the authors released code and weights with their paper! This is the first high-profile DeepMind paper I can recall that has runnable inference code + checkpoints released. (Though I'm happy to be corrected by earlier examples I've missed)

I don't yet see a public copy of the training set / example training code, but still this is a good step towards providing something other researchers can build on -- which is after all the whole point of academic papers!

pk-protect-ai|2 years ago

Yeap. I'm missing the datasets as well. They have generated 100M synthetic examples ... Were these examples generated with AlphaGeometry? Where is the filtering code and the initial input used to generate these synthetic examples?

Am I wrong that they are using a T5 model? At least they are using the SentencePiece T5 vocabulary.

How many GPU hours did they spend training this model? Which training parameters were used?

Don't get me wrong, I find this system fascinating; it is what applied engineering should look like. But I'd like to know more about the training details and the initial data they used, as well as the methods of synthetic data generation.

Imnimo|2 years ago

I'm very curious how often the LM produces a helpful construction. Surely it must be doing better than random chance, but is it throwing out thousands of constructions before it finds a good one, or is it able to generate useful proposals at a rate similar to human experts?

They say in the paper, "Because the language model decoding process returns k different sequences describing k alternative auxiliary constructions, we perform a beam search over these k options, using the score of each beam as its value function. This set-up is highly parallelizable across beams, allowing substantial speed-up when there are parallel computational resources. In our experiments, we use a beam size of k = 512, the maximum number of iterations is 16 and the branching factor for each node, that is, the decoding batch size, is 32."

But I don't totally understand how 512 and 16 translate into total number of constructions proposed. They also note that ablating beam size and max iterations seems to only somewhat degrade performance. Does this imply that the model is actually pretty good at putting helpful constructions near the top, and only for the hardest problems does it need to produce thousands?
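For intuition, the quoted search procedure can be sketched generically. The parameter names mirror the quoted k = 512, branching factor 32 and 16 iterations, but the `score` and `expand` callbacks are placeholders, not the paper's actual value function or decoder.

```python
import heapq

def beam_search(initial, score, expand, beam_size=512, branching=32, max_iters=16):
    """Generic beam search of the kind the paper describes: keep the k
    best-scoring partial sequences, let each propose `branching`
    continuations per iteration, and prune back down to k."""
    beams = [(score(initial), initial)]
    for _ in range(max_iters):
        candidates = []
        for _, state in beams:
            # Each beam proposes `branching` continuations (the decoding batch).
            for nxt in expand(state)[:branching]:
                candidates.append((score(nxt), nxt))
        if not candidates:
            break
        # Keep the beam_size highest-scoring candidates.
        beams = heapq.nlargest(beam_size, candidates, key=lambda t: t[0])
    return beams

# Toy run: grow digit strings, score = numeric value; after one iteration
# the best surviving beam is "09".
out = beam_search("0", score=lambda s: int(s),
                  expand=lambda s: [s + d for d in "0123456789"],
                  beam_size=4, branching=10, max_iters=1)
assert out[0][1] == "09"
```

In the paper's setting each state would be a partial sequence of auxiliary constructions scored by the LM, with the symbolic engine run on each candidate to test whether the goal becomes reachable.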

refulgentis|2 years ago

IMHO: this bumps, hard, against limitations of language / human-machine analogies.

But let's try -- TL;DR 262,144, but don't take it literally:

- The output of a decoding function is a token, ~3/4 of a word. Let's just say 1 word.

- Tokens considered per output token = beam_size * branching_factor * max_iterations = 512 * 32 * 16 = 262,144.

- Let's take their sample solution and get a word count: https://storage.googleapis.com/deepmind-media/DeepMind.com/B...

- Total tokens for the solution = 2289

- Total # of tokens considered = 262,144 * 2289 = 600,047,616

- Hack: "number of solutions considered" = total tokens considered / total tokens in solution

- = 262,144 (same # as tokens considered per output token, which makes sense)
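Spelled out as a quick, checkable calculation (same numbers as above):

```python
# The back-of-envelope arithmetic from the comment above.
beam_size, branching, max_iters = 512, 32, 16
per_token = beam_size * branching * max_iters
assert per_token == 262_144

solution_tokens = 2289                       # token count of the sample solution
total_considered = per_token * solution_tokens
assert total_considered == 600_047_616

# "Number of solutions considered" collapses back to the per-token figure.
assert total_considered // solution_tokens == per_token
print(per_token, total_considered)
```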

apsec112|2 years ago

Interesting that the transformer used is tiny. From the paper:

"We use the Meliad library for transformer training with its base settings. The transformer has 12 layers, embedding dimension of 1,024, eight heads of attention and an inter-attention dense layer of dimension 4,096 with ReLU activation. Overall, the transformer has 151 million parameters, excluding embedding layers at its input and output heads. Our customized tokenizer is trained with ‘word’ mode using SentencePiece and has a vocabulary size of 757. We limit the maximum context length to 1,024 tokens and use T5-style relative position embedding. Sequence packing is also used because more than 90% of our sequences are under 200 in length."
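As a rough sanity check of the quoted figure, the standard transformer parameter count (attention projections plus the MLP, ignoring biases, layer norms, and the embedding layers the paper explicitly excludes) lands almost exactly on 151M:

```python
# Rough sanity check of the quoted 151M-parameter figure, assuming a
# standard transformer block; biases, layer norms and embeddings omitted.
layers, d_model, d_ff = 12, 1024, 4096

attn = 4 * d_model * d_model   # Q, K, V and output projections
ffn = 2 * d_model * d_ff       # the two dense layers of the MLP
total = layers * (attn + ffn)

print(f"{total:,}")  # 150,994,944, i.e. ~151M, matching the paper
```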

albertzeyer|2 years ago

It's not tiny, this is a quite normal size outside the field of LLMs, e.g. normal-sized language models, or also translation models, or acoustic models. Some people even would call this large.

ipnon|2 years ago

This suggests there may be some more low hanging fruit in the hard sciences for transformers to bite at, so long as they can be properly formulated. This was not a problem of scaling it seems.

zodiac|2 years ago

The real TIL (to me) is that the previous state-of-the-art could solve 10 of these! I'd heard there was a decision algorithm for plane geometry problems but I didn't know it was a practical one. Some searching turned up http://www.mmrc.iss.ac.cn/~xgao/paper/book-area.pdf as a reference.

lacker|2 years ago

Yes, and even the non-neural-network, symbolic plus linear algebra component of AlphaGeometry is able to outperform the previous state of the art. So a decent amount of work here went into the components that aren't neural networks at all.

generationP|2 years ago

This is nice, but I suspect that a barycentric coordinate bash with the formulas in Evan Chen's book could also do 30% of the IMO (seeing that most of it is about triangles) on a modern laptop.

qrian|2 years ago

I was all ready to be skeptical, since for most works of this kind the outputs are 'not like human proofs', but then I saw Evan Chen's quote that it is indeed a clean, human-readable proof. Evan Chen is a prominent member of the olympiad math community and the author of a famous olympiad geometry book[1], so this time I will have to concede that machines have indeed conquered part of the IMO problems.

[1]: https://web.evanchen.cc/geombook.html

qrian|2 years ago

But then again, there's an error in the proof of IMO P3, in Fig. 1f and Step 26 of the full proof in the supplementary material[1], which states that ∠GMD = ∠GO2D; this is incorrect. It should be ∠GMD + ∠GO2D = π. I am trying to follow its logic but I cannot parse Step 25. Did it hallucinate this step?

It has the right idea of O2 being on the nine point circle though.

edit: I retract my statement. It looks like it is using directed angles[2] which then the statement becomes correct.

[1]: https://storage.googleapis.com/deepmind-media/DeepMind.com/B...

[2]: https://web.evanchen.cc/handouts/Directed-Angles/Directed-An...

artninja1988|2 years ago

>To train AlphaGeometry's language model, the researchers had to create their own training data to compensate for the scarcity of existing geometric data. They generated nearly half a billion random geometric diagrams and fed them to the symbolic engine. This engine analyzed each diagram and produced statements about their properties. These statements were organized into 100 million synthetic proofs to train the language model.

With all the bickering about copyright, could something similar be used for coding llms? Would kill the ip issues, at least for coding

sangnoir|2 years ago

For language, the symbolic engine itself would likely be trained on copyrighted input, unlike the geometry engine, since math & math facts are not covered by copyright.

If you couple random text generation and such an engine for language, you'd be laundering your training data using the extra step (and the quality will likely be worse due to multiplicative errors).

nodogoto|2 years ago

What statements about properties of randomly generated code snippets would be useful for coding LLMs? You would need to generate text explaining what each snippet does, but that would require an existing coding LLM, so any IP concerns would persist.

gisearkr|2 years ago

There's a worked example near the article's start proving that an isosceles triangle has equal angles: the so-called pons asinorum. The caption concludes with "In this example, just one construct is required."

That strikes me as odd, because I'd expect the symbolic engine to require zero such constructs to find a proof. Although dropping a bisector is the standard grade-school proof strategy, there's a well-known proof that defines no additional entities: since AB = AC, we have (by side-side-side) that △ABC ≅ △ACB, and therefore that ∠ABC = ∠ACB.

It's funny because what that proof is most famous for is having been rediscovered by an automated theorem prover in the 1950s-1960s. (Wikipedia states that this is a myth, and that it was only rediscovered by Marvin Minsky pretending to be an automated theorem prover... which is true; but then Herbert Gelernter later wrote such a theorem prover! [1])

[1] https://www.qedcat.com/function/27.3.pdf#page=18

TFortunato|2 years ago

I'm a big fan of approaches like this, that combine deep learning / newer techniques w/ "GOFAI" approaches.

FranchuFranchu|2 years ago

As soon as ChatGPT got released, I tried to make it solve IMO-style problems. It failed.

If this is legit, this is huge. Finding geometric proofs means proving things, and proving propositions is what an intelligence does. In my opinion, this is the closest we have gotten so far to AGI. Really excited to see what the future holds.

gowld|2 years ago

Proving propositions has been computer work for several decades.

This is not a pure LLM. Search algorithms are obviously incredibly powerful at solving search problems. The LLM contribution is an "intuitive" way to optimize the search.

1024core|2 years ago

From the paper:

> When producing full natural-language proofs on IMO-AG-30, however, GPT-4 has a success rate of 0% ...

CamperBob2|2 years ago

> As soon as ChatGPT got released, I tried to make it solve IMO-style problems. It failed.

Have you tried the same questions with ChatGPT 4? It is a transformational change (no pun intended) over the earlier releases, and over all open-source models.

Just today, I needed to interpret some awkwardly-timestamped log data. I asked it a few questions along the lines of "What time it was 10,000 seconds before xx:yy:zz?" It didn't give me the answer, but it wrote and executed a Python program that did.

summerlight|2 years ago

It looks like there's some interesting works to connect ML with symbolic reasoning (or searching). I'm closer to layman in this area but IIUC the latter is known to be rife with yet-to-be-understood heuristics to prune out the solution space and ML models are pretty good at this area. I'm not in a position to suggest what needs to happen to further push the boundary, but in my impression it looks like the current significant blocker is that we don't really have a way to construct a self-feedback loop structure that consistently iterates and improves the model from its own output. If this can be done properly, we may see something incredible in a few years.

rafaelero|2 years ago

I didn't look further into their method, but it seems to me that symbolic reasoning is only important insofar as it makes the solution verifiable. That still is a glorious capability, but a very narrow one.

qwertox|2 years ago

Google DeepMind keeps publishing these things while having missed the AI-for-the-people train, so I'm getting more and more the feeling that they have such a huge arsenal of AI tech piling up in their secret rooms, as if they're already simulating entire AI-based societies.

As if they're stuck in a loop of "let's improve this a bit more until it is even better", while AGI is already a solved problem for them.

Or they're just an uncoordinated, chaotic bunch of AI teams without a leader who unifies them. By leader I don't mean Demis Hassabis, but rather someone like Sundar Pichai.

ks2048|2 years ago

I don’t get the criticism. It seems like basic research and the kind of thing that would lead the way to “AGI” (combining LLM-style prediction with logical reasoning). Unless you’re talking about what’s the point of publishing Nature papers: then it’s probably that the people involved want some concrete recognition for their work up to this point. And I suppose tech/investor press until they get something useful out of it.

achierius|2 years ago

I think Occam's razor would have something to say about the relative likelihood of those two options.

hacketthfk|2 years ago

They are optimizing for Nature papers, not general usability.

Not even one thing they did was without an associated Nature paper, and the paper was always first.

generationP|2 years ago

What they did here is one of the first steps towards an AI that generates nontrivial and correct mathematical proofs. (An alternative benchmark would be IMO-level inequalities. Other topics such as algebra, combinatorics and number theory are probably no easier than the rest of mathematics, thus less useful as stepping stones.)

And an AI that generates nontrivial and correct mathematical proofs would, in turn, be a good first step towards an AI that can "think logically" in the common-sense meaning of the word (i.e., not just mess around with words and sentences but actually have some logical theory behind what it is saying). It might be a dead end, but it might not be. Even if it is, it will be a boon to mathematics.

nybsjytm|2 years ago

This is at almost the polar opposite end of the spectrum from "AGI," it's centered on brute search.

senseiV|2 years ago

> simulating entire AI-based societies.

Didn't they already have scaled-down simulations of this?

riku_iki|2 years ago

My understanding is that they encoded domain in several dozens of mechanical rules described in extended data table 1, and then did transformer-guided brute force search for solutions.

compthink|2 years ago

Very interesting. I was suspicious they could get an LLM to do this alone, and reading the description, they don't. It's still AI, I would say, maybe even more so, since it alternates between a generative machine learning model to come up with hypothesis points and an optimizer to solve for those points and simplify the problem. So it really is a mathematical reasoning system. Also good to note that it is tuned specifically to these types of geometry problems.

generationP|2 years ago

What deductive system are they using to verify the proofs? I'm asking because the conventions of olympiad geometry are slightly different from those of the rest of mathematics (you can make informal "general position" arguments; you can pooh-pooh issues of sign and betweenness; you can apply theorems to unstated limiting cases; etc.), and it is far from obvious to me how this logic can be formalized without causing contradictions.

lacker|2 years ago

They built their own symbolic geometry engine, called DDAR. The IMO judges will typically let you get away with much less rigor than a symbolic engine like this produces, so I don't think they are taking advantage of any of these conventions.

zbyforgotp|2 years ago

This sounds like the famous system 1 and system 2 thinking, with the prover doing the systematic system 2 thinking and the LLM doing the intuitive system 1 thinking.

topsycatt|2 years ago

Oh hey! I ghost-wrote an earlier article that references the same idea (https://blog.google/technology/ai/bard-improved-reasoning-go...). I wonder if they came to it independently or if they read mine and it planted some seeds? Either way, super cool work.

j2kun|2 years ago

They have been working on these ideas since like 2021 they said, so I don't think it's that recent.

aconz2|2 years ago

summary: generate 500M synthetic examples where you start with random constructions, deduce (not sure whether until saturation or to a fixed depth) and use each conclusion along with its dependency graph and original premises as a training example. The 500M reduces to 100M after dedupe, and they used 7.5M CPU hours. 9M of the 100M used auxiliary/exogenous constructions (which I think of as like if you're doing program search and the program requires some constant pi that isn't derivable from the inputs). They encode the examples for a transformer and finetune on the auxiliary-construction subset to eke out 2 more problems. Then beam search, using the model to generate the next term and the logic engine to expand the closure until the conclusion is in the set.
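The generate-deduce-traceback recipe in this summary can be sketched with toy symbolic facts standing in for geometric statements. This is illustrative only; the real pipeline uses the DDAR engine, random diagrams, and deduplication at massive scale.

```python
def synthesize(rules, premises):
    """Toy sketch of the synthetic-data recipe: start from premises,
    forward-chain until saturation, and emit each derived conclusion
    together with the premises of the rule that produced it."""
    facts = {p: frozenset() for p in premises}   # fact -> its direct dependencies
    changed = True
    while changed:                               # deduce to saturation
        changed = False
        for pre, concl in rules.items():
            if pre <= facts.keys() and concl not in facts:
                facts[concl] = pre
                changed = True
    # Each derived conclusion plus its dependencies is one training example.
    return [(c, deps) for c, deps in facts.items() if deps]

# Toy run: from premises {a, b} the engine derives c, then d from c.
rules = {frozenset({"a", "b"}): "c", frozenset({"c"}): "d"}
examples = synthesize(rules, {"a", "b"})
assert dict(examples) == {"c": frozenset({"a", "b"}), "d": frozenset({"c"})}
```

In the real system the "facts" are geometric statements about a random diagram, and the traceback through the dependency graph is what yields a proof as a training target.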

They make a big deal about these auxiliary constructions but I'm not sure the results back that up. I also don't understand what the "without pretraining" means; they say without pretraining it gets 21 problems, without fine-tuning 23 problems, and altogether 25. So are they getting 21/25 without their synthetic data at all??

I want to look at how the dependency graph is encoded in the training examples and whether you might expect different results from different encodings. Ideally I'd think you'd want that part of the model/embedding to be invariant across all possible (big number!) encodings.

The human comparisons are really weak in Fig 2 and unfortunately are probably what the media will mainly focus on. I do find the results interesting and wonder how well this translates into program synthesis, for example. The paper mentioned in these comments, which I have yet to read, "Language Models Can Teach Themselves to Program Better", sounds related. I'd be very interested to see more methods that are very different from LLM-heuristic beam search, even though it does seem compelling and they get to publish lots of papers with it.

YeGoblynQueenne|2 years ago

>> I do find the results interesting and wonder how well this translates into program synthesis for example.

In broad terms it's the same generate-and-test approach used for their AlphaCode system but with a theorem prover as a test step. In AlphaCode they use clustering and various heuristics to select programs, here they use the prover.

Eventually, I expect, they'll figure out you can use an automated theorem prover to check the correctness of (logic) programs.

ackfoobar|2 years ago

Euclidean geometry is decidable. Does that make it easier for computers, compared to other IMO problems?

hyh1048576|2 years ago

Maybe it's mentioned somewhere but I missed it, could someone provide a list of the five problems that AlphaGeometry failed to solve? The paper mentioned IMO 2008p6 and IMO 2019p2. What are the other three?

logicallee|2 years ago

I wonder how well it would do with html page layout. It is nice to see advancements in geometric reasoning!

To give you a basis for comparison, I've found that (paid) ChatGPT even quite recently was not able to reason about simple HTML and CSS geometric constructs, such as which div is inside which container: with just two divs or containers on the page, it still applied styling to the wrong one, and I had to move the line myself to the one it meant to style, even after I described to ChatGPT what the page currently looked like.

I just tried it again, and it did succeed with this prompt:

"Create an HTML page that you divide in half vertically, you will set the background of divs to different colors. After dividing the page in half vertically, on the left half of the page put three different divs each taking up a third of the vertical space, color them red, orange, yellow. On the right half of the page create two different divs each taking half of the vertical space, color them green, blue. Within each div put a paragraph in black letters with the color and location of the div it is located in, such as "Left hand side, top div. This background is red." and so on for top, middle, and bottom on the left and top and bottom on the right, with the colors red, orange, yellow, green and blue respectively as their backgrounds. Reply with the complete html and css style page in single html file that I can copy and paste into a single file, when loaded it should show the layout I have described." When I try this right now I get a successful page.

Here is the code it generated: https://paste.sh/drpp3Kci#FoQHQT6YtMknnI-y2N7beqAd which you can see created this https://ibb.co/NFwnvWL

It previously failed at similar layout questions.

If ChatGPT had this "Olympiad-level" geometric reasoning it would be amazing for producing html pages that include complex layout. It currently doesn't always succeed at this.

Layout is the area that I've found ChatGPT is weakest in for generating code, which I chalk up to the fact that it doesn't have a vision subsystem, so it can't really picture what it's doing.

sidcool|2 years ago

And it's been open sourced. The code and model both!

ipsum2|2 years ago

Do normal proofs for Olympiad level geometry questions require 150+ steps like the one that was generated here? It seems inelegant.

lacker|2 years ago

When a human solves these problems, you might state a "step" in natural language. Like, "apply a polar transformation to this diagram". But that same single step, if you translate into a low-level DSL, could be dozens of different steps.

For an example, check out this formalization of a solution to a non-geometry IMO problem:

https://gist.github.com/mlyean/b4cc46cf6b0705c1226511a2b404d...

Roughly the same solution is written here in human language:

https://artofproblemsolving.com/wiki/index.php/1972_IMO_Prob...

It's not precisely clear what counts as a "step" but to me this seems like roughly 50 steps in the low-level formalized proof, and 5 steps in the human-language proof.

So, my conclusion is just that I wouldn't necessarily say that a 150-step automated proof was inelegant. It can also be that the language used to express these proofs fills in more of the details.

It's like the difference between counting lines of compiled code and lines of a high-level language; the step count really depends on the level at which you express the proof.
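The compiled-code analogy can be made literal with Python's `dis` module (a toy illustration only, unrelated to AlphaGeometry's actual DSL): one source statement, a single "step" to a human reader, expands into many bytecode instructions.

```python
import dis

# A single high-level "step": one assignment statement.
src = "a = b + c * d"

# Compiling (not executing) it reveals the low-level steps:
# name loads, a multiply, an add, a store, plus housekeeping.
code = compile(src, "<example>", "exec")
instructions = list(dis.get_instructions(code))

for ins in instructions:
    print(ins.opname)

# One source line becomes several bytecode instructions, much as
# one human-language proof step becomes many DSL or Lean steps.
print(len(instructions))
```

The exact opcodes vary by Python version, but the point holds at every version: the low-level count is several times the source-level count.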

xbmcuser|2 years ago

Would something like this help AI/computers see the world better for robotics, self-driving, etc.? I am not a programmer or in this field, but I don't know why, intuitively, reading the headline alone made me feel that the world of robotics and self-driving, i.e. where AI interacts with the physical world, is about to change.

treprinum|2 years ago

What happens to science when the most talented kids won't be able to compete in ~2 years tops? Would our civilization reach a plateau or start a downward trajectory, as there will be no incentive to torture oneself to become the best in the world? Will it all fade away like chess once computers started beating grandmasters?

ironlake|2 years ago

Chess is immensely more popular since Deep Blue beat Kasparov.

hiAndrewQuinn|2 years ago

I've accepted that the human form, with its ~3 pounds of not especially optimized brainpower, is basically going to be relegated to the same status as demoscene hardware for anything that matters after this century.

That's cool by me, though. This bit of demoscene hardware experiences qualia, and that combination is weird and cool enough to make me want to push myself in new and weird directions. That's what play is in a way.

lacker|2 years ago

The incentive to compete in the IMO is that it's fun to do math contests, it's fun to win math contests, and (if you think that far ahead as a high schooler) it looks good on your resume. None of that incentive will go away if the computers get better at math contests.

sdenton4|2 years ago

We just need a Netflix series about a geometry prodigy with excellent fashion sense.

Agingcoder|2 years ago

I never studied math to become the best in the world !

jeanloolz|2 years ago

Chess never faded away. It actually has never been as big as it is today.

WalterSear|2 years ago

The same thing that happened when they allowed calculators in the classroom.

imranq|2 years ago

Unless you can figure out AI that is available everywhere, all the time (even in the most remote jungle with no power), there will always be value in making humans smarter

EvgeniyZh|2 years ago

I don't think chess faded away

gumballindie|2 years ago

I suppose we can now replace high schoolers.

dandanua|2 years ago

and their teachers

m3kw9|2 years ago

So create ai modules for every niche subject and have a OpenAI like function calling system to inference them?

kfarr|2 years ago

Yes but I would argue geometry is not niche it is the foundation of reasoning about the physical world not just math problems

WhitneyLand|2 years ago

It was interesting. One of the reviewers noted that one of the generated proofs didn’t seem that elegant in his opinion.

Given enough compute, I wonder how much this would be improved by having it find as many solutions as possible within the allotted time and proposing the simplest one.

jacobolus|2 years ago

The issue is that the computer-generated proofs operate at a very low level, step by step, like writing a program in assembly language without the use of coherent structure.

The human proof style instead chunks the parts of a solution into human-meaningful "lemmas" (helper theorems) and builds bodies of theory into well-defined and widely used abstractions like complex numbers or derivatives, with a corpus of accepted results.

Some human proofs of theorems also have a bit of this flavor of inscrutable lists of highly technical steps, especially the first time something is proven. Over time, the most important theorems are often recast in terms of a more suitable grounding theory, in which they can be proven with a few obvious statements or sometimes a clever trick or two.

spookie|2 years ago

Elegance would require more than this. Reasoning, for one.

JacobiX|2 years ago

The most surprising observation for me is that the well-known Wu's Method for proving geometric theorems solved 10 problems! AlphaGeometry solved 25 problems.

graycat|2 years ago

Given triangle ABC, by Euclidean construction, find D on AB and E on BC so that the lengths

AD = DE = EC.
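For what it's worth, this construction yields nicely to the brute-force coordinate approach discussed upthread. Below is a quick numeric sketch in plain Python (arbitrary example triangle, coarse grid search plus pattern-search refinement; nothing to do with AlphaGeometry's method, and it finds the point numerically rather than by compass and straightedge):

```python
import math

def solve(A, B, C, n=200):
    """Numerically find D on AB and E on BC with AD = DE = EC."""
    def lengths(t, s):
        # D = A + t*(B - A) lies on AB; E = B + s*(C - B) lies on BC.
        D = (A[0] + t * (B[0] - A[0]), A[1] + t * (B[1] - A[1]))
        E = (B[0] + s * (C[0] - B[0]), B[1] + s * (C[1] - B[1]))
        return math.dist(A, D), math.dist(D, E), math.dist(E, C)

    def err(t, s):
        AD, DE, EC = lengths(t, s)
        return (AD - DE) ** 2 + (DE - EC) ** 2

    # Coarse grid search over (t, s) in [0, 1]^2 ...
    _, t, s = min((err(i / n, j / n), i / n, j / n)
                  for i in range(n + 1) for j in range(n + 1))

    # ... then refine with a simple pattern search: try the eight
    # neighbouring moves, halve the step when none of them improve.
    step = 1.0 / n
    for _ in range(5000):
        if step < 1e-12:
            break
        moved = False
        for dt in (-step, 0.0, step):
            for ds in (-step, 0.0, step):
                if err(t + dt, s + ds) < err(t, s):
                    t, s = t + dt, s + ds
                    moved = True
        if not moved:
            step *= 0.5
    return t, s, lengths(t, s)

t, s, (AD, DE, EC) = solve((0, 0), (4, 0), (1, 3))
print(t, s, AD, DE, EC)  # AD, DE, EC come out (numerically) equal
```

This is exactly the "mechanical once you have coordinates" point from the top comment; of course it says nothing about how to produce the elegant synthetic construction the problem is really asking for.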

wigster|2 years ago

the hounds of tindalos will be sniffing around this

empath-nirvana|2 years ago

“What the I.M.O. is testing is very different from what creative mathematics looks like for the vast majority of mathematicians,” he said.

Not to pick on this guy, but this is ridiculous goal post shifting. It's just astounding what people will hand-wave away as not requiring intelligence.

mb7733|2 years ago

No, that sort of thing has been said about math competitions for a long time. It's not a new argument put forward as something against AI.

An analogy with software is that math competitions are like (very hard) leetcode.

There was an article posted on HN recently that is related: https://benexdict.io/p/math-team

drawnwren|2 years ago

Yeah, this would be akin to saying "What leetcode is testing is very different from what professional programming looks like".

It isn't untrue, but both are considered metrics by the community. (Whether rightfully or not seems irrelevant to ML).

adverbly|2 years ago

I guess it kind of depends on what a goal post should represent. If it represents an incremental goal then obviously you are correct.

I think the ultimate goal though is clearly that AI systems will be able to generate new high-value knowledge and proofs. Realistically, I think the final goal post has always been at that point.

nybsjytm|2 years ago

As said by others, that is an absolutely commonplace saying, albeit usually having nothing whatsoever to do with AI. See for instance https://terrytao.wordpress.com/career-advice/advice-on-mathe...

Also if you read about the structure of AlphaGeometry, I think it's very hard to maintain that it "requires intelligence." As AI stuff goes, it's pretty comprehensible and plain.

GPerson|2 years ago

I’m not sure if it’s goal post shifting or not, but it is a true statement.

jksk61|2 years ago

Bruh! With 100 million elements in the dataset, it is basically printing out previously known problems during testing.

1024core|2 years ago

From the Nature article:

> Note that the performance of GPT-4 performance on IMO problems can also be contaminated by public solutions in its training data.

Doesn't anybody proofread in Nature?

chefandy|2 years ago

Not a subject matter expert, so forgive me if the question is unintentionally obtuse, but it seems like a reasonable statement. They seem to be implying that problems with existing public solutions wouldn't be a good indicator of performance on novel problems, likely a more important evaluative measure than how fast it can replicate an answer it has already seen. Since you couldn't know whether that solution was in its training data, you couldn't know whether it was doing that or doing a more organic series of problem-solving steps, thereby contaminating the evaluation. What's the problem with saying that?

dr_kiszonka|2 years ago

Weird. I can't find the sentence you are quoting in the paper.

aldousd666|2 years ago

This is fascinating and inspiring progress! What's crazy to me is how many people are looking to put stakes in the ground in front of this, claiming breathlessly before anyone can speak that this isn't real AI progress, or that this isn't actually doing it how humans supposedly do, and implying that therefore it's not a big deal. I'm sorry to break it to you, folks: this is just another signpost moment. One more event to mark our place in history as we sail on. And when we look back on it, there aren't going to be all the disclaimers people are trying to throw down now. We had ImageNet and Deep Blue, we had AlphaZero, AlphaGo, GPT-4, AlphaFold, and now we have AlphaGeometry. As DJ Khaled likes to point out, this is.. Another one

ChatGTP|2 years ago

This reads like another "edge lord" or "karma farming" comment. They seem to be more and more frequent here lately.

The point of the forum is that people who are interested in these things can discuss them together openly. People are having discussions about the way they interpret the paper and what they think it means. You seem to be telling those people that's pointless and we should just accept this as a breakthrough because you told us so. Which I think goes against the spirit of this forum and the ability for people to look at the world with an objective lens.

If this work is as good as you say it is, and it very well might be, it will stand on its own no matter what people say about it. Time will tell if this was meaningful or not.

ChatGTP|2 years ago

“Our AI system”

It’s our maths. No one else owns it?

Funny game the AI field is playing now. One-upmanship seems to be the aim of the game.

drdeca|2 years ago

The AI system that they produced.

“Our” doesn’t always imply ownership/title. I do not own my dad.