item 42097240

agucova|1 year ago

For some context on why this is important: this benchmark was designed to be extremely challenging for LLMs, with problems requiring several hours or days of work by expert mathematicians. Currently, LLMs solve 2% of problems in the set (which is kept private to prevent contamination).

They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields Medalists and IMO question writers):

> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”

Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...

light_hue_1|1 year ago

If I were going to bet, I would bet yes, they will reach above 85% performance.

The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs are much better at benchmarks created before they were trained than after. Countless papers show significant leakage between training and test sets for these models.

This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.

In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.

There's a lot of behind-the-scenes talk about unethical teams that collect data which doesn't technically overlap test sets but is extremely close. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.

nerdponx|1 year ago

Could you run the benchmark by bootstrapping (averaging over repeated subsampling) instead of reporting a straight-across performance score, and regain some leakage resistance that way? That would also give a better simulation of "out of sample" data, at least for a little while.
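A minimal sketch of the subsampling idea being proposed (function name and parameters are hypothetical, and the per-problem results are made-up illustrative data):

```python
import random

def bootstrap_score(results, n_rounds=1000, sample_frac=0.5, seed=0):
    """Estimate benchmark performance by averaging scores over repeated
    random subsamples of per-problem results (1 = solved, 0 = not solved),
    rather than reporting one fixed pass rate over the whole set."""
    rng = random.Random(seed)
    k = max(1, int(len(results) * sample_frac))
    scores = []
    for _ in range(n_rounds):
        sample = rng.choices(results, k=k)  # resample with replacement
        scores.append(sum(sample) / k)
    return sum(scores) / n_rounds

# Illustrative: a model that solved 2 of 100 problems
results = [1] * 2 + [0] * 98
print(round(bootstrap_score(results), 3))  # close to the raw 0.02 rate
```

The resampled average converges to the plain pass rate; what changes is that no single fixed subset becomes the target to tune against.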

agucova|1 year ago

This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem.

sebzim4500|1 year ago

>Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

Or they know the ancient technique of training on the test set. I know most of the questions are kept secret, but they are being regularly sent over the API to every LLM provider.

tux3|1 year ago

The answers aren't sent, though, so it would have to be a very deliberate effort to fish those questions out of the API chatter and then find the right domain expert with 4-10 hours to spend on cracking each one.

Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.

andrepd|1 year ago

Of course lol. How come e.g. o1 scores so high on these reasoning and math and IMO benchmarks and then fails every simple question I ask of it? The answer is training on the test set.

TeMPOraL|1 year ago

> Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

Why surprisingly?

2028 is about four years away, twice as long as capable LLMs have existed to date. By "capable" here I mean capable enough to even remotely consider the idea of LLMs solving such tasks in the first place. ChatGPT/GPT-3.5 isn't even 2 years old!

4 years is a lot of time. It's kind of silly to assume LLM capabilities have already plateaued.

ekianjo|1 year ago

Sure but it is also reasonable to consider that the pace of progress is not always exponential or even linear at best. Diminishing returns are a thing and we already know that a 405b model is not 5 times better than a 70b model.

ak_111|1 year ago

I think because if you end up having an AI that is as capable as the graduate students Tao is used to dealing with (so, basically, potential Fields Medalists), then you are basically betting that there's an 85% chance something like AGI (at least in its consequences) will be here in 3 years. It's possible, but 85%?

andrepd|1 year ago

People really love pointing at the first part of a logistic curve and go "behold! an exponential".

slashdave|1 year ago

Except LLM capabilities have already peaked. Scaling has rapidly diminishing returns.

equestria|1 year ago

Market size matters. There's a whopping total of 71 traders in that market.

ak_111|1 year ago

Would be interesting to know which model solved the 2% and what is the nature of the problems it solved.

llm_trw|1 year ago

These benchmarks are entirely pointless.

The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter making a training regimen for a fighter jet.

What matters is the data structure that underlies the problem space: graph traversal. First, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges from a set of rules.

Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph. Until they can consistently manage a number of steps greater than what is contained in any math proof in the validation data, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information.
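The first two tasks the comment lists can be made concrete with a plain BFS sketch (the edge list below is an invented example; on an unweighted graph, BFS finds a path that is also a shortest path):

```python
from collections import deque

def shortest_path(edges, start, goal):
    """BFS shortest path in an undirected, unweighted graph
    given as a full itinerary of (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path exists

# A 5-edge chain: the kind of multi-hop query the comment claims LLMs fail
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
print(shortest_path(edges, "A", "F"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

Classical search solves this trivially at any depth; the comment's point is that an LLM answering from the same edge list in-context degrades as the hop count grows.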

nopinsight|1 year ago

> Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph.

This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?

youoy|1 year ago

Not to mention that math proofs are more than graph traversals... (although maybe simple math problems are not). There is the problem of extracting the semantics of mathematical formalisms. This is easier in day-to-day language; I don't know to what extent LLMs can also extract the semantics and relations of different mathematical abstractions.

benchmarkist|1 year ago

It will be a useful benchmark to validate claims by people like Sam Altman about having achieved AGI.

dr_dshiv|1 year ago

> they’re merely regurgitating memorized information

Source?