(no title)
agucova|1 year ago
They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO question writers):
> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.
[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...
light_hue_1|1 year ago
The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs are much better at benchmarks created before they were trained than at ones created after. There are countless papers showing significant leakage between training and test sets for these models.
This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.
In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.
There's a lot of behind the scenes talk about unethical teams that collect data which doesn't technically overlap test sets, but is extremely close. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.
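For what it's worth, the usual way papers measure this kind of contamination is a long n-gram overlap check between the training corpus and the benchmark's test items. A minimal sketch, assuming you have both corpora in hand (the training_docs/test_items names and the 13-gram threshold are just illustrative; real decontamination pipelines normalize text far more aggressively):

    def ngrams(text, n=13):
        # Lowercase word n-grams of a document, as a set.
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def contamination_report(training_docs, test_items, n=13):
        # Flag test items sharing any long n-gram with the training corpus;
        # a shared 13-gram is a common heuristic for "probably seen in training".
        train_grams = set()
        for doc in training_docs:
            train_grams |= ngrams(doc, n)
        flagged = []
        for item in test_items:
            overlap = ngrams(item, n) & train_grams
            if overlap:
                flagged.append((item, len(overlap)))
        return flagged

    # Toy example: the test item is a near-duplicate of a training document.
    train = ["The quick brown fox jumps over the lazy dog near the quiet river bank today"]
    test = ["quick brown fox jumps over the lazy dog near the quiet river bank today again"]
    print(contamination_report(train, test))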
sebzim4500|1 year ago
Or they know the ancient technique of training on the test set. I know most of the questions are kept secret, but they are being regularly sent over the API to every LLM provider.
tux3|1 year ago
Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.
TeMPOraL|1 year ago
Why surprisingly?
2028 is about twice as far away as capable LLMs have existed to date. By "capable" here I mean capable enough to even remotely consider the idea of LLMs solving such tasks in the first place. ChatGPT/GPT-3.5 isn't even 2 years old!
4 years is a lot of time. It's kind of silly to assume LLM capabilities have already plateaued.
llm_trw|1 year ago
The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter making a training regimen for a fighter jet.
What matters are the data structures that underlie the problem space: graph traversal. First, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges based on a set of rules.
Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph. Until they can consistently manage a number of steps greater than what is contained in any math proof in the validation data, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information.
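A rough sketch of the kind of test I mean (the graph size, node labels, and the call_llm function are placeholders for whatever model API you use; the point is that the full edge list is in the prompt and the answer is graded programmatically against BFS):

    import random
    from collections import deque

    def random_graph(n_nodes=30, n_edges=60, seed=0):
        # Build a random undirected graph as an adjacency dict.
        rng = random.Random(seed)
        adj = {i: set() for i in range(n_nodes)}
        while sum(len(v) for v in adj.values()) // 2 < n_edges:
            a, b = rng.sample(range(n_nodes), 2)
            adj[a].add(b)
            adj[b].add(a)
        return adj

    def shortest_path(adj, start, goal):
        # BFS shortest path, used as ground truth for grading.
        prev = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for nxt in adj[node]:
                if nxt not in prev:
                    prev[nxt] = node
                    queue.append(nxt)
        return None

    def make_prompt(adj, start, goal):
        # Serialize the full edge list into the prompt.
        edges = sorted({tuple(sorted((a, b))) for a in adj for b in adj[a]})
        edge_text = ", ".join(f"{a}-{b}" for a, b in edges)
        return (f"Here is an undirected graph given as edges: {edge_text}. "
                f"List a path from node {start} to node {goal} as numbers separated by spaces.")

    def is_valid_path(adj, path, start, goal):
        # Grade the model's answer: consecutive nodes must share an edge.
        return (len(path) >= 2 and path[0] == start and path[-1] == goal
                and all(b in adj[a] for a, b in zip(path, path[1:])))

    adj = random_graph()
    start, goal = 0, 17
    print("ground-truth shortest path:", shortest_path(adj, start, goal))
    # answer = call_llm(make_prompt(adj, start, goal))   # hypothetical model call
    # path = [int(x) for x in answer.split()]
    # print("valid path from the model:", is_valid_path(adj, path, start, goal))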
nopinsight|1 year ago
This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?
dr_dshiv|1 year ago
Source?