
AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds

43 points | Cynddl | 3 months ago | gizmodo.com

17 comments


lispisok | 3 months ago

There is way too much money being thrown at AI for people not to game/cheat the benchmarks.

vivzkestrel | 3 months ago

I am amazed that not a single pro-AI person on HN has anything to say, or even speculates, about this. This is such a serious issue.

simianwords | 3 months ago

This is a very poor article. What I understood is that they take one particular benchmark that tests grade school level math. This benchmark apparently claims to test the ability to reason through math problems.

They agree that the benchmarks show that the LLMs can solve such questions and models are getting better. But their main point is that this does not prove that the model is reasoning.

But so what? It may not reason the way humans do, but it is pretty damn close. The mechanics are the same: recursively generate a prompt that terminates in an answer-generating prompt.

They don’t like the implication that the model “reasons through” the problem. But it’s just semantics at this point. For me, and for most others, getting the final answer is what matters. And it largely accomplishes this task.

I don’t buy that the model can’t reason through a problem - have you ever asked a model for its explanation? It does genuinely explain how it got to the solution. At this point, who the hell cares what “reasoning” means if it

1. Gets me the right answer

2. Reasonably explains how it did it
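The "recursively generate a prompt that terminates in an answer" mechanics described above can be sketched as a toy loop. Note that `next_token` here is a scripted stand-in for a real model's next-token prediction, purely for illustration:

```python
def generate(prompt, next_token, stop="Answer:"):
    # Autoregressive loop: append each predicted token to the growing
    # text and feed it back in, until an answer-bearing token appears.
    text = prompt
    while True:
        tok = next_token(text)
        text += " " + tok
        if tok.startswith(stop):
            return text

# Hypothetical stand-in for a model: a fixed script of "reasoning" tokens.
steps = iter(["Let's", "reason:", "2+2=4,", "Answer: 4"])
out = generate("Q: What is 2+2?", lambda _: next(steps))
print(out)  # Q: What is 2+2? Let's reason: 2+2=4, Answer: 4
```

Whether this loop constitutes "reasoning" is exactly the semantic dispute in the thread; the loop itself is just repeated next-token generation.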

ulfw | 3 months ago

Because the pro-AI people are busy trying to sell whatever they have before the bubble bursts.

Khaine | 3 months ago

I'm shocked, shocked, that AI is optimised to pass bogus benchmarks.

Just like how GPUs were optimised to pass synthetic benchmarks.

simianwords | 3 months ago

>When researchers tested the same performance on a new set of benchmark questions, they noticed that models experienced “significant performance drops.”

This is very misleading, because the generalisation ability of LLMs is very high. They don’t just memorise problems - that’s nonsense.

At high school level maths you genuinely can’t get gpt-5 thinking to make a single mistake. Not possible at all, unless you give it some convoluted, ambiguous prompt that no human can understand either. If you assume I’m correct, how is gpt just memorising?

In fact even undergraduate level mathematics is quite simple for gpt-5 thinking.

IMO gold was won... by what? Memorising solutions?

I challenge people to find ONE example that gpt-5 thinking gets wrong in high school or undergrad level maths. I could not achieve it. You must allow all tools though.

YeGoblynQueenne | 3 months ago

The best performance on GSM8K is currently 0.973, so less than perfect [1]. Given that GSM8K is a grade school math question data set and the leading LLMs still don't get all of its answers correct, it's safe to assume they won't get all high school questions right either, since those are going to be harder than grade school questions. This means there has got to be at least one example that GPT-5, as well as every other LLM, fails on [2].

If you don't think that's the case, I think it's up to you to show that it's not.

___________________

[1] GSM8K leaderboard: https://llm-stats.com/benchmarks/gsm8k

[2] This is regardless of what GSM8K or any other benchmark is measuring.
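The "at least one failing example" claim above follows from simple arithmetic. A quick sanity check, assuming the commonly cited GSM8K test-split size of 1,319 questions:

```python
# Back-of-the-envelope check: at 97.3% accuracy on GSM8K's test split,
# roughly how many questions does the best model still get wrong?
accuracy = 0.973
test_questions = 1319  # commonly cited size of the GSM8K test split
wrong = round((1 - accuracy) * test_questions)
print(wrong)  # ~36 questions still answered incorrectly
```

So even the leaderboard-topping score leaves dozens of grade-school questions unanswered correctly, which is the basis for expecting at least one high-school failure too.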

geoduck14 | 3 months ago

>At high school level maths you genuinely can’t get gpt-5 thinking to make a single mistake. Not possible at all.

If you give an LLM an incomplete question, it will guess at an answer. They don't know what they don't know, and they are trained to guess.

autop0ietic | 3 months ago

I would think GPT5 is great at high school level math, but what high school level math problems are not in the training data?

I think the problem is that GPT5 may not be "memorising", but that doesn't automatically mean it is "reasoning" either. These are human attributes that we are trying to apply to machines, and it just causes confusion.

callmesnek | 3 months ago

"You must allow all tools though"