
Study identifies weaknesses in how AI systems are evaluated

416 points| pseudolus | 3 months ago |oii.ox.ac.uk

Paper: https://openreview.net/pdf?id=mdA5lVvNcU

Related: https://www.theregister.com/2025/11/07/measuring_ai_models_h...

192 comments

[+] bubblelicious|3 months ago|reply
I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are in a huge rush and don’t want benchmarking to become their whole job. Even with the right background, you could do benchmarks full time and they would still be a mess.

Product testing (with traditional A/B tests) is kind of the best bet, since you can measure what you care about _directly_ and at scale.

I would say there is of course “benchmarketing”, but generally people sincerely want to make good benchmarks; it’s just hard, or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use.

[+] bjackman|3 months ago|reply
For what it's worth, I work on platforms infra at a hyperscaler and benchmarks are a complete fucking joke in my field too lol.

Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:

- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")

- the benchmarks are almost never predictive of the performance of real world workloads anyway

- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.
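
To make the statistics point concrete, here's a toy contrast (synthetic numbers, stdlib only; everything here is illustrative, not from any real benchmark) between the naive delta-of-means and even a crude bootstrap interval:

```python
import random

random.seed(0)

# Two noisy benchmark samples: B is only slightly "faster" than A.
a = [random.gauss(100.0, 10.0) for _ in range(30)]
b = [random.gauss(98.0, 10.0) for _ in range(30)]

def mean(xs):
    return sum(xs) / len(xs)

# The naive analysis: report the delta of means and call it a day.
naive_delta = mean(a) - mean(b)

# A basic bootstrap: resample both groups and see how much the delta
# moves. If the 95% interval straddles zero, the "win" may be noise.
deltas = []
for _ in range(5000):
    ra = [random.choice(a) for _ in a]
    rb = [random.choice(b) for _ in b]
    deltas.append(mean(ra) - mean(rb))
deltas.sort()
lo, hi = deltas[int(0.025 * len(deltas))], deltas[int(0.975 * len(deltas))]

print(f"naive delta: {naive_delta:.2f}")
print(f"bootstrap 95% CI for the delta: [{lo:.2f}, {hi:.2f}]")
```

Even this crude check is more honest than a bare mean delta, and it's maybe twenty lines.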

AND this is a field where the economic incentives for accurate predictions are enormous.

In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.

Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!

[+] ACCount37|3 months ago|reply
A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.

Human raters are exploitable, and you never know whether B has a genuine performance advantage over A or just found a meat exploit by accident.

It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.

[+] scuff3d|3 months ago|reply
The big problem is that tech companies and journalists aren't transparent about this. They tout benchmark numbers constantly, as if they were an objective measure of capabilities.
[+] liqilin1567|3 months ago|reply
> Brittle performance – A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem

This finding really shocked me

[+] bofadeez|3 months ago|reply
Has your lab tried using any of the newer causal inference–style evaluation methods? Things like interventional or counterfactual benchmarking, or causal graphs to tease apart real reasoning gains from data or scale effects. Wondering if that’s something you’ve looked into yet, or if it’s still too experimental for practical benchmarking work.
[+] andy99|3 months ago|reply
I also work in LLM evaluation. My cynical take is that nobody is really using LLMs for stuff, and so benchmarks are mostly just made-up tasks (coding is probably the exception). If we had real, specific use cases it would be easier to benchmark and know if one model is better, but it’s mostly all hypothetical.

The more generous take is that you can’t benchmark advanced intelligence very well, whether LLM or person. We don’t have good procedures for assessing a person’s fitness for purpose, e.g. for a job, and certainly not standardized question sets. Why would we expect to be able to do this with AI?

I think both of these takes are present to some extent in reality.

[+] jimmySixDOF|3 months ago|reply
Terminal Bench 2.0 just dropped, and a big success factor they stress is the hand-crafted, PhD-level rollout tests: they picked approximately 80 out of 120 submissions, with the incentive that anyone who contributed 3 would be listed as a paper author. This produced high-quality participation, equivalent to the foundation labs' proprietary agentic RL data, but it's FOSS.
[+] j45|3 months ago|reply
What gets measured, gets managed and improved, though.
[+] instagraham|3 months ago|reply
I've written about Humanity's Last Exam, which crowdsources tough questions for AI models from domain experts around the world.

https://www.happiesthealth.com/articles/future-of-health/hum...

It's a shifting goalpost, but one of the things that struck me was how some questions could still be trivial for a fairly qualified human (a doctor in this case) but difficult for an AI model. Reasoning, visual or logic, is built on a set of assumptions that are better gained through IRL experience than crawling datasets and matching answers.

This leads me to believe that much of the future of training AI models will lie in exposing them to "meatspace" and annotating their inferences, much like how we train a child. This is a long, long process, and one that is already underway at scale. But it's what might give us emergent intelligences rather than just a basket of competing yet somehow-magic thesauruses.

[+] sroussey|3 months ago|reply
Mercor is doing nine-digit annual revenue doing just that. Micro1 and others, too.
[+] jstummbillig|3 months ago|reply
Benchmarks are like SAT scores. Can they guarantee you'll be great at your future job? No, but we are still roughly okay with what they signify. Clearly LLMs are getting better in meaningful ways, and benchmarks correlate with that to some extent.
[+] zeroonetwothree|3 months ago|reply
There’s no a priori reason to expect a test designed to test human academic performance would be a good one to test LLM job performance.

For example a test of “multiply 1765x9392” would have some correlation with human intelligence but it wouldn’t make sense to apply it to computers.

[+] SV_BubbleTime|3 months ago|reply
Isn’t this like grading art critics?

We took objective computers, and made them generate subjective results. Isn’t this a problem that we already know there’s no solution to?

That grading subjectivity is just subjective itself.

[+] pessimizer|3 months ago|reply
People often use "clearly" or "obviously" to elide the subject that is under discussion. People are saying that they do not think that it is clear that LLMs are getting better in meaningful ways, and they are saying that the benchmarks are problematic. "Clearly" isn't a counterargument.
[+] calpaterson|3 months ago|reply
Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.

I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.

And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.

I dunno what to do about it and am tending to just pick Gemini as a result.

[+] ACCount37|3 months ago|reply
Ratings on LMArena are too easily gamed.

Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.

A lot of the sycophancy mess that seeps from this generation of LLM stems from reckless tuning based on human feedback. Tuning for good LMArena performance has similar effects - and not at all by a coincidence.

[+] botro|3 months ago|reply
This is something I've struggled with for my site. I made https://aimodelreview.com/ to compare the outputs of LLMs over a variety of prompts and categories, allowing a side-by-side comparison between them. I ran each prompt 4 times for each model, with the different temperature values available as toggles.

My thinking was to just make the responses available to users and let them see how models perform. But from some feedback, turns out users don't want to have to evaluate the answers and would rather see a leaderboard and rankings.

The scalable solution to that would be LLM as judge that some benchmarks already use, but that just feels wrong to me.

LM Arena tries to solve this with a crowd-sourced approach, but I think the right method would need domain-expert human reviewers (Wirecutter vs. IMDb, so to speak), and that is expensive to pull off.

[+] andai|3 months ago|reply
>when we get a prompt working reliably on one model, we often have trouble porting it to another LLM

I saw a study where a prompt massively boosted one model's performance on a task, but significantly reduced another popular model's performance on the same task.

[+] diamond559|3 months ago|reply
I'd rather quit than be forced to beta test idiocracy. What's your company, so we can all avoid it?
[+] HPsquared|3 months ago|reply
Psychometric testing of humans has a lot of difficulties, too. It's hard to measure some things.
[+] 3abiton|3 months ago|reply
> Comparing models, or even different versions of the same model, is a pseudo-scientific mess.

Reminder that in most cases it's impossible to know if there is cross-contamination from the test sets of public benchmarks, because most LLMs are not truly open source: we can't replicate them. So arguably it's worse in some cases, pretty much fraud if you account for the VC money pouring in. This is even more evident in unknown models from lesser-known institutes, such as those from the UAE.

[+] shanev|3 months ago|reply
This is solvable at the level of an individual developer. Write your own benchmark for code problems that you've solved. Verify tests pass and that it satisfies your metrics like tok/s and TTFT. Create a harness that works with API keys or local models (if you're going that route).
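
A minimal harness skeleton for that (the streaming "model" here is a stand-in generator so the sketch runs by itself; swap in your real API client or local model behind the same interface):

```python
import time

def run_case(generate, prompt):
    """Time a single benchmark case: TTFT and tokens/sec.

    `generate` is anything that yields tokens for a prompt, e.g. a
    streaming API wrapper or a local model."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "tok_per_s": n_tokens / total}

# Stand-in "model" so the harness is runnable as-is; replace with a
# real streaming client (API key or local weights) in practice.
def fake_model(prompt):
    for tok in ["def", " add", "(a", ", b", "):", " return", " a", "+b"]:
        time.sleep(0.001)  # simulate per-token latency
        yield tok

stats = run_case(fake_model, "write an add function")
print(stats)
```

Correctness checks (do the generated tests pass?) then sit on top of this as a pass/fail per case, with the latency metrics recorded alongside.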
[+] bee_rider|3 months ago|reply
> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."

When models figure out how to exploit an effect that every clever college student does, that should count as a win. That's a much more human-like reasoning ability than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.

[+] layer8|3 months ago|reply
I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.

However:

> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.

Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.

[+] jvanderbot|3 months ago|reply
Absolutely not.

College exam takers use those tricks because they are on a time limit and are gaming the system. It's clever and wink wink nudge nudge ok everyone does it. But it's one tiny signal in a huge spectrum of things we use to evaluate people.

Instead, these metrics are gamed and presented as the entire multi-spectral signal of competence for LLMs, because it is literally impossible to say that success in one domain would translate the way it might with a good hire.

What I want is something I don't have to guard against gaming. Something conscientious and capable, like my coworkers. Until then it's Google version 2 married to IntelliSense, and I'm not letting it do anything by itself.

[+] novok|3 months ago|reply
IMO the calculator problem goes away with tool use, or with NN architectures that add a calculator equivalent as one of the potential 'experts' or similar. It won't be much of a trope for much longer.
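
The tool-use version is easy to sketch: instead of doing arithmetic in-weights, the model emits a structured call and the runtime computes it exactly (the dispatch is hypothetical; the evaluator below is a real, safe arithmetic checker, borrowing the multiplication example from upthread):

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str):
    """Safely evaluate a +-*/ arithmetic expression: the 'calculator tool'.

    Parses to an AST and walks it, so no arbitrary code is executed."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

# A model's (hypothetical) tool call: exact regardless of operand size.
print(calc("1765 * 9392"))  # → 16576880
```

The benchmark question then shifts from "can the model multiply?" to "does the model know when to reach for the tool?", which is arguably the more interesting thing to measure.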
[+] joe_the_user|3 months ago|reply
> The point of these LLMs is to do things that computers were bad at.

That's a good point imo but we achieved this stuff by at least 2022 when ChatGPT was released. The thing about these giant black boxes is that they also fail to do things that directly human-written software ("computers") does easily. The inability to print text onto generated images or do general arithmetic is important. And sure, some of these limits look like "limits of humans". But it is important to avoid jumping from "they do this human-thing" to "they're like humans".

[+] 6510|3 months ago|reply
I don't claim to know anything, but I thought tool usage was a major sign of intelligence. For example, floats are a wonderful technology, but people use them as if chainsaws were great for cutting bread and butter. We now have entire languages that can't do basic arithmetic. I thought it was alarming: people, it can't compute like this! Now we have language models (those are still computers), so why can't we just give them... you know... calculators? Arguably the best thing their universe has to offer.

edit: I forgot my point: calculating big numbers is not a real world problem anyone has.

[+] nradov|3 months ago|reply
LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.
[+] Forgeties79|3 months ago|reply
>the point of these LLMs is to do things that computers were bad at.

The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.

I’m a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLMs perform in the ways most people actually use them. Maybe I’m off base here, though.

[+] riskable|3 months ago|reply
We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.

Someone want to start? I've got a Yjs/CRDT collaborative editing bug that took a week and a half of attempts with Claude Code (Sonnet 4.5), GPT-5-codex (medium), and GLM-4.6 to figure out. Even then they didn't really get it... just came up with a successful workaround (which is good enough for me, but still...).

Aside: You know what really moved the progress bar on finding and fixing the bug? When I had a moment of inspiration and made the frontend send all its logs to the backend so the AIs could see what was actually happening on the frontend (near real-time). Really, I was just getting sick of manual testing and pasting the console output into the chat (LOL). Laziness FTW!

I have the Google Chrome Dev Tools MCP but for some reason it doesn't work as well :shrug:

[+] SkyPuncher|3 months ago|reply
Benchmarks are nothing more than highly contextual specs (in traditional code). They demonstrate your code works in a certain way in certain use cases, but they do not prove your code works as expected in all use cases.
[+] pahae|3 months ago|reply
I wish the big providers would offer some sort of trial period where you can evaluate models in a _realistic_ setting yourself (i.e. CLI tools or IDE integrations). I wouldn't even mind strict limits -- just give me two hours or so of usage and I'd already be happy. Seriously.

My use-case is probably pretty far from the usual tasks: I'm currently implementing a full observability platform based on VictoriaMetrics / VictoriaLogs + Grafana. It's quite elaborate and has practically no overlap with the usual/cloud solutions you find out there. For example, it uses an authenticated query stack: I use the Grafana OAuth token to authenticate queries by injecting matchers via prom-label-proxy and forward that to promxy for fan-out to different datasources (using the label filter to only query some datasources). The IaC stuff is also not mainstream, as I'm not using any of the big cloud providers, but the provider I use nonetheless has a Terraform provider.

As you can imagine, there's probably not much training data for most of this, so the quality of the responses varies widely. From my experience so far, Claude (Sonnet 4.5) does a _much_ better job than GPT-5 (Codex or normal) with the day-to-day tasks. Stuff like keeping documentation up to date, spotting inconsistencies, helping me find blind spots in the alerting rules, etc. It also seems to do better working with provided documentation / links.

I'd been using Claude for a couple of weeks but recently switched to Codex after my subscription to Claude ran out. I was really curious after reading a lot of good things about it, but I gotta say, so far I'm not impressed. Compared to Claude it gives wrong answers much more frequently (at least in this domain), and the results it produces take much more effort to clean up. Probably to the point where I could just invest the time myself. It might be that I don't yet know how to correctly prompt GPT, but given the same prompt, Claude does a better job 90% of the time.

Anyway, I guess this is my long-winded way of saying that the quality of responses "off the beaten track" varies widely, and it's worth testing several models yourself. Especially if your work is not 70+% coding. Even then, I guess many benchmarks have ceased being useful by now?

[+] lysace|3 months ago|reply
Tech companies/bloggers/press/etc are perpetually bad at benchmarks. For browsers they kept pushing simplistic javascript-centric benchmarks even when it was clear for at least 15 years that layout/paint/network/etc were the dominant bottlenecks in real-world usage.

It's primarily marketing-driven. I think the technical parts of companies need to attempt to own this more.

It gets really weird when engineering priorities shift because of these mostly irrelevant benchmarks.

[+] proc0|3 months ago|reply
This wasn't that hard to see.

> Our systematic review of 445 benchmarks reveals prevalent gaps that undermine the construct validity needed to accurately measure targeted phenomena

Intelligence has an element of creativity, and as such the true measurement would be on metrics related to novelty, meaning tasks that have very little resemblance to any other existing task. Otherwise it's hard to parse out whether it's solving problems based on pattern recognition instead of actual reasoning and understanding. In other words, "memorizing" 1000 of the same type of problem, and solving #1001 of that type is not as impressive as solving a novel problem that has never been seen before.

Of course this presents challenges to creating the tests because you have to avoid however many petabytes of training data these systems are trained with. That's where some of the illusion of intelligence arises from (illusion not because it's artificial, since there's no reason to think the brain algorithms cannot be recreated in software).

[+] doctorpangloss|3 months ago|reply
The problem with the LLM benchmarks is that if you see one that shows high performance by something that isn’t from Anthropic, Google or OpenAI, you don’t believe it, even if it were “true.” In that sense, benchmarks are a holistic social experience in this domain, less a scientific endeavour.
[+] lielvilla|3 months ago|reply
I’m working a lot with TTS (text-to-speech), and it’s also a total Wild West, even worse than LLMs in some ways. The demos are always perfect, but once you generate hundreds of minutes you start seeing volume drift, pacing changes, random artifacts, and occasional mispronunciations that never show up in the curated clips.

The big difference from LLMs is that we don’t really have production-grade, standardized benchmarks for long-form TTS. We need things like volume stability across segments, speech-rate consistency, and pronunciation accuracy over a hard corpus.

I wrote up what this could look like here: https://lielvilla.com/blog/death-of-demo/
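
The volume-stability metric, at least, is mechanically checkable. A toy sketch (synthetic sine-burst "segments" standing in for decoded audio; the thresholds and segment data are made up for illustration):

```python
import math

def rms_db(samples):
    """RMS level of one audio segment, in dB relative to full scale 1.0."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))

def volume_drift(segments):
    """Worst deviation of any segment's level from the mean level, in dB."""
    levels = [rms_db(seg) for seg in segments]
    mean = sum(levels) / len(levels)
    return max(abs(l - mean) for l in levels)

# Synthetic stand-in segments: sine bursts at slightly drifting
# amplitudes; the last one drops noticeably, as in real TTS drift.
segs = [[a * math.sin(0.1 * i) for i in range(1000)]
        for a in (0.50, 0.48, 0.52, 0.30)]
print(f"volume drift: {volume_drift(segs):.1f} dB")
```

Speech-rate consistency would be the same pattern over per-segment words-per-minute instead of dB levels.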

[+] SurceBeats|3 months ago|reply
Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.
[+] wolttam|3 months ago|reply
I'd like to see some video generation benchmarks. For example, one that tested a model's ability to generate POV footage of a humanoid form carrying out typical household tasks

Even if it requires human evaluators at first, and even if the models completely suck at this task right now: it seems like the kind of task you'd want them to be good at, if you want these models to eventually carry out these tasks in embodied forms in the real world.

Just having the benchmark in the first place is what gives model makers something to optimize for.

[+] rumble_poster|3 months ago|reply
On top of this, we have complex infrastructure setups with different TPUs and GPUs across multiple data centers. The benchmarks tested when models were released might not reflect what we're actually using now. We need to evaluate models continuously, for example, https://isitnerfed.org/ does exactly that.
[+] naasking|3 months ago|reply
Clearly we need tests that check for effectiveness at applying general mathematical, logical and relational operations, eg. set theory, relational algebra, first and second order logic, type theory, the lambda calculus, recurrence and induction, etc., and the ability to use these to abstract over specifics and the ability to generalize.

The upside is that these can all be generated and checked synthetically so large data sets are possible, in both formal and natural languages.
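
For propositional logic, the generate-and-check loop fits in a page. A minimal sketch (toy formula shapes and sizes, chosen just to show the mechanics):

```python
import itertools
import random

random.seed(1)
VARS = ["p", "q", "r"]

def rand_formula(depth=2):
    """Random propositional formula as a nested tuple."""
    if depth == 0:
        return random.choice(VARS)
    op = random.choice(["and", "or", "not"])
    if op == "not":
        return ("not", rand_formula(depth - 1))
    return (op, rand_formula(depth - 1), rand_formula(depth - 1))

def evaluate(f, env):
    """Evaluate a formula under a truth assignment."""
    if isinstance(f, str):
        return env[f]
    if f[0] == "not":
        return not evaluate(f[1], env)
    a, b = evaluate(f[1], env), evaluate(f[2], env)
    return a and b if f[0] == "and" else a or b

def is_tautology(f):
    """Ground-truth label by exhaustive truth-table check."""
    return all(evaluate(f, dict(zip(VARS, vals)))
               for vals in itertools.product([True, False], repeat=len(VARS)))

# Each item is a (formula, label) pair; the label is derived
# mechanically, so the dataset scales with zero human annotation.
dataset = [(f, is_tautology(f)) for f in (rand_formula(3) for _ in range(100))]
print(sum(label for _, label in dataset), "tautologies out of", len(dataset))
```

The same recipe extends to first-order fragments or lambda-calculus reduction, as long as a mechanical checker exists; rendering the formulas into natural language gives the informal variants.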

[+] dehrmann|3 months ago|reply
This might explain the zeitgeist that new models feel same-ish, despite model developers saying they're getting spectacularly better.
[+] AbrahamParangi|3 months ago|reply
A test doesn't need to be objectively meaningful or rigorous in any sense in order to still be useful for comparative ranking.
[+] hobs|3 months ago|reply
Yes it does: it has to be meaningful or rigorous for the comparative ranking to be meaningful or rigorous, or else wtf are you doing? Say I have all the information on my side, but only these questions that you are showing the user. Who cares about that comparison?
[+] RA_Fisher|3 months ago|reply
For statistical AI models, we can use out-of-sample prediction error as an objective measure to compare models. What makes evaluating LLMs difficult is that comparisons are inextricable from utility, whereas statistical models have a pre-utility step in which out-of-sample prediction error can be shown to be minimized.
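
That out-of-sample comparison is simple to sketch (synthetic data and a plain closed-form OLS fit, just to make the idea concrete):

```python
import random

random.seed(0)

# Synthetic data from y = 2x + 1 plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in xs]
train, test = data[:150], data[150:]

def fit_line(pts):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_line(train)
# Held-out prediction error: an objective, utility-free model comparison.
mse = sum((y - (a * x + b)) ** 2 for x, y in test) / len(test)
print(f"held-out MSE: {mse:.3f}")
```

Two candidate models can be ranked directly on that held-out MSE; the LLM problem is that no analogous single loss captures what users actually want.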