Study identifies weaknesses in how AI systems are evaluated
416 points | pseudolus | 3 months ago | oii.ox.ac.uk
Related: https://www.theregister.com/2025/11/07/measuring_ai_models_h...
[+] [-] bubblelicious|3 months ago|reply
Product testing (with traditional A/B tests) is kind of the best bet, since you can measure what you care about _directly_ and at scale.
I would say there is of course “benchmarketing”, but generally people do sincerely want to make good benchmarks; it's just hard, or impossible. For many of these problems we're hitting capabilities where we don't even have a decent paradigm to measure them with.
[+] [-] bjackman|3 months ago|reply
Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:
- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples"; when I do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")
- the benchmarks are almost never predictive of the performance of real world workloads anyway
- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.
AND this is a field where the economic incentives for accurate predictions are enormous.
In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.
Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!
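The "delta in the mean of these two samples" habit bjackman describes is cheap to improve on even without a stats background. A minimal permutation-test sketch, with illustrative benchmark numbers (not from any real system):

```python
import random

# Two hypothetical benchmark score samples (illustrative numbers only).
a = [71.2, 69.8, 70.5, 72.1, 70.9, 71.5]
b = [72.0, 71.1, 72.8, 70.6, 72.3, 71.9]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(b) - mean(a)

# Permutation test: repeatedly shuffle the pooled scores and count how
# often a random split produces a mean difference at least as large as
# the one observed. That fraction is the p-value, no distributional
# assumptions required.
random.seed(0)
pooled = a + b
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[len(a):]) - mean(pooled[:len(a)])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / trials
print(f"delta = {observed:.2f}, p = {p_value:.3f}")
```

It doesn't fix the deeper problem (benchmarks not predicting prod), but it at least replaces "here's the delta" with a number whose derivation you can defend.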
[+] [-] ACCount37|3 months ago|reply
Human raters are exploitable, and you never know whether B has a genuine performance advantage over A or just found a meat exploit by accident.
It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.
[+] [-] liqilin1567|3 months ago|reply
This finding really shocked me
[+] [-] andy99|3 months ago|reply
The more generous take is that you can't benchmark advanced intelligence very well, whether LLM or person. We don't have good procedures for assessing a person's fitness for purpose, e.g. for a job, and certainly no standardized question sets. Why would we expect to be able to do this with AI?
I think both of these takes are present to some extent in reality.
[+] [-] instagraham|3 months ago|reply
https://www.happiesthealth.com/articles/future-of-health/hum...
It's a shifting goalpost, but one of the things that struck me was how some questions could still be trivial for a fairly qualified human (a doctor in this case) but difficult for an AI model. Reasoning, visual or logical, is built on a set of assumptions that are better gained through IRL experience than by crawling datasets and matching answers.
This leads me to believe that much of the future of training AI models will lie in exposing them to "meatspace" and annotating their inferences, much like how we train a child. This is a long, long process, and one that is already underway at scale. But it's what might give us emergent intelligences rather than just a basket of competing yet somehow-magic thesauruses.
[+] [-] zeroonetwothree|3 months ago|reply
For example a test of “multiply 1765x9392” would have some correlation with human intelligence but it wouldn’t make sense to apply it to computers.
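For what it's worth, the example makes the point concretely: the item discriminates among humans but not among computers, because exact integer arithmetic is trivial for conventional software.

```python
# Exact integer arithmetic: trivial for ordinary software, so a
# multiplication item carries no signal when the test-taker is a computer.
product = 1765 * 9392
print(product)  # 16576880
```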
[+] [-] SV_BubbleTime|3 months ago|reply
We took objective computers, and made them generate subjective results. Isn’t this a problem that we already know there’s no solution to?
And grading subjectivity is itself subjective.
[+] [-] calpaterson|3 months ago|reply
I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.
And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.
I dunno what to do about it and am tending to just pick Gemini as a result.
[+] [-] ACCount37|3 months ago|reply
Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps from this generation of LLMs stems from reckless tuning on human feedback. Tuning for good LMArena performance has similar effects - and not at all by coincidence.
[+] [-] botro|3 months ago|reply
My thinking was to just make the responses available to users and let them see how models perform. But from some feedback, turns out users don't want to have to evaluate the answers and would rather see a leaderboard and rankings.
The scalable solution to that would be LLM as judge that some benchmarks already use, but that just feels wrong to me.
LM Arena tries to solve this with the crowd sourced solution, but I think the right method would have to be domain expert human reviewers, so like Wirecutter VS IMDb, but that is expensive to pull off.
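For context on what the crowd-sourced approach actually computes: LMArena-style leaderboards aggregate pairwise human votes with an Elo-style rating update. A minimal sketch (the K-factor and starting ratings are illustrative, not LMArena's actual parameters):

```python
def elo_update(r_winner, r_loser, k=32):
    """Apply one Elo update from a single pairwise preference vote.

    The winner gains more rating when the win was unexpected
    (i.e. when its rating was lower than the loser's)."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# Two models start equal; model B wins one head-to-head vote.
rating_a, rating_b = 1000.0, 1000.0
rating_b, rating_a = elo_update(rating_b, rating_a)
print(rating_a, rating_b)  # 984.0 1016.0
```

The weakness botro points at lives entirely outside this math: the update is only as good as the vote, and nothing here distinguishes "genuinely better answer" from "more flattering answer".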
[+] [-] andai|3 months ago|reply
I saw a study where a prompt massively boosted one model's performance on a task, but significantly reduced another popular model's performance on the same task.
[+] [-] 3abiton|3 months ago|reply
Reminder that in most cases it's impossible to know whether there is cross-contamination from the test sets of public benchmarks, as most LLMs are not truly open source: we can't replicate their training. So arguably it's worse in some cases, pretty much fraud if you account for the VC money pouring in. This is even more evident in unknown models from lesser-known institutes, like those from the UAE.
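Without open training data you can't prove contamination, but when a suspect corpus is available, a crude n-gram overlap check is one of the few public tools. A sketch (the strings are made-up examples, and real checks run over tokenized corpora at scale):

```python
def ngrams(text, n=8):
    """Set of word n-grams, lowercased; long n-grams rarely repeat by chance."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(benchmark_item, corpus_chunk, n=8):
    """Fraction of the benchmark item's n-grams found verbatim in the corpus chunk."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(corpus_chunk, n)) / len(item)

q = "what is the capital of the country directly north of spain"
leaked = "quiz answers: what is the capital of the country directly north of spain paris"
print(overlap_ratio(q, leaked))  # 1.0 -> the item appears verbatim
```

High overlap only suggests leakage; paraphrased contamination slips through entirely, which is part of why closed training data makes the problem undecidable from outside.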
[+] [-] bee_rider|3 months ago|reply
When models figure out how to exploit an effect that every clever college student does, that should count as a win. That’s a much more human-like reasoning ability, than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.
[+] [-] layer8|3 months ago|reply
However:
> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.
Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.
[+] [-] jvanderbot|3 months ago|reply
College exam takers use those tricks because they are on a time limit and are gaming the system. It's clever and wink wink nudge nudge ok everyone does it. But it's one tiny signal in a huge spectrum of things we use to evaluate people.
Instead, these metrics are gamed and presented as the entire multi-spectral signal of competence for LLMs, because it is literally impossible to say that success in one domain would translate the way it might with a good hire.
What I want is something I don't have to guard against gaming. Something conscientious and capable, like my coworkers. Until then it's Google version 2 married to IntelliSense, and I'm not letting it do anything by itself.
[+] [-] joe_the_user|3 months ago|reply
That's a good point imo but we achieved this stuff by at least 2022 when ChatGPT was released. The thing about these giant black boxes is that they also fail to do things that directly human-written software ("computers") does easily. The inability to print text onto generated images or do general arithmetic is important. And sure, some of these limits look like "limits of humans". But it is important to avoid jumping from "they do this human-thing" to "they're like humans".
[+] [-] 6510|3 months ago|reply
edit: I forgot my point: calculating big numbers is not a real world problem anyone has.
[+] [-] Forgeties79|3 months ago|reply
The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.
I'm a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don't really tell us much about how LLMs perform in the ways most people actually use them. Maybe I'm off base here though.
[+] [-] riskable|3 months ago|reply
Someone want to start? I've got a Yjs/CRDT collaborative editing bug that took a week and a half and many, many attempts with Claude Code (Sonnet 4.5), GPT-5-Codex (medium), and GLM-4.6 to figure out. Even then they didn't really get it... they just came up with a successful workaround (which is good enough for me, but still...).
Aside: You know what really moved the progress bar on finding and fixing the bug? When I had a moment of inspiration and made the frontend send all its logs to the backend so the AIs could see what was actually happening on the frontend (near real-time). Really, I was just getting sick of manually testing and pasting the console output into the chat (LOL). Laziness FTW!
I have the Google Chrome Dev Tools MCP but for some reason it doesn't work as well :shrug:
[+] [-] pahae|3 months ago|reply
My use-case is probably pretty far from the usual tasks: I'm currently implementing a full observability platform based on VictoriaMetrics / Victorialogs + Grafana. It's quite elaborate and has practically no overlap with the usual/cloud solutions you find out there. For example, it uses an authenticated query stack: I use the Grafana oauth token to authenticate queries by injecting matchers via prom-label-proxy and forward that to promxy for fan-out to different datasources (using the label filter to only query some datasources). The IaC stuff is also not mainstream as I'm not using any of the big cloud providers, but the provider I use nonetheless has a terraform provider.
As you can imagine, there's probably not much training data for most of this, so the quality of the responses varies widely. From my experience so far Claude (Sonnet 4.5) does a _much_ better job than GPT-5 (Codex or normal) with the day-to-day tasks. Stuff like keeping documentation up to date, spotting inconsistencies, helping me find blind spots in the alerting rules, etc. It also seems to do better working with provided documentation / links.
I've been using Claude for a couple of weeks now but recently switched to codex after my subscription to Claude ran out. I was really curious after reading a lot of good things about it but I gotta say, so far, I'm not impressed. Compared to Claude it gives wrong answers much more frequently (at least in this domain). The results it produces take much more effort to clean up than Claude's. Probably on a level where I could just invest the time myself. Might be that I do not yet know how to correctly prompt GPT but giving both tools the same prompt, Claude does a better job 90% of the time.
Anyway, I guess this is my long-winded way of saying that the quality of responses "off the beaten track" varies widely, and it's worth testing several models. Especially if your work is not 70+% coding. Even then, I guess many benchmarks have ceased being useful by now?
[+] [-] lysace|3 months ago|reply
It's primarily marketing-driven. I think the technical parts of companies need to attempt to own this more.
It gets really weird when engineering priorities shift because of these mostly irrelevant benchmarks.
[+] [-] proc0|3 months ago|reply
> Our systematic review of 445 benchmarks reveals prevalent gaps that undermine the construct validity needed to accurately measure targeted phenomena
Intelligence has an element of creativity, and as such the true measurement would be on metrics related to novelty, meaning tasks that have very little resemblance to any other existing task. Otherwise it's hard to parse out whether it's solving problems based on pattern recognition instead of actual reasoning and understanding. In other words, "memorizing" 1000 of the same type of problem, and solving #1001 of that type is not as impressive as solving a novel problem that has never been seen before.
Of course this presents challenges for creating the tests, because you have to avoid overlap with however many petabytes of training data these systems are trained on. That's where some of the illusion of intelligence arises from (illusion not because it's artificial, since there's no reason to think the brain's algorithms cannot be recreated in software).
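One partial answer to the contamination problem proc0 raises is procedural generation: sample fresh problem instances from a template, so no specific item can appear in any training set. A sketch (the template here is a made-up illustration; ranges are chosen so the answer stays positive):

```python
import random

def make_item(rng):
    """Generate a fresh arithmetic word problem plus its machine-checkable answer."""
    a = rng.randint(10, 99)   # parts per crate
    c = rng.randint(2, 9)     # crates delivered
    b = rng.randint(10, 19)   # parts removed (< min a*c, so result > 0)
    question = (f"A crate holds {a} parts. {c} crates arrive, "
                f"then {b} parts are removed. How many remain?")
    return question, a * c - b

rng = random.Random(42)  # seeded, so an eval run is reproducible
question, answer = make_item(rng)
print(question, "->", answer)
```

The obvious caveat is proc0's own point: a model that has memorized a thousand instances of this *template family* will ace the thousand-and-first, so procedural generation measures within-family generalization, not novelty in the strong sense.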
[+] [-] lielvilla|3 months ago|reply
The big difference from LLMs is that we don't really have production-grade, standardized benchmarks for long-form TTS. We'd need things like volume stability across segments, speech-rate consistency, and pronunciation accuracy over a hard corpus.
I wrote up what this could look like here: https://lielvilla.com/blog/death-of-demo/
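One of the checks named above, volume stability across segments, is straightforward to operationalize as the coefficient of variation of per-segment RMS level. A minimal sketch over raw float samples (the synthetic sine segments stand in for real TTS audio):

```python
import math

def rms(samples):
    """Root-mean-square level of one audio segment."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def volume_stability(segments):
    """Coefficient of variation of per-segment RMS; lower = steadier volume."""
    levels = [rms(seg) for seg in segments]
    mean = sum(levels) / len(levels)
    var = sum((lvl - mean) ** 2 for lvl in levels) / len(levels)
    return math.sqrt(var) / mean

# Synthetic stand-in for TTS output: three segments, the last one
# noticeably quieter, as if the voice trailed off between chapters.
tone = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
segments = [tone, tone, [0.5 * s for s in tone]]
print(f"volume CV: {volume_stability(segments):.3f}")
```

A benchmark would fix a corpus, a segmentation rule, and a CV threshold; speech-rate consistency and pronunciation accuracy need more machinery (forced alignment, a pronunciation lexicon) but follow the same pattern.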
[+] [-] wolttam|3 months ago|reply
Even if it requires human evaluators at first, and even if the models completely suck at this task right now: it seems like the kind of task you'd want them to be good at, if you want these models to eventually carry out these tasks in embodied forms in the real world.
Just having the benchmark in the first place is what gives model makers something to optimize for.
[+] [-] naasking|3 months ago|reply
The upside is that these can all be generated and checked synthetically so large data sets are possible, in both formal and natural languages.