top | item 45856964

(no title)

Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.

I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.

And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.

I dunno what to do about it and am tending to just pick Gemini as a result.

discuss

ACCount37|3 months ago

Ratings on LMArena are too easily gamed.

Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.

A lot of the sycophancy mess that seeps from this generation of LLM stems from reckless tuning based on human feedback. Tuning for good LMArena performance has similar effects - and not at all by a coincidence.

energy123|3 months ago

It's biased to small context performance, which is why I don't pay much attention to it as a developer aside from a quick glance. I need performance at 40-100k tokens which models like Deepseek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.

botro|3 months ago

This is something I've stuggled with for my site, I made https://aimodelreview.com/ to compare the outputs of LLMs over a variety of prompts and categories, allowing a side by side comparison between them. I ran each prompt 4 times for each model with different temperature values available as a toggles.

My thinking was to just make the responses available to users and let them see how models perform. But from some feedback, turns out users don't want to have to evaluate the answers and would rather see a leaderboard and rankings.

The scalable solution to that would be LLM as judge that some benchmarks already use, but that just feels wrong to me.

LM Arena tries to solve this with the crowd sourced solution, but I think the right method would have to be domain expert human reviewers, so like Wirecutter VS IMDb, but that is expensive to pull off.

andai|3 months ago

>when we get a prompt working reliably on one model, we often have trouble porting it to another LLM

I saw a study where a prompt massively boosted one model's performance on a task, but significantly reduced another popular model's performance on the same task.

boccaff|3 months ago

Do you have any pointer to search for that?

diamond559|3 months ago

I'd rather quit then be forced to beta test idiocracy. What's your company so we can all avoid it?

HPsquared|3 months ago

Psychometric testing of humans has a lot of difficulties, too. It's hard to measure some things.

3abiton|3 months ago

> Comparing models, or even different versions of the same model, is a pseudo-scientific mess.

Reminder that in most cases, it's impossible to know if there is cross-contamination from the test set of the public benchmarks, as most LLMs are not truely open-source. We can't replicate them. So arguably it's worse in some cases, pretty much fraud if you account for the VC money pouring in. This is even more evident in unknown models from lesser known institutes like from UAE.