mbowcut2 | 2 months ago

It makes me wonder about the gaps in evaluating LLMs by benchmarks. There is almost certainly overfitting happening, which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena, right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models. That seems to have been fruitful for improving coding.

pants2|2 months ago

The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.
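A rough sketch of what a use-case-specific benchmark like that could look like, not a description of any particular setup. Everything here is hypothetical: the `Case` type, the `run_benchmark` helper, and the two stand-in "models", which are plain functions in place of real API calls. The idea is just a fixed set of prompts from your own workload, a pass/fail check per prompt, and a pass rate per model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """One benchmark item: a prompt plus a check on the model's output."""
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    cases: List[Case],
) -> Dict[str, float]:
    """Return the fraction of cases each model passes."""
    if not cases:
        raise ValueError("need at least one benchmark case")
    scores = {}
    for name, generate in models.items():
        passed = sum(1 for case in cases if case.check(generate(case.prompt)))
        scores[name] = passed / len(cases)
    return scores


# Stand-in "models" (assumption: real ones would wrap an API or local call).
def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return prompt


# Toy cases; in practice these would come from your own use case.
cases = [
    Case("hello", lambda out: out == "HELLO"),
    Case("abc", lambda out: out.isupper()),
]

scores = run_benchmark({"model_a": model_a, "model_b": model_b}, cases)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Keeping the cases and checks fixed means a new model can be re-scored by adding one entry to the `models` dict.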

airstrike|2 months ago

If you and others have any insights to share on structuring that benchmark, I'm all ears.

There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.

The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.

dotancohen|2 months ago

How do you find and decide which obscure models to test? Do you manually review the model card for each new model on Hugging Face? Is there a better resource?

Legend2440|2 months ago

I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss.

The only exception I can think of is models trained on synthetic data like Phi.

pembrook|2 months ago

If the models from the big US labs are being overfit to benchmarks, then we also need to account for HN commenters overfitting positive evaluations to Chinese or European models based on their political biases (US big tech = default bad, anything European = default good).

Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;)

astrange|2 months ago

Americans have an opposing bias via the phenomenon of "safe edgy", where for obvious reasons they're uncomfortable with being biased towards anyone who looks like a US minority, and redirect all that energy towards being racist to the French. So it's all balanced.