It makes me wonder about the gaps in evaluating LLMs by benchmarks. There is almost certainly overfitting happening, which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena, right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models. That seems to have been fruitful for coding.
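A minimal sketch of that formatting bias, with entirely made-up votes: a naive pairwise win rate rewards the markdown-heavy answer, while a crude length control (only counting battles where the two responses are of comparable length, a rough hypothetical proxy for "formatting didn't decide it") tells a different story.

    # Hypothetical sketch: how a naive pairwise win rate can be confounded by
    # response length (a rough proxy for formatting-heavy answers).
    # Data is made up; a real arena would use human votes at scale.

    from statistics import mean

    # (prompt, model_a_response, model_b_response, vote) where vote is "a" or "b"
    battles = [
        ("q1", "Short direct answer.",
         "## Answer\n\n- Point one\n- Point two\n\nLonger, markdown-heavy answer.", "b"),
        ("q2", "Terse but correct.",
         "**Bold headers** and bullet lists everywhere, same content.", "b"),
        ("q3", "A genuinely better short answer.",
         "Wrong but nicely formatted.", "a"),
    ]

    raw_win_rate_b = mean(1.0 if vote == "b" else 0.0 for _, _, _, vote in battles)

    # Crude length control: only count battles where the responses are of
    # comparable length, so verbosity/formatting alone can't decide the vote.
    comparable = [
        vote for _, a, b, vote in battles
        if 0.5 <= len(b) / max(len(a), 1) <= 2.0
    ]
    controlled_win_rate_b = (
        mean(1.0 if v == "b" else 0.0 for v in comparable)
        if comparable else float("nan")
    )

    print(f"raw win rate for B: {raw_win_rate_b:.2f}")          # 0.67
    print(f"length-controlled win rate for B: {controlled_win_rate_b}")  # 0.0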
pants2|2 months ago
airstrike|2 months ago
There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.
The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.
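One sketch of what "generally applicable" might look like, assuming the model is reduced to a single prompt-to-text callable: the cases and checkers below are hypothetical, but the shape (a fixed, deterministic suite with a swappable model) is the part you'd reuse each week.

    # Hypothetical sketch of a reusable eval harness: the model-specific part
    # is one callable; the cases, scoring, and reporting are reused every time
    # a new model ships. Not a real library, just the shape of the idea.

    from typing import Callable

    # Each case: (prompt, checker). Checkers are deterministic so runs compare.
    CASES: list[tuple[str, Callable[[str], bool]]] = [
        ("What is 2 + 2? Answer with just the number.",
         lambda out: out.strip() == "4"),
        ("Name the capital of France in one word.",
         lambda out: "paris" in out.lower()),
    ]

    def run_eval(model: Callable[[str], str], name: str) -> float:
        """Run every case against `model` (prompt -> text) and return accuracy."""
        passed = sum(1 for prompt, check in CASES if check(model(prompt)))
        score = passed / len(CASES)
        print(f"{name}: {passed}/{len(CASES)} = {score:.0%}")
        return score

    # Usage: wrap this week's model in a function and re-run the same suite.
    def fake_model(prompt: str) -> str:  # stand-in for a real API call
        return "4" if "2 + 2" in prompt else "Paris"

    run_eval(fake_model, "fake-model-v1")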
dotancohen|2 months ago
Legend2440|2 months ago
The only exception I can think of is models trained on synthetic data like Phi.
pembrook|2 months ago
Also, we should be aware of people cynically playing into that bias to advertise their app, like OP, who has managed to spam a link in the first line of a top comment on this popular front-page article by telling the audience exactly what they want to hear ;)
astrange|2 months ago