It's incredible how accurate the Chatbot Arena Leaderboard [0] is at predicting model performance compared to benchmarks (which can be, and are being, gamed; see all the 7B models on the HF leaderboard)
It's because it isn't "predicting" anything, but rather aggregating user feedback. That is, of course, going to come closest to identifying the subjectively "best" model, i.e. the one that pleases the most people.
I wish that Arena included a few more "interesting" models like the new Phi-2 model and the current tinyllama model, which are trying to push the limits on small models. Solar-10.7B is another interesting model that seems to be missing, but I just learned about it yesterday, and it seems to have come out a week ago, so maybe it's too new. Solar supposedly outperforms Mixtral-8x7B with a fraction of the total parameters, although Solar seems optimized for single-turn conversation, so maybe it falls apart over multiple messages (I'm not sure).
It's much more accurate than the Open LLM Leaderboard, that's for sure. Human evaluation has always been the gold standard. I just wish we could filter out votes cast after only one or two prompts, and I hope they don't include the non-blind votes in the results.
Thanks for the reference. I was searching for a benchmark that quantifies the typical user experience, since most synthetic ones are completely ineffective. At what sample size does the ranking become significant? Or is that baked into the metric (Elo)?
Elo converges on stable scores fairly quickly, depending on the K-factor. I wouldn't think it would be much of an issue at all for something like this, since you can ensure you're testing against every other member (avoiding "Elo islands"). But obviously the more trials the better.
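For reference, the per-game Elo update is only a couple of lines; a minimal sketch in Python (K=32 is a common default, not necessarily what the Arena uses). A larger K makes ratings move, and therefore converge, faster at the cost of more jitter:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two Elo ratings after one head-to-head comparison.
    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two equally rated models; A wins the battle.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```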
paxys|2 years ago
It's like saying how can evaluating 5 years of performance at work be better at predicting someone's competency than their SAT scores.
coder543|2 years ago
https://huggingface.co/papers/2306.05685
This paper makes the argument that...
"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."
So, the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly determine a predicted Elo for every model, which would be interesting to compare against the human-rated outcomes.
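Measuring that agreement is simple once human and judge verdicts are paired per battle; a minimal sketch with made-up votes (the 'A'/'B'/'tie' encoding is an assumption, not the paper's exact format):

```python
def agreement_rate(human_votes, judge_votes):
    """Fraction of battles where the LLM judge picks the same winner
    as the human voter. Each vote is 'A', 'B', or 'tie'."""
    if len(human_votes) != len(judge_votes):
        raise ValueError("vote lists must be the same length")
    matches = sum(h == j for h, j in zip(human_votes, judge_votes))
    return matches / len(human_votes)

# Hypothetical data: 5 battles, the judge agrees on 4 of them.
humans = ["A", "B", "A", "tie", "B"]
judge = ["A", "B", "A", "A", "B"]
print(agreement_rate(humans, judge))  # 0.8
```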
coder543|2 years ago
GaggiX|2 years ago
unstuck3958|2 years ago
Phi-2 isn't fine-tuned for instruction following yet.
s-macke|2 years ago
For example, consider my analysis [0] based on observing the progression of Large Language Models (LLMs) in a single text adventure.
[0] https://github.com/s-macke/AdventureAI#evaluation-of-other-m...
nabakin|2 years ago
GaggiX|2 years ago
- Ask any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and vote for the better one!
- You can continue chatting until you identify a winner.
- Votes won't be counted if a model's identity is revealed during the conversation.
coder543|2 years ago
Do you really need more than this to know which one you’re going to pick? https://i.imgur.com/En37EJD.png
Avatar doesn’t have humans? Seriously?
londons_explore|2 years ago
_giorgio_|2 years ago
I only ask technical (PyTorch) questions, though.
3abiton|2 years ago
bitshiftfaced|2 years ago
The Glicko rating system is very similar to Elo, but it also models the variance of a given rating. It can directly tell you a "rating deviation."
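A single-game Glicko-1 update can be sketched in a few lines, following Glickman's published formulas (illustrative only; real implementations batch all games in a rating period):

```python
import math

Q = math.log(10) / 400.0  # Glicko scaling constant

def g(rd):
    """Attenuation factor: discounts results against uncertain opponents."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

def glicko_update(r, rd, r_opp, rd_opp, score):
    """One-opponent Glicko-1 update; returns (new_rating, new_rd).
    score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    e = 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))
    d_sq = 1.0 / ((Q ** 2) * (g(rd_opp) ** 2) * e * (1.0 - e))
    denom = 1.0 / rd ** 2 + 1.0 / d_sq
    new_r = r + (Q / denom) * g(rd_opp) * (score - e)
    new_rd = math.sqrt(1.0 / denom)  # deviation shrinks as evidence accumulates
    return new_r, new_rd
```

After a win the rating rises, and the rating deviation always shrinks, which is exactly the built-in uncertainty measure the parent comment mentions.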
moffkalast|2 years ago
AdrienBrault|2 years ago
https://www.reddit.com/r/LocalLLaMA/comments/17jrj82/new_mic...
dannyw|2 years ago