item 46389001 (no title)
Invictus0 | 2 months ago
How is everyone monitoring the skill/utility of all these different models? I am overwhelmed by how many there are, and by the challenge of monitoring their capability across so many different modalities.
redman25|2 months ago
https://www.swebench.com
https://swe-rebench.com
https://livebench.ai/#/
https://eqbench.com/#
https://contextarena.ai/?needles=8
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
https://artificialanalysis.ai/leaderboards/models
https://gorilla.cs.berkeley.edu/leaderboard.html
https://github.com/lechmazur/confabulations
https://dubesor.de/benchtable
https://help.kagi.com/kagi/ai/llm-benchmark.html
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
Alifatisk|2 months ago
I’d stick to Artificial Analysis.
spoaceman7777|2 months ago
This is the best summary, in my opinion. You can also see the individual scores on the benchmarks they use to compute their overall scores.
It's nice and simple in the overview mode though. Breaks it down into an intelligence ranking, a coding ranking, and an agentic ranking.
https://artificialanalysis.ai/
Invictus0|2 months ago
Unfortunately it's completely unusable on mobile