Invictus0 | 2 months ago

How is everyone monitoring the skill/utility of all these different models? I am overwhelmed by how many there are, and by the challenge of tracking their capabilities across so many different modalities.

redman25|2 months ago

https://www.swebench.com

https://swe-rebench.com

https://livebench.ai/#/

https://eqbench.com/#

https://contextarena.ai/?needles=8

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

https://artificialanalysis.ai/leaderboards/models

https://gorilla.cs.berkeley.edu/leaderboard.html

https://github.com/lechmazur/confabulations

https://dubesor.de/benchtable

https://help.kagi.com/kagi/ai/llm-benchmark.html

https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard

Alifatisk|2 months ago

I’d stick to Artificial Analysis.

spoaceman7777|2 months ago

This is the best summary, in my opinion. You can also see the individual scores on the benchmarks they use to compute their overall scores.

It's nice and simple in the overview mode though. Breaks it down into an intelligence ranking, a coding ranking, and an agentic ranking.

https://artificialanalysis.ai/

Invictus0|2 months ago

Unfortunately it's completely unusable on mobile