Personally, I think LLM benchmarks make agents worse. All these companies chase the benchmarks, overfit, and think being able to cheat at the math olympiad is gonna get us to AGI. Instead, researchers should peer in and get me an agent that can reliably count the number of "i"s in "mississippi".
upperhalfplane|6 months ago
The best way to measure intelligence is probably to check whether a model knows its own strengths and weaknesses and deals with them efficiently. That ability is the most important thing for an eval to capture.