snyhlxde | 8 months ago

Pokémon Red is becoming a go-to benchmark for testing the agentic abilities of advanced AI models. But is it actually a good eval for LLMs? We study this question in a standardized setting and identify three big issues: (1) Without scaffolding, even top models can't play Pokémon Red. (2) Eval setups differ across reports, so there is no fair comparison. (3) It's expensive, with very low cost-effectiveness.

We share more details and findings in our blog post: https://lmgame.org/#/blog/pokemon_red
