Pokémon Red is also becoming a go-to benchmark for testing the agentic abilities of advanced AI models. But is Pokémon Red actually a good eval for LLMs? We study this problem in a standardized setting and identify three big issues:
1⃣ Without scaffolding, even top models can’t play Pokémon Red.
2⃣ Eval setups differ across reports—no fair comparison.
3⃣ It’s expensive! Very low cost-effectiveness.
We share more details and findings in our blogpost:
https://lmgame.org/#/blog/pokemon_red