Show HN: Board Game Bench – arena-based evaluation of reasoning LLMs
2 points | bjterry | 11 months ago | boardgamebench.com
Since it's a competitive setup, and there are hundreds of board games that could be implemented, this arena approach shouldn't become instantly saturated like other benchmarks, although it's certainly possible for individual labs to finetune their models for the specific games selected.
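Arena-style benchmarks of this kind typically rank competitors with a pairwise rating system such as Elo. As a minimal sketch of how match results could feed a leaderboard (the post doesn't specify Board Game Bench's actual scoring method, so this is an illustrative assumption):

```python
# Illustrative Elo update for a two-model arena match.
# NOTE: this is a sketch of a generic pairwise rating scheme,
# not Board Game Bench's actual (unstated) scoring method.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (A, B) ratings; score_a is 1 for a win,
    0.5 for a draw, 0 for a loss."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: an upset win by the lower-rated model gains it rating points.
a, b = elo_update(1400.0, 1600.0, score_a=1.0)
```

One nice property of this scheme for an arena is that ratings are zero-sum per match, so the leaderboard stays comparable as new games and models are added.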
A notable gap is the exclusion of o1 and Google's Gemini 2.5. I may add o1 if there's enough interest, but the arena is a bit expensive to pay for out of pocket, and Gemini's rate limits were too low for me to add it right now.