Show HN: CivBench, a long-horizon AI benchmark for multi-agent games
12 points| mbh159 | 5 days ago |clashai.live
I built ClashAI to be an open agent scoreboard where frontier models play against each other in environments like Civilization and other strategy games. Every match is streamed live with the AI's thinking fully observable.
Agent rankings will be updated continually as we add environments.
Brief notes on CivBench Season #001:
- 200-turn limit
- Starting with 8 of the top 42 agents we’ve tested in a standardized harness
- 90s reasoning timeout (timed with thinking config per model card)
- Live benchmark, still growing sample size
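For anyone curious how a per-turn reasoning timeout and a turn limit might fit together, here is a minimal sketch. It is not the actual CivBench harness: the `take_turn` and `run_match` helpers, the agent-as-coroutine interface, and the "end_turn" fallback on timeout are all assumptions made for illustration. Only the 90s and 200-turn values come from the notes above.

```python
import asyncio

REASONING_TIMEOUT_S = 90   # per-turn reasoning timeout from the season notes
TURN_LIMIT = 200           # turn limit from the season notes


async def take_turn(agent, state, timeout_s=REASONING_TIMEOUT_S):
    """Ask the agent for an action, giving up after timeout_s seconds.

    `agent` is a hypothetical coroutine function: state -> action.
    Falling back to "end_turn" on timeout is an assumed policy.
    """
    try:
        return await asyncio.wait_for(agent(state), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "end_turn"


async def run_match(agent, turn_limit=TURN_LIMIT, timeout_s=REASONING_TIMEOUT_S):
    """Play up to turn_limit turns and collect the chosen actions."""
    actions = []
    for turn in range(turn_limit):
        actions.append(await take_turn(agent, {"turn": turn}, timeout_s))
    return actions


# Two toy agents to exercise the timeout path (purely illustrative).
async def fast_agent(state):
    return "build_city"


async def slow_agent(state):
    await asyncio.sleep(0.2)
    return "build_city"


if __name__ == "__main__":
    print(asyncio.run(take_turn(fast_agent, {})))                  # answers in time
    print(asyncio.run(take_turn(slow_agent, {}, timeout_s=0.05)))  # hits the timeout
```

A real harness would also need to decide whether provider latency counts against the model's clock, which this sketch ignores.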
What’s been interesting so far:
Models that look similar on static benchmarks can diverge meaningfully in long-horizon matches. In early CivBench runs, we see distinct strategy tendencies (e.g., military-forward vs economy/tech-first openings), plus clear differences in execution profile (latency, token cost, actions per turn). In some matchups, lower-cost models move through turns faster while remaining competitive on outcome metrics.
Some measurement notes:
- Test runs are expensive at max configurations: running Claude Opus 4.6 cost us $1200 for one match. We tuned accordingly.
- Sometimes LLM providers are flaky/slow even though their models are fast.
If you’re a research team looking to access the data, or you're interested in hosting an environment, please get in touch!
Thanks to the OG freeciv community
LINKS:
freeciv-llm: https://github.com/taso-ventures/freeciv-llm
Initial learnings: https://www.clashai.live/blog/ai/introducing-civbench-season...
pmoxyz|5 days ago
You mention Opus 4.6 cost $1200 in one match. How do you plan to benchmark economic efficiency? Looking at the performance vs. cost trade-off, you might say a model that plays 80% as well at 1% of the cost is more impressive than the 'top' model.
mbh159|5 days ago
In the leaderboards section of the page I'll be auto-populating each model's token cost as a metric to evaluate on.
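As a rough illustration of the performance vs. cost trade-off raised above, one option is to report score per dollar alongside a Pareto frontier (models that no other model beats on both score and cost). This is only a sketch under assumptions: the `ModelResult` type, the model names, and the scores are hypothetical. The $1200 figure is from the post, and the "80% as well at 1% of the cost" row mirrors the example in the question.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    score: float     # outcome metric, higher is better (hypothetical scale)
    cost_usd: float  # cost of one match in US dollars


def score_per_dollar(r: ModelResult) -> float:
    """Simple efficiency ratio: outcome score divided by match cost."""
    return r.score / r.cost_usd


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Keep models that are not dominated on (score, cost).

    A model is dominated if another model scores at least as well and
    costs at most as much, and is strictly better on one of the two.
    """
    frontier = []
    for r in results:
        dominated = any(
            o.score >= r.score
            and o.cost_usd <= r.cost_usd
            and (o.score > r.score or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)


# Illustrative numbers only, not measured CivBench results.
results = [
    ModelResult("top-model", score=100.0, cost_usd=1200.0),
    ModelResult("budget-model", score=80.0, cost_usd=12.0),  # 80% score, 1% cost
    ModelResult("mid-model", score=70.0, cost_usd=300.0),
]

if __name__ == "__main__":
    for r in pareto_frontier(results):
        print(f"{r.name}: {score_per_dollar(r):.3f} score per dollar")
```

With these made-up numbers, the budget model and the top model both sit on the frontier, while the mid model drops off because the budget model is both cheaper and stronger. A single ratio hides that distinction, which is why showing the frontier alongside it may be more informative.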
zimbo63|5 days ago
jhylee|5 days ago
mbh159|5 days ago
zimbo63|5 days ago
mbh159|5 days ago
nhal|5 days ago
mbh159|5 days ago
jcion|5 days ago
What insights do you think they’ll provide that Civ doesn’t?
mbh159|5 days ago
This is a faster-paced, shorter-lived game, so we can collect larger samples across larger groups and get significant results on model behaviors such as collaboration, truth-telling, and the ability to lie effectively.
amacx|5 days ago
mbh159|5 days ago
brownpoints|5 days ago
amacx|5 days ago
mbh159|5 days ago
cameron17|5 days ago
mbh159|5 days ago
andrewgazelka|5 days ago
mbh159|5 days ago
Mojo19|5 days ago
killiandunne1|5 days ago
mbh159|5 days ago
jamiecode|4 days ago
[deleted]