mbh159 | 13 days ago

This is the right direction for understanding AI capabilities. Static benchmarks let models memorize answers; a 300-turn Magic game with hidden information and sequencing decisions doesn't. The fact that frontier model ratings are "artificially low" because of tooling bugs is itself useful data: raw capability ≠ practical performance under real constraints. Curious whether you're seeing consistent skill gaps between models in specific phases (opening mulligan decisions vs. late-game combat math), or whether the rankings are uniform across game stages.
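
A minimal sketch of why tooling failures show up as depressed ratings, assuming an Elo-style rating system (the K-factor, win probabilities, and failure rate below are all invented for illustration):

    # Hypothetical sketch: how tooling failures depress a model's rating
    # independent of its real playing skill. All numbers are made up.
    import random

    random.seed(0)
    K = 16  # standard Elo K-factor

    def expected(me: float, opp: float) -> float:
        # Probability the 'me' player wins, per the Elo model
        return 1 / (1 + 10 ** ((opp - me) / 400))

    def simulate(true_win_prob: float, tool_failure_rate: float,
                 games: int = 2000) -> float:
        # Opponent pool held at a fixed 1500 rating for simplicity
        rating, opponent = 1500.0, 1500.0
        for _ in range(games):
            if random.random() < tool_failure_rate:
                score = 0.0  # tool bug throws the game regardless of skill
            else:
                score = 1.0 if random.random() < true_win_prob else 0.0
            rating += K * (score - expected(rating, opponent))
        return rating

    # Same underlying skill, with and without a 15% tooling-failure rate
    print(simulate(0.7, 0.00))  # converges near the true-skill rating (~1647)
    print(simulate(0.7, 0.15))  # noticeably lower: the "artificially low" gap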

GregorStocks | 13 days ago

A lot of models (including Opus) keep insisting in their reasoning traces that going first can be a bad idea for control decks, etc., which I find pretty interesting - my understanding is that the consensus among pros is closer to "you should go first 99.999% of the time", but the models seem to want there to be more nuance. Beyond that, most of the really interesting blunders I've dug into have turned out to be problems with the tooling (either actual bugs, or MCP tools with affordances that are a poor fit for how LLMs assume they work). I'm hoping I'm close to the end of those and am gonna start getting to the real limitations of the models soon.
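
As a hypothetical illustration of that kind of affordance mismatch (the tool names, schemas, and game state here are invented, not taken from the actual project): a tool keyed on hand position silently plays the wrong card whenever the hand reorders, while a name-keyed tool matches how LLMs naturally refer to cards and fails loudly on a mismatch.

    # Hypothetical sketch of an MCP-style tool whose affordance fits poorly
    # with how an LLM refers to game objects. Names and schemas are invented.
    hand = ["Island", "Counterspell", "Brainstorm"]

    def play_card_by_index(index: int) -> str:
        # Positional affordance: correct only if the model tracks reordering.
        # If the hand was sorted or shuffled since the model last saw it,
        # index 1 may no longer be the card the model intended to play.
        return f"played {hand[index]}"

    def play_card_by_name(name: str) -> str:
        # Name-keyed affordance: matches how models reference cards, and
        # fails loudly instead of silently playing the wrong card.
        if name not in hand:
            raise ValueError(f"{name} is not in hand: {hand}")
        return f"played {name}"

    # The model reasons "play Counterspell", but the positional tool depends
    # on ordering state it can't see; the name-keyed call is unambiguous.
    print(play_card_by_index(1))  # works only by coincidence of ordering
    print(play_card_by_name("Counterspell"))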