mbh159|13 days ago
This is the right direction for understanding AI capabilities. Static benchmarks let models memorize answers; a 300-turn Magic game with hidden information and sequencing decisions doesn't. The fact that frontier model ratings are "artificially low" because of tooling bugs is itself useful data: raw capability ≠ practical performance under real constraints. Curious whether you're seeing consistent skill gaps between models in specific phases (opening mulligan decisions vs. late-game combat math), or whether the rankings are uniform across game stages.
GregorStocks|13 days ago