You still need an algorithm to decide, for each game that you're simulating, what actual decisions get made. If that algorithm is dumb, then you might decide Mono-Red Burn is the best deck, not because it's the best deck but because the dumb algorithm can play Burn much better than it can play Storm, inflating Burn's win rate. In principle, LLMs could have a much higher strategy ceiling than deterministic decision-tree-style AIs. But my experience with mage-bench is that LLMs are probably not good enough to outperform even very basic decision-tree AIs today.
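To make the pilot-skill bias concrete, here's a toy sketch in Python. Everything in it is an assumption for illustration: the strength and complexity numbers are made up, and the formula for how a weak pilot hurts a hard deck is just the simplest thing I could write down, not anything measured from mage-bench or real games.

```python
def effective_strength(strength, complexity, pilot_skill):
    # Toy assumption: a weaker pilot loses a share of the deck's strength
    # proportional to how hard the deck is to play.
    return strength * (1.0 - complexity * (1.0 - pilot_skill))


def win_rate(deck_a, deck_b, pilot_skill):
    # Both decks are played by the same pilot; returns deck_a's win rate.
    a = effective_strength(deck_a[0], deck_a[1], pilot_skill)
    b = effective_strength(deck_b[0], deck_b[1], pilot_skill)
    return a / (a + b)


# (strength under perfect play, complexity in [0, 1]); hypothetical numbers
burn = (1.0, 0.1)
storm = (1.2, 0.8)

for skill in (1.0, 0.3):
    print(f"pilot skill {skill}: Burn win rate {win_rate(burn, storm, skill):.3f}")
```

With these made-up numbers, Burn is below 50% against Storm under a perfect pilot and above 50% under a weak one, so the simulation ranks the decks in the wrong order purely because of the pilot.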
deadbabe|13 days ago
Worse, it’s difficult to tweak. For example, what if you want AIs that play at varying difficulties? Are you just gonna prompt the LLM “hey try to be kinda shitty at this but still somewhat good”?
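Compare that with a scored, decision-tree-style AI, where difficulty can be an explicit number. A minimal sketch in Python, assuming a hypothetical `score` function over candidate actions and a `blunder_rate` knob I made up for illustration (not how any particular engine does it):

```python
import random


def choose_action(actions, score, blunder_rate, rng):
    """Pick the best-scoring action, except blunder at the given rate.

    blunder_rate is the probability of picking a uniformly random action
    instead of the best one: 0.0 is full strength, 1.0 is fully random.
    """
    if rng.random() < blunder_rate:
        return rng.choice(actions)
    return max(actions, key=score)


# Hypothetical action names and scores, purely for illustration.
scores = {"bolt_face": 3, "bolt_creature": 5, "pass": 0}
actions = list(scores)
rng = random.Random(0)

print(choose_action(actions, scores.get, 0.0, rng))  # always the top-scoring action
print(choose_action(actions, scores.get, 0.5, rng))  # blunders about half the time
```

Turning one float up or down gives you a repeatable difficulty setting, which is a lot harder to get out of a prompt.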