Totally agree with this. I have seen many cases where a dumber model gets trapped in a local minima and burns a ton of tokens to escape from it (sometimes unsuccessfully). In a toy example (30 minute agentic coding session - create a markdown -> html compiler using a subset of commonmark test suite to hill climb on), dumber models would cost $18 (at retail token prices) to complete the task. Smarter models would see the trap and take only $3 to complete the task. YMMV.Much better to look at cost per task - and good to see some benchmarks reporting this now.
IgorPartola|3 months ago
brianjking|3 months ago