I hope better and cheaper models will be widely available because competition is good for the business.
However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the strong tendency to reward hacking, often write nonsensical test report while the tests actually failed. And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]
[1] https://artificialanalysis.ai/models/minimax-m2-1
amluto|17 days ago
I haven’t tried MiniMax, but GPT-5.2-Codex has this problem. Yesterday I watched it observe a Python type error (variable declared with explicit incorrect type — fix was trivial), and it added a cast. (“cast” is Python speak for “override typing for this expression”.) I told it to fix it for real and not use cast. So it started sprinkling Any around the program (“Any” is awful Python speak for “don’t even try to understand this value and don’t warn either”).
kimixa|16 days ago
whattheheckheck|16 days ago
osti|17 days ago
qinsignificance|17 days ago
It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.
This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.
I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own
edoceo|17 days ago
XCSme|17 days ago
Instead, this one works surprisingly well for the cost: https://openrouter.ai/xiaomi/mimo-v2-flash