(no title)
lambda | 11 days ago
MiniMax passed this test, which even some SOTA models don't pass. But I haven't tried any agentic coding with it yet.
I wasn't able to allocate the full context length for MiniMax with my current setup, so I'm going to try quantizing the KV cache to see if I can fit the full context into the RAM I've allocated to the GPU. Even at a 3-bit quant, MiniMax is pretty heavy. I need a big enough context window, otherwise it'll be less useful for agentic coding. With Qwen3 Coder Next, I can use the full context window.
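For anyone trying to size this themselves: the KV cache footprint is easy to ballpark. The dimensions below are placeholders for a hypothetical 60-layer GQA model, not MiniMax's actual architecture; plug in the numbers from your model's config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    """Bytes needed for the K and V caches across all layers.

    Factor of 2 covers K plus V; bytes_per_elem is 2 for fp16,
    1 for an 8-bit quantized cache, 0.5 for 4-bit.
    """
    return int(2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem)

# Hypothetical model: 60 layers, 8 KV heads, head dim 128 (made-up numbers).
gib = 1024 ** 3
full_fp16 = kv_cache_bytes(60, 8, 128, 196_608, 2) / gib  # ~45 GiB
full_q8   = kv_cache_bytes(60, 8, 128, 196_608, 1) / gib  # ~22.5 GiB
ctx64k_q8 = kv_cache_bytes(60, 8, 128, 65_536, 1) / gib   # ~7.5 GiB
print(f"fp16 full: {full_fp16:.1f} GiB, q8 full: {full_q8:.1f} GiB, q8 64k: {ctx64k_q8:.1f} GiB")
```

The cache scales linearly with context length and with bytes per element, which is why halving the cache precision roughly doubles the context you can fit in the same VRAM.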
Yeah, I've also seen the occasional tool call looping in Qwen3 Coder Next, that seems to be an easy failure mode for that model to hit.
lambda | 11 days ago
I might be able to squeeze a bit more out if I were running fully headless with my development on another machine, but I'm running everything on a single laptop.
So looks like for my setup, 64k context with an 8 bit quant is about as good as I can do, and I need to drop down to a smaller model like Qwen3 Coder Next or GPT-OSS 120B if I want to be able to use longer contexts.
lambda | 10 days ago
Haven't tried different things like switching between Vulkan and ROCm yet.
But anyhow, that 17 tokens per second was on a nearly empty context. By the time I got to around 30k tokens of context, it was down to 5-10 tokens per second, and occasionally all the way down to 2.
Oh, and it looks like I'm filling up the KV cache sometimes, which is causing it to have to drop the cache and start over fresh. Yikes, that is why it's getting so slow.
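In case it helps anyone hitting the same wall: assuming a llama.cpp-style setup, the KV cache can be quantized at launch with the --cache-type-k/--cache-type-v flags. A sketch (model path and context size are placeholders):

```shell
# Launch with an 8-bit quantized KV cache, roughly halving cache VRAM
# versus the default f16; q4_0 shrinks it further at some quality cost.
llama-server -m ./model.gguf \
  -c 65536 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Worth checking your frontend actually exposes these; some wrappers only surface the context-size knob.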
Qwen3 Coder Next is much faster. MiniMax's thinking/planning seems stronger, but Qwen3 Coder Next is pretty good at just cranking through a bunch of tool calls, poking around through code and docs, and getting things done. MiniMax also got confused by a few things while browsing around my project that Qwen3 Coder Next picked up on, so it's not universally stronger.