lambda | 11 days ago

OK, with MiniMax M2.5 UD-Q3_K_XL (101 GiB), I can't really fit the full context even with smaller KV cache quants. Going much above 64k tokens, I start to get OOM errors when running Firefox and Zed alongside the model, or outright failures to allocate the buffers, even dropping to 4-bit KV cache quants (oddly, 8-bit behaved better than 4- or 5-bit, but I still ran into OOM errors).

I might be able to squeeze a bit more out if I were running fully headless with my development on another machine, but I'm running everything on a single laptop.

So it looks like, for my setup, 64k context with an 8-bit KV cache quant is about as good as I can do, and I need to drop down to a smaller model like Qwen3 Coder Next or GPT-OSS 120B if I want to use longer contexts.
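
For reference, the kind of invocation I've settled on looks roughly like this; the model path is a placeholder and flag spellings can vary a bit between llama.cpp versions, so treat it as a sketch rather than exact settings:

    # 64k context, all layers offloaded, 8-bit K/V cache
    # (a quantized V cache needs flash attention enabled)
    llama-server -m ./MiniMax-M2.5-UD-Q3_K_XL.gguf \
        -c 65536 -ngl 99 --flash-attn on \
        --cache-type-k q8_0 --cache-type-v q8_0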

lambda | 11 days ago

After some more testing, yikes, MiniMax M2.5 can get painfully slow on this setup.

Haven't tried different things like switching between Vulkan and ROCm yet.
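
(If I do get around to it: the backend is chosen when llama.cpp is built, so comparing them means keeping two builds around. Roughly like this, assuming the ROCm/HIP SDK is set up and that gfx1151 is the right target for this chip:)

    # Vulkan build
    cmake -B build-vulkan -DGGML_VULKAN=ON
    cmake --build build-vulkan --config Release -j

    # ROCm/HIP build
    cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
    cmake --build build-rocm --config Release -j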

But anyhow, that 17 tokens per second was on an almost empty context. By the time I got to around 30k tokens of context, it was down to 5-10 tokens per second, occasionally dropping all the way to 2 tokens per second.

Oh, and it looks like I'm sometimes filling up the KV cache, which forces it to drop the cache and reprocess everything from scratch. Yikes, that's why it's getting so slow.

Qwen3 Coder Next is much faster. MiniMax's thinking/planning seems stronger, but Qwen3 Coder Next is pretty good at just cranking through a bunch of tool calls, poking around through code and docs, and getting things done. Also, while browsing around the project I'm working in, MiniMax got confused by a few things that Qwen3 Coder Next picked up on, so it's not like it's universally stronger.

sosodev | 10 days ago

Thanks for the additional info. I suspected that MiniMax M2.5 might be a bit too much for this board. 230B-A10B is just a lot to ask of the 395+ even with aggressive quantization, particularly when you consider that the model is going to spend a lot of tokens thinking, which eats into an already small context window.

I switched from the Unsloth 4-bit quant of Qwen3 Coder Next to the official 4-bit quant from Qwen. Using their recommended settings, I had it running with OpenCode last night and it seemed to be doing quite well, with no infinite loops. Given its speed, large context window, and the willingness to experiment you mentioned, I think it might actually be the best option for agentic coding on the 395+ for now.
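
In case it's useful, my launch looks roughly like this. The model path and context size are just placeholders, and the sampling values are what I remember from Qwen's model card, so check the card for the exact recommendations. OpenCode then just talks to the local OpenAI-compatible endpoint llama-server exposes (http://localhost:8080/v1 by default).

    # Qwen3 Coder Next, official 4-bit quant (path and context size illustrative)
    llama-server -m ./Qwen3-Coder-Next-Q4.gguf \
        -c 131072 -ngl 99 --flash-attn on \
        --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05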

I am curious about https://huggingface.co/stepfun-ai/Step-3.5-Flash given that it does parallel token generation. It might be fast enough despite being similar in size to M2.5. However, it seems there are still some issues that llama.cpp and stepfun need to work out before it's ready for everyday use.