In my short testing on a different MoE model, it does not perform well. I tried running Kimi-K2-Thinking-GGUF with the smallest unsloth quantization (UD-TQ1_0, 247 GB), and it ran at 0.1 tps. According to its guide, you should expect 5 tps if the whole model can fit into RAM+VRAM, but if mmap has to be used, then expect less than 1 tps which matches my test. This was on a Ryzen AI Max+ 395 using ~100 GB VRAM.
zozbot234|10 days ago
zardo|12 days ago
cpburns2009|12 days ago