top | item 47032946

For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5

plagiarist|14 days ago

Are smaller 2/3-bit quantizations worth running vs. a more modest model at 8- or 16-bit? I don't currently have the VRAM to match my interest in this.

jncraton|14 days ago

2- and 3-bit is where quality typically starts to really drop off. MXFP4 or another 4-bit quantization is often the sweet spot.
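
For a rough sense of the trade-off being asked about, here is a back-of-the-envelope sketch of how many parameters fit in a fixed memory budget at each bit width. The 24 GB budget and the function name are just illustrative assumptions, and the numbers cover weights only: they ignore KV cache, activations, and the per-block scale overhead that real quantized files carry.

```python
def max_params_billion(budget_gb: float, bits_per_weight: float) -> float:
    """Largest parameter count (in billions) whose weights alone fit in budget_gb.

    Weights only: ignores KV cache, activations, and quantization metadata.
    """
    bytes_per_weight = bits_per_weight / 8
    return budget_gb / bytes_per_weight


# Hypothetical 24 GB budget, chosen purely for illustration.
for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2}-bit: ~{max_params_billion(24, bits):.0f}B parameters in 24 GB")
```

So the same memory that holds a roughly 24B model at 8-bit holds something closer to 96B at 2-bit on paper; whether the bigger, more heavily quantized model is actually better is the part you have to test.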

AbstractGeo|14 days ago

IMO, they're worth trying - they don't become completely braindead at Q2 or Q3, if it's a large enough model, apparently. (I've had surprisingly decent experience with Q2 quants of large-enough models. Is it as good as a Q4? No. But, hey - if you've got the bandwidth, download one and try it!)

Also, don't forget that Mixture of Experts (MoE) models run better than you'd expect for their size, because only a small part of the model is actually "active" for each token - so e.g. a Qwen3-whatever-80B-A3B would be 80 billion parameters total, but 3 billion active - worth trying if you've got enough system RAM for the 80 billion, and enough VRAM for the 3.
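
To put rough numbers on that 80B-total / 3B-active example, here is a minimal sketch. It assumes about 4.25 bits per weight, which is what MXFP4 works out to (4-bit elements plus one 8-bit shared scale per 32-element block); actual GGUF files mix tensor types, so treat the output as a ballpark. The function name is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumption: ~4.25 bits/weight (MXFP4: 4-bit values + 8-bit scale per 32 weights).
BITS = 4.25

total_gb = weight_gb(80, BITS)   # everything that has to live somewhere (RAM + VRAM)
active_gb = weight_gb(3, BITS)   # weights actually read for a single token

print(f"total weights:  ~{total_gb:.1f} GB")
print(f"active weights: ~{active_gb:.1f} GB per token")
```

One caveat: which experts are active changes from token to token, so the "3" is not a fixed slice of the model. It is more a measure of how much weight data gets read per token than a chunk you can pin in VRAM once.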

doctorpangloss|14 days ago

Simply and utterly impossible to tell in any objective way without your own calibration data, in which case, make your own post-training quantized checkpoints anyway. That said, millions of people out there make technical decisions on vibes all the time, and has anything bad happened to them? I suppose if it feels good to run smaller quantizations, do it haha.
