Oh I didn't expect this to be on HN haha - but yes for our new benchmarks for Qwen3.5, we devised a slightly different approach for quantization which we plan to roll out to all new models from now on!
Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible; I was always restricting myself to models smaller than the VRAM I had.
What method are you using to do that? I've been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32GB VRAM and 64GB system RAM.
What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?
This is pretty interesting. Based on the blog post, it seems like they are using a technique similar to what I have been using to generate "layer sensitivity" data in my (still pretty beta) ggufy project, which is aimed more at diffusion (image) models.
https://github.com/qskousen/ggufy
I run Llama 3.2 3B locally for latency-sensitive classification (sub-50ms, so no room for bigger models). At that scale Q2_K vs Q4_K_M isn't just smaller — Q2 starts flipping yes/no answers that Q4 gets right. Not often, but enough to notice in production.
So the KL divergence numbers here are more useful to me than the MMLU tables honestly. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.
Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.
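The yes/no flipping I mentioned is easy to check for yourself. A toy harness sketch (random logits stand in for real model outputs, and `flip_rate` is a hypothetical helper I wrote for illustration, not from any library):

```python
import numpy as np

def flip_rate(logits_a, logits_b):
    """Fraction of examples where two models disagree on the argmax class."""
    return float(np.mean(logits_a.argmax(axis=-1) != logits_b.argmax(axis=-1)))

rng = np.random.default_rng(1)
n, classes = 5_000, 2                            # toy yes/no classification set
q4 = rng.normal(size=(n, classes))               # stand-in for Q4_K_M logits
q2 = q4 + rng.normal(scale=0.3, size=q4.shape)   # heavier simulated quantization noise

print(f"Q2 vs Q4 flip rate: {flip_rate(q4, q2):.2%}")
```

In production you'd replace the random arrays with the two quants' actual logits over a labeled set; even a low single-digit flip rate is visible at scale.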
For a simple classification task you generally want to prioritize regularization over more sophisticated behavior, so fewer parameters with larger quantization makes sense. For more generic chat-like purposes, Q2 of a larger model may often be preferable to Q4 of a smaller one.
I see the change in KLD values is pretty modest vs the prior version. Does anyone know how that translates to the real world? Is it more of a linear situation, or exponential, etc.?
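For intuition on that: KLD is measured in nats, i.e. log space, so equal-sized steps in KLD correspond to multiplicative differences in token probabilities, and KLD grows faster than linearly as predictions drift apart. A two-outcome toy illustration (my own framing, not from the benchmarks):

```python
import math

def kld_bernoulli(p, q):
    """KL(p || q) for two Bernoulli distributions, in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Reference model says "yes" with 90% probability; watch how KLD grows
# as the quantized model drifts further from that.
p = 0.90
for q in (0.85, 0.70, 0.50, 0.30):
    print(f"quant p(yes)={q:.2f}  ->  KLD={kld_bernoulli(p, q):.4f} nats")
```

So near zero the relationship is roughly linear-looking, but it steepens as the quantized model's predictions move away from the reference, which is why small headline KLD improvements can still matter at the tail.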
I love the work unsloth is doing. I only wish gguf format had better vllm support. It’s sometimes hard to find trustworthy quants that work well with vllm.
What does it mean to say that “99.9% KL divergence” is some number like 3? In AI research and math, KL divergence is a pseudo-distance metric from one distribution to another. (Not technically a distance between two distributions because it’s asymmetric.)
Folks here who spend lots of time thinking about compressing models apparently have some specific interpretation of the term. Can somebody educate me? Because I only understand the math definition.
The confusing thing is that there are two distributions involved here. There's the distribution over the vocabulary (possible values of each token) and the distribution over the sequence of tokens in each document.
Here, the KL divergence is calculated over the vocabulary's distribution - for a specific token, it is measuring how much the quantized model's predictions differ from the reference model. 0 means a perfect match (no loss of quality from quantization), while a large number like 4 nats means the quantized model's predictions for that token differ substantially from the reference model.
The 99.9% is taken over the sequence of tokens. So it ranks all the tokens in a corpus, and it effectively finds the token with the worst predictions (relative to the reference model) out of every 1000 tokens. That's the 99.9%ile part.
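In code, that computation looks roughly like this (a toy sketch with random logits standing in for real model outputs; the actual benchmark pipeline presumably runs over logits from a real corpus):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def per_token_kld(ref_logits, quant_logits):
    """KL(ref || quant) in nats, one value per token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)

rng = np.random.default_rng(0)
n_tokens, vocab = 10_000, 512   # toy sizes
ref = rng.normal(size=(n_tokens, vocab))
quant = ref + rng.normal(scale=0.1, size=ref.shape)  # simulated quantization noise

kld = per_token_kld(ref, quant)
print(f"mean KLD:  {kld.mean():.4f} nats")
print(f"99.9% KLD: {np.quantile(kld, 0.999):.4f} nats")  # worst ~1-in-1000 token
```

The mean tells you the typical per-token damage; the 99.9th percentile is the tail statistic described above, which is what tends to correlate with visible breakage.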
Maxious|1 day ago
With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX 5080 16GB.
RS-232|1 day ago
Any resources for configuring the local setup?
My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.
tosh|1 day ago
https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
santa_boy|1 day ago
Any HN model recommendations to run on my 24GB M5 and any best practices while running them?