oofbey | 21 hours ago
Folks here who spend a lot of time thinking about compressing models apparently have a specific interpretation of the term. Can somebody educate me? I only understand the math definition.
oofbey | 7 hours ago
Here, the KL divergence is calculated over the model's probability distribution across the vocabulary: for a specific token, it measures how much the quantized model's predictions differ from the reference model's. 0 means a perfect match (no loss of quality from quantization), while a large value like 4 nats means the quantized model's predictions for that token differ substantially from the reference model's.
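To make the per-token number concrete, here is a minimal sketch in plain Python. The direction KL(reference || quantized), the use of raw logits as input, and the function names are my assumptions for illustration, not any particular tool's implementation.

```python
import math

def log_softmax(logits):
    """Turn raw logits into log-probabilities over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence_nats(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats.

    Assumption: the reference model's distribution is the "true" one,
    so it goes on the left-hand side of the divergence.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy 4-word vocabulary (made-up logits).
ref = [2.0, 1.0, 0.1, -1.0]
slightly_off = [1.9, 1.1, 0.1, -1.0]   # quantized model barely changed
badly_off = [-1.0, 0.1, 1.0, 6.0]      # quantized model prefers another token

print(kl_divergence_nats(ref, ref))           # identical predictions give 0.0
print(kl_divergence_nats(ref, slightly_off))  # small positive value
print(kl_divergence_nats(ref, badly_off))     # several nats
```

Natural log gives nats; swapping in log base 2 would give bits instead.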
The 99.9% is taken over the sequence of tokens. So it ranks all the tokens in a corpus by their KL divergence and reports the value that only 1 token in 1,000 exceeds; effectively, it finds the token with the worst predictions (relative to the reference model) out of every 1,000 tokens. That's the 99.9th-percentile part.
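The ranking step can be sketched like this. I'm assuming a simple nearest-rank percentile here; real tools may interpolate between neighboring values, so exact numbers can differ slightly. The data is made up purely to show why the tail statistic is more revealing than the mean.

```python
def percentile_nearest_rank(values, per_mille):
    """Nearest-rank percentile; per_mille=999 means the 99.9th percentile.

    Integer arithmetic is used for the rank to avoid float rounding
    surprises with values like 0.999.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (len(ordered) * per_mille + 999) // 1000  # ceil(n * per_mille / 1000)
    rank = max(rank, 1)
    return ordered[rank - 1]

# Hypothetical per-token KL values for a 1,000-token corpus:
# almost every token matches well, but two are badly off.
per_token_kl = [0.01] * 998 + [0.5, 4.0]

mean_kl = sum(per_token_kl) / len(per_token_kl)
p999 = percentile_nearest_rank(per_token_kl, 999)

print(mean_kl)  # tiny: the two bad tokens are averaged away
print(p999)     # 0.5: only 1 token in 1,000 is worse than this
```

With 1,000 tokens, exactly one value (the 4.0) sits above the 99.9th percentile, which is the "1 in 1,000" reading described above.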