oofbey | 21 hours ago

What does it mean to say that the “99.9% KL divergence” is some number like 3? In AI research and math, KL divergence is a pseudo-distance from one distribution to another. (It’s not technically a distance metric between two distributions because it’s asymmetric.)

Folks here who spend lots of time thinking about compressing models apparently have some specific interpretation of the term. Can somebody educate me? I only understand the math definition.

oofbey | 7 hours ago

The confusing thing is that there are two distributions involved. There's the distribution over the vocabulary (the possible values of each token), and there's the distribution over the token positions in each document.

Here, the KL divergence is calculated over the vocabulary distribution: for a specific token position, it measures how much the quantized model's predictions differ from the reference model's. 0 means a perfect match (no loss of quality from quantization), and a large number like 4 nats means the quantized model's predictions for that token differ substantially from the reference model's.
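A rough sketch of that per-token calculation, in case it helps. The function name and the toy probabilities are made up for illustration, and I'm assuming the direction is KL(reference || quantized), which is the usual convention but isn't stated above.

```python
import math

def kl_divergence_nats(p_ref, q_quant, eps=1e-12):
    """KL(p_ref || q_quant) in nats for a single token position.

    Both arguments are probability distributions over the vocabulary.
    The inputs below are hypothetical toy values, not real model outputs.
    """
    total = 0.0
    for p, q in zip(p_ref, q_quant):
        if p > 0.0:  # terms where the reference probability is 0 contribute nothing
            total += p * math.log(p / max(q, eps))  # eps guards against log of infinity
    return total

# Toy 4-word vocabulary at one token position.
reference = [0.70, 0.20, 0.05, 0.05]
close_quant = [0.68, 0.21, 0.06, 0.05]  # nearly the same prediction
bad_quant = [0.01, 0.01, 0.01, 0.97]    # confidently predicts something else

kl_same = kl_divergence_nats(reference, reference)     # identical distributions
kl_close = kl_divergence_nats(reference, close_quant)  # tiny divergence
kl_bad = kl_divergence_nats(reference, bad_quant)      # a few nats
```

With these toy numbers the "bad" case lands at about 3.5 nats, which is the kind of value the large numbers refer to, while the "close" case is well under 0.01.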

The 99.9% is taken over the sequence of tokens. It ranks all the tokens in a corpus by their KL divergence and reports the value that only 0.1% of them exceed, so it's effectively the token with the worst predictions (relative to the reference model) out of every 1000. That's the 99.9th percentile part.
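Here's a minimal sketch of the percentile step. It uses a simple nearest-rank definition, and real tools may interpolate differently. The helper name and the per-token KL values are invented for the example.

```python
def percentile_nearest_rank(values, per_mille):
    """Nearest-rank percentile using integer math.

    per_mille=999 means the 99.9th percentile: the smallest value such that
    at least 99.9% of the data is less than or equal to it.
    """
    ordered = sorted(values)
    n = len(ordered)
    rank = (per_mille * n + 999) // 1000  # ceil(per_mille * n / 1000), 1-based
    return ordered[max(rank, 1) - 1]

# Hypothetical per-token KL values (in nats) for a 2000-token corpus:
# most tokens barely change, a handful are badly off.
per_token_kl = [0.01] * 1995 + [0.5, 1.2, 2.0, 3.0, 4.5]

p999 = percentile_nearest_rank(per_token_kl, 999)  # 99.9th percentile
mean_kl = sum(per_token_kl) / len(per_token_kl)    # average over all tokens
```

In this made-up corpus the mean KL is under 0.02 nats while the 99.9th percentile is 2.0 nats, because only 2 tokens out of 2000 are worse than that. That's why the tail statistic can be a number like 3 even when the average looks fine.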