oofbey | 6 hours ago

The confusing thing is that there are two distributions involved here: the distribution over the vocabulary (the possible values of each token) and the distribution over the sequence of tokens in each document.

Here, the KL divergence is calculated over the vocabulary distribution: for a specific token position, it measures how much the quantized model's predicted distribution differs from the reference model's. 0 means a perfect match (no loss of quality from quantization), while a large number like 4 nats means the quantized model's predictions for that token differ substantially from the reference model's.
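
To make the per-token number concrete, here is a rough sketch of the calculation for a single token position. The direction KL(reference || quantized), the function name, and the three-word toy vocabulary are all my assumptions for illustration, not something taken from a specific tool.

```python
import math

def kl_divergence_nats(ref_probs, quant_probs, eps=1e-12):
    """KL(ref || quant) in nats for one token position.

    ref_probs and quant_probs are probability distributions over the
    same vocabulary (hypothetical inputs for illustration).
    """
    total = 0.0
    for p, q in zip(ref_probs, quant_probs):
        if p > 0.0:
            # eps guards against log of a zero quantized probability
            total += p * math.log(p / max(q, eps))
    return total

# Toy 3-word vocabulary (made-up numbers).
ref = [0.7, 0.2, 0.1]
same = kl_divergence_nats(ref, ref)                # identical predictions
shifted = kl_divergence_nats(ref, [0.1, 0.2, 0.7])  # mass moved to another word

print(same, shifted)
```

Identical distributions give 0, and the further the quantized model moves probability mass away from what the reference model predicts, the larger the value gets.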

The 99.9% is taken over the sequence of tokens. It ranks all the tokens in a corpus by their KL divergence and reports the value that only 0.1% of tokens exceed, so it effectively picks out the worst-predicted token (relative to the reference model) out of every 1000. That's the 99.9%ile part.
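
As a sketch of the percentile step, assuming you already have one KL value per token: the nearest-rank definition, the function name, and the synthetic corpus below are my own choices for illustration, and real tools may interpolate between ranks instead.

```python
import math

def percentile_nearest_rank(values, q):
    """Nearest-rank percentile: smallest value with at least q% of data <= it."""
    if not values or not 0 < q <= 100:
        raise ValueError("need a non-empty list and 0 < q <= 100")
    ordered = sorted(values)
    # round() avoids float noise pushing the rank up by one
    rank = math.ceil(round(q * len(ordered) / 100, 9))
    return ordered[rank - 1]

# Synthetic corpus (made-up numbers): 10,000 tokens, 15 of them badly predicted.
kl_values = [0.01] * 9985 + [4.0] * 15

mean_kl = sum(kl_values) / len(kl_values)
median_kl = percentile_nearest_rank(kl_values, 50)
p999_kl = percentile_nearest_rank(kl_values, 99.9)

print(mean_kl, median_kl, p999_kl)
```

In this toy corpus the mean and median barely register the 15 bad tokens, while the 99.9th percentile lands squarely on them, which is why a tail statistic is useful here.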
