top | item 40261540

blackle | 1 year ago

There is a way to do this same compression by utilizing the raw probability distributions that the LLM produces as output. Fabrice Bellard has some experiments with doing this with transformers: https://bellard.org/nncp/

The idea is that if you can produce an accurate probability distribution over the next bit/byte/token, then you can compress things with an entropy coder (Huffman coding, range coding, asymmetric numeral systems, etc.). This comment is too small a space to explain fully how they work, but suffice it to say that pretty much every good compression algorithm models probability distributions in some way.
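As an illustration of the principle (not Bellard's implementation), here is a toy exact arithmetic coder with a fixed, made-up two-symbol model. A real compressor would update the model per symbol (e.g. from an LLM's next-token probabilities) and emit bits incrementally instead of keeping exact fractions:

```python
from fractions import Fraction

def encode(symbols, probs):
    """Narrow [low, low + width) by each symbol's probability slice."""
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        cum = Fraction(0)
        for sym, p in probs.items():
            if sym == s:
                low += cum * width
                width *= p
                break
            cum += p
    # any number in [low, low + width) identifies the message;
    # an ideal coder spends about -log2(width) bits on it
    return low, width

def decode(value, n, probs):
    out = []
    low, width = Fraction(0), Fraction(1)
    for _ in range(n):
        cum = Fraction(0)
        for sym, p in probs.items():
            # first slice whose upper bound exceeds value contains it
            if low + (cum + p) * width > value:
                out.append(sym)
                low += cum * width
                width *= p
                break
            cum += p
    return "".join(out)

probs = {"a": Fraction(3, 4), "b": Fraction(1, 4)}  # hypothetical model
msg = "aababa"
low, width = encode(msg, probs)
assert decode(low, len(msg), probs) == msg  # round-trips exactly
```

The better the model's probabilities match the data, the larger the final interval width and the fewer bits needed to point into it.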


FooBarBizBazz | 1 year ago

Exactly this. The way to do this is to use the LLM as the statistical model inside an arithmetic coder: you use its next-token probabilities to encode the next token. The KL divergence between the LLM's probabilities and the empirical probabilities of the corpus you actually compress is the compression inefficiency.
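That decomposition can be checked numerically: the expected code length when data distributed as p is coded with model q is the cross-entropy, which splits into the irreducible entropy H(p) plus the KL overhead. A sketch with made-up distributions:

```python
import math

def cross_entropy_bits(p, q):
    # average bits/symbol when the data follows p but we code with model q
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # hypothetical empirical token frequencies
q = [0.4, 0.4, 0.2]    # hypothetical LLM next-token probabilities

entropy = cross_entropy_bits(p, p)
# coding cost = entropy + KL "inefficiency"
assert abs(cross_entropy_bits(p, q) - (entropy + kl_bits(p, q))) < 1e-12
```

When q equals p the KL term vanishes and the coder achieves the entropy limit.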

saeranv | 1 year ago

> The idea is that if you can produce an accurate probability distribution over the next bit/byte/token...

But how can you get credible probability distributions from the LLMs? My understanding is that the outputs specifically can't be interpreted as a probability distribution, even though superficially they resemble a PMF, because the softmax output tends to concentrate close to 100% on the predicted token. You can still get an ordered list of most probable tokens (which I think beam search exploits), but they specifically aren't good representations of the output probability distribution since they don't model the variance well.
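For what it's worth, the softmax outputs are at least a valid normalized distribution, and a temperature parameter controls how peaked it is; whether that distribution is well calibrated is the separate question raised above. A minimal sketch (the logits are made up):

```python
import math

def softmax(logits, temperature=1.0):
    # subtract the max for numerical stability; temperature rescales sharpness
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.0, 1.0]  # hypothetical next-token logits
probs = softmax(logits)
assert abs(sum(probs) - 1.0) < 1e-12  # it is a proper distribution
# sharper at low temperature, flatter at high temperature
assert max(softmax(logits, 0.5)) > max(probs) > max(softmax(logits, 2.0))
```

For compression purposes the model's full, untruncated distribution at temperature 1 is what feeds the entropy coder.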

blackle | 1 year ago

My understanding is that minimizing perplexity (what LLMs are generally optimized for) is equivalent to finding a good probability distribution over the next token.
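Concretely, perplexity is the exponential of the average negative log-probability the model assigns to the true next tokens, so driving it down means assigning those tokens higher probability. A small sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    # 2 ** (average negative log2-probability of the observed tokens)
    avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_neg_log2

# hypothetical probabilities the model assigned to the actual next tokens
confident = [0.9, 0.8, 0.95, 0.7]
uncertain = [0.3, 0.2, 0.4, 0.25]

assert perplexity(confident) < perplexity(uncertain)
# a perfect model (probability 1 on every true token) reaches the minimum, 1.0
assert perplexity([1.0, 1.0]) == 1.0
```

Since log2-perplexity is exactly the average bits per token an ideal entropy coder would spend, a lower-perplexity model is, directly, a better compressor.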