top | item 42901725

(no title)

loxias | 1 year ago

I always understood it as working because of the predictability of a symbol/letter/token given the previous one.

Sorting all the shifts of a string puts all the characters in order, then looking at the last column shows you all the _preceding_ characters. If there's any predictability there (which there often is), it's now easier to compress. It's sorta like an entropy coder in that way.

I've never thought of it as being that deep, and understood them since I was a kid -- building an intuition for "why" the FFT works is much harder -- but that being said, I clicked quickly to reply thinking "that's easy! I can explain this!" then struggled for a while trying to get the picture in my mind into text. :)

discuss

crazygringo|1 year ago

But shouldn't Huffman coding already detect that same predictability and compress it the same?

What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.

loxias|1 year ago

> What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.

Ahhhh. Now we're on the same page. :) Seeing how it helps when combined is somewhat subtle/non-obvious. I believe it relates to BWT and Huffman both being approximations of something more optimal. The two transforms could also have different window sizes -- one rarely does BWT on a whole 1GB file -- which introduce inefficiencies. Huffman coding is also only optimal in the very large alphabet and very long data limits. As your data length and alphabet size decrease, it gets less optimal.

Put differently, "I think that's a wonderfully phrased question, this _is_ my specialization/subfield, and I'm gonna need to chew on it for a while."

jltsiren|1 year ago

One way to look at data compression is splitting it into a model and an encoder. The model describes the data, while the encoder encodes (or equivalently predicts) the data according to the model. The compressed output consists of the serialized model and the encoded data. BWT is a model, while Huffman is an encoder.

Huffman takes a probability distribution and a symbol and encodes the symbol according to the distribution. If you encode all symbols independently according to the same distribution, you probably don't get very good compression.

You get a bit better results with a model that has a separate distribution for each context. If the previous symbols were X, Y, and Z, you encode the next symbol according to the distribution for context XYZ. This approach doesn't really scale, because the size of the model grows rapidly (exponentially in a naive implementation) with context length. You get better compression with an adaptive model. You start with a uniform distribution and update the available contexts and distributions after encoding each symbol. On the one hand, you don't have to store the model explicitly. But on the other hand, updating the model is very slow.

Burrows-Wheeler transform is an implicit model. It sorts the symbols according to the context that follows them, and it does that simultaneously for each context length. Because you don't have to store the model explicitly, you can effectively use longer context lengths than with a fixed explicit model. And because you don't have to update an explicit model after encoding each symbol, using the BWT is much faster than using an adaptive model.

palaiologos|1 year ago

Hi, tool author here.

Huffman coding is a static minimum-redundancy code. What this means is that it finds an optimal assignment of bit sequences to letters in the input alphabet (commonly US-ASCII or extensions). This however means that Huffman coding can not exploit redundancies that stem from the concrete sequence of characters. For example, you could easily predict that an `e` comes after `Th`, but Huffman coding can not know that.

Hence after applying the Burrows-Wheeler transform you need to have some sort of a higher-order transform (i.e. a transform that considers more than just individual bytes) which somehow reaps from the changed distribution of the result of the algorithm. But we will get to that in a second.

The joke here is that the Burrows-Wheeler transform is closely related to suffix trees and suffix arrays, which are often used in bioinformatics and HPC for full-text search. If you wanted to find a pattern of length `p` in a text of length `n`, if you already have a suffix tree of the original text, the search is linear in the length /of the pattern/ - i.e. O(p). The suffix tree stores all suffixes of a string in a compressed manner (i.e. it has a linear space overhead, approximately O(20n) as given by Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press), so you can search for a word in it by simply traversing from the root node to an internal or leaf node by following a sequence of bytes that comprise the word.

As such, a suffix tree (and equivalently suffix array and the BWT, which is trivially computed from a suffix array) form something which can be thought of as a static PPM model. Notably real world implementations of PPM use suffix trees as a part of their main storage data structure (e.g. PPMd). What this all means is that given a suffix tree, we can very cheaply give the probability distribution for the next byte that follows a given fixed-order sequence of bytes. This is nice, because then e.g. an order-2 predictor would be able to tell that `Th` is followed by `e` once enough data has been gathered.

As you can probably guess, the more preceding bytes you know, the better will be your estimate for what is the most likely next byte. But the larger your context, the more expensive the searches and computations become due to pointer chasing in the suffix tree.

So how do we remedy this? We notice that the Burrows-Wheeler transform essentially clusters similar contexts together, meaning that a low order predictor (= faster, simpler) on BWT compresses as well as a high order predictor (= slow, complicated) on the original data, at the cost of an extra transformation. This is viable, because the Burrows-Wheeler transform can be quickly computed and there have been recent advancements in running it on the GPU. So what this means is that bzip3 uses BWT + a low order predictor with an arithmetic coder to encode the bytes, meaning that it can make use of high order statistics for compression and performs comparably at a faster speed.

eru|1 year ago

> But shouldn't Huffman coding already detect that same predictability and compress it the same?

Huffman coding only works on individual letters. It doesn't know anything about relationships between adjacent letters.

vanviegen|1 year ago

Understanding why increasing predictability helps with compression is not the hard part though. What's hard to grasp is why the transform is reversible.

gfody|1 year ago

a word can be factored into the set and frequency of letters + the specific permutation. compressable patterns in either channel seem likely when the underlying words are language like.