loxias | 1 year ago

> What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.

Ahhhh. Now we're on the same page. :) Seeing how it helps when combined is somewhat subtle/non-obvious. I believe it relates to BWT and Huffman both being approximations of something more optimal. The two transforms can also have different window sizes -- one rarely does BWT on a whole 1GB file -- which introduces inefficiencies. Huffman coding is also only optimal in the limit of a very large alphabet and very long data; as your data length and alphabet size shrink, it gets less optimal.

Put differently, "I think that's a wonderfully phrased question, this _is_ my specialization/subfield, and I'm gonna need to chew on it for a while."
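If toy code helps while I chew: here's a rough sketch in Python (my own illustration, nothing lifted from a real compressor). BWT groups symbols by the context they appear in, move-to-front turns those groupings into runs of small numbers, and the zero-order entropy -- the best any per-symbol Huffman code can hope for -- drops accordingly.

    from collections import Counter
    import math

    def bwt(s):
        # Naive BWT: take the last column of the sorted rotation matrix.
        # A sentinel that sorts before everything keeps it invertible.
        s += "\0"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def move_to_front(s):
        # Recency ranks: the clustering BWT creates becomes mostly zeros.
        alphabet = sorted(set(s))
        out = []
        for ch in s:
            i = alphabet.index(ch)
            out.append(i)
            alphabet.insert(0, alphabet.pop(i))
        return out

    def zero_order_bits(seq):
        # Zero-order entropy: roughly what an ideal per-symbol
        # (Huffman-style) coder would spend on this sequence.
        n = len(seq)
        return -sum(c / n * math.log2(c / n)
                    for c in Counter(seq).values()) * n

    text = "abracadabra" * 20
    print(zero_order_bits(text))                      # raw text
    print(zero_order_bits(move_to_front(bwt(text))))  # after BWT + MTF

Same multiset of symbols either way; the transform just rearranges them so the per-symbol statistics come out far more skewed.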

crazygringo | 1 year ago

Thanks. Yeah, I can see how that makes sense: BWT would be redundant under theoretically perfect Huffman compression, but it happens to pick up some things that real-world Huffman encoders, with their practical limits on CPU and memory, don't.

Sesse__ | 1 year ago

Nearly any real-world Huffman encoder is trivially optimal, i.e., given a set of symbols with probabilities, it is easy to construct the optimal set of output bits for each symbol. (There are some exceptions: if you have a huge number of symbols, or some that are extremely rare, you may bound the maximum code length and get a non-optimal code. This matters very little in practice, i.e., less than 0.1% in a typical setting. And of course, if you allow fractional bits or probabilities that change based on context, then you can do better, but then it's no longer Huffman.)
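If anyone wants to see how trivial that construction is, here's a toy Python version (a sketch of the textbook algorithm, not any particular encoder -- real ones add length limits and canonical code assignment on top):

    import heapq
    from itertools import count

    def huffman_code_lengths(freqs):
        # Classic construction: repeatedly merge the two least likely
        # nodes; every merge adds one bit to each symbol underneath.
        ticket = count()  # tiebreaker so equal frequencies stay comparable
        heap = [(f, next(ticket), {sym: 0}) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            fa, _, a = heapq.heappop(heap)
            fb, _, b = heapq.heappop(heap)
            heapq.heappush(heap, (fa + fb, next(ticket),
                                  {s: d + 1 for s, d in {**a, **b}.items()}))
        return heap[0][2]

    # The textbook example; lengths come out as 1, 3, 3, 3, 4, 4.
    print(huffman_code_lengths({"a": 45, "b": 13, "c": 12,
                                "d": 16, "e": 9, "f": 5}))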

BWT is orthogonal to Huffman; like LZ, it exploits that some symbol _sequences_ are more common than others, while Huffman is about the optimality of coding each symbol on its own.
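A toy way to see that orthogonality (a sketch, not code from any real codec): a per-symbol coder only ever sees the histogram, so shuffling the input costs it nothing, while everything BWT and LZ exploit lives precisely in the ordering the shuffle destroys.

    import math
    import random
    from collections import Counter

    def zero_order_bits(s):
        # All a per-symbol Huffman code can react to is this histogram.
        n = len(s)
        return -sum(c / n * math.log2(c / n)
                    for c in Counter(s).values()) * n

    structured = "ababababab" * 50    # obvious sequence structure
    shuffled = list(structured)
    random.shuffle(shuffled)          # same histogram, structure destroyed
    print(zero_order_bits(structured)
          == zero_order_bits("".join(shuffled)))  # True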