top | item 30402942

(no title)

To improve compression of a sorted list of words you can replace the (initial) letters repeated from the word above with spaces before compression and add them back as an extra step after decompression. For example, if the previous word was "apple", the next entry will be " y" ("apply", edit: HN removes extra spaces, so this should be four spaces + "y") ("apple" will probably already be entered as " le" (three spaces + "le")). In theory a compression algorithm could handle this automatically, but in practice this gives better compression.

This is conceptually similar to what OP does by storing the (numerical) difference between the words. Also, if you have a list of numbers that aren't random, they generally compress better if you turn it into a list of the differences between the numbers.

A simple compression algorithm (miniLZO is apparently 6KB compiled) might be small enough and save enough bytes with compression to make it worth it for OP.

discuss

pronoiac|4 years ago

You might want to look at locatedb and frcode - https://manpages.ubuntu.com/manpages/bionic/man5/locatedb.5....

If you want to skim, check out the EXAMPLE section toward the bottom.

creatonez|4 years ago

Testing this with zstd -19

29.44% - Original sorted list just compressed with zstd

22.45% - Matching prefix characters from previous word replaced with space

19.38% - Matching prefix characters from previous word removed

hinkley|4 years ago

In the more general sense, there is an almost universally overlooked property of compressed data that some of the data is unordered, and you are free to rearrange it any way you want in order to improve the compression ratio.

Years ago I worked on a J2ME (Java2 Mobile Edition) application that had no business being attempted given the very small archive files allowed. We did it anyway and it actually worked pretty well. We very quickly hit the max file size however, and every feature request meant first shrinking the existing code base to make space. First we had an intern fixing bugs in the code minifier we were using, especially around deleting unused (usually debug) methods. For some reason they rejected on archive size, not payload size, so while I started out doing 'honest' work with shrinking the binary, I had spent a lot of time in college noodling with compression algorithms so my eye was eventually drawn there.

Those were in the days when I could still read JVM assembly code, and shortly after I started thinking about the compression, I realized that the constant pool entries start with the type and then the size of the entry. So while our minifier made the reasonable assumption of sorting the constant pool by type and then alphabetically within it, because most of the constant pool was strings, and strings are variable length, it was hit or miss whether the header would be treated as a run or just Huffman encoded (the fallback). If I suffix sorted, then all but the last string in the pool would be followed immediately by the header for the next string, increasing the average run length.

This ended up knocking almost a kilobyte off of the archive size. Depending on your perspective that sounds like a little or a lot, but in our case each feature cost about 500 bytes, so that change pushed the cliff I was walking toward out almost a month (and slowing the growth rate), just by changing a sort algorithm.

I filed a ticket with Sun about this, but as it turns out they already had the dense archive format in flight, and within a couple months my observation was moot because the dense format can compress constant pools across and entire archive, not just a singe file. That was at least an order of magnitude better than what I had.

It's quite likely a lot of the files we use have similar problems in them. Off the top of my head, JSON compression probably would be much higher if we treated it as unordered, and did more aggressive minification particularly for JSON-at-rest. Sorting sibling keys by value instead of by name for instance.

HALtheWise|4 years ago

Because all the words are five letters here, you should be fine to elide leading spaces in this case.

abainbridge|4 years ago

How would that work? If I want to store "apple", "apply", "apron" the bytes in memory would be "appleyron". How would I know where the second word ends?

unknown|4 years ago

[deleted]

quicktwo|4 years ago

Is this just a trie, but you reverse all the words first?