top | item 12334270

The Elegance of Deflate

251 points | ingve | 9 years ago | codersnotes.com

82 comments

[+] beagle3|9 years ago|reply
There were a lot of encoders at the time using this general scheme (a few more values to indicate match length or distances). PKZIP won at the time because it was faster, and PK had the name recognition from his PKARC, which was a superfast implementation of SEA ARC (the dominant archiver on PCs at the time).

PK had to stop working on PKARC because of a C&D request from SEA. He wrote the first algos of PKZIP, which were on par with SEA ARC on compression (and with PKARC on speed), but weren't much better. (And they have been deprecated since the 1990s, if I'm not mistaken.)

Then, the Japanese encoders started to rule the compression ratio (with comparably reasonable compression times) - LHArc, LZAri, I don't remember the rest of the names. LHArc or LHA (don't remember which) had basically the same scheme that PKZIP converged on, except it used adaptive arithmetic coding. PK replaced that with static Huffman coding, trading a little compression for a lot of speed, and the format we now know and love as "zlib compression" was born (and quickly took the world by storm, being in a sweet spot of compression and speed).

There's another non-trivial thing that PKZIP had going for it - it put the directory at the end, which meant you could see the list of files in the archive without reading the entire archive! This sounds simple, but back then everyone adopted the tar-style "file header then data" appendable style, which meant that just listing the files inside a 300KB zip file (almost a whole floppy disk!) meant reading that entire floppy (30 seconds at least). PKZIP could do it in about 2 seconds.
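A minimal sketch of why the end-placed directory makes listing fast: the ZIP end-of-central-directory record begins with the signature bytes "PK\x05\x06" (0x06054b50 little-endian), so a lister only needs to read the tail of the archive and scan backward for it. This is a simplified illustration, not a full parser:

```c
#include <assert.h>

/* Sketch: scan backward for the End Of Central Directory signature
   "PK\x05\x06". A real reader must also cope with the optional trailing
   archive comment, which can push the record up to ~64KB from the end. */
long find_eocd(const unsigned char *buf, long n) {
    for (long i = n - 4; i >= 0; i--)
        if (buf[i] == 0x50 && buf[i+1] == 0x4b &&
            buf[i+2] == 0x05 && buf[i+3] == 0x06)
            return i;  /* offset of the directory record */
    return -1;         /* no signature: not a zip archive */
}
```

Once the record is found, it gives the offset and size of the central directory, so only that small region needs reading to list every file.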

[+] creshal|9 years ago|reply
> There's another non-trivial thing that PKZIP had going for it - it put the directory at the end, which meant you could see the list of files in the archive without reading the entire archive!

The beginning and the end, which continues to bite us in the ass to this day: we regularly stumble over bugs where one part of a toolchain uses one copy of the metadata and the rest uses the other.

[+] carey|9 years ago|reply
Weren't there also (alleged) patents on arithmetic coding back then? Huffman coding had that as an advantage, too, especially when GNU were looking for an alternative to Unix' compress program.
[+] userbinator|9 years ago|reply
LZH actually remains very common in BIOS images, probably because of its superior compression in relation to the other formats along with its low implementation complexity.

See this article, for example:

http://www.intel.com/content/www/xr/ar/support/boards-and-ki...

The file it references can be downloaded here:

https://downloadcenter.intel.com/download/8375/Logo-Update-U...

...and it's somewhat amazing that it contains the original LHA.EXE from 25 years ago:

    LHA version 2.13                      Copyright (c) Haruyasu Yoshizaki, 1988-91
    === <<< A High-Performance File-Compression Program >>> ========  07/20/91  ===
[+] acqq|9 years ago|reply
And "there's another non-trivial thing that PKZIP had going for it":

https://en.wikipedia.org/wiki/Phil_Katz

"SEA was attempting to retroactively declare the ARC file format to be closed and proprietary. Katz received positive publicity by releasing the APPNOTE.TXT specification, documenting the Zip file format, and declaring that the Zip file format would always be free for competing software to implement."

[+] sebular|9 years ago|reply
Really enjoyed reading this, I think I'd like to look more closely at an implementation to understand it further.

Reading the article, I was reminded of a nagging question I've had in the back of my mind for a bit.

The ASCII table was invented in the '70s, when the cost of disk storage was much higher than it is today. But it's a shared dictionary, a standard that we can all agree upon, something that's already on every computer.

The thing I've wondered about is whether there could be any advantage to creating new generations of shared dictionaries that are domain-specific, and much larger.

For example, in the specific (and over-simplified) case of transmitting large quantities of English text from a server to a client, you could reduce the amount of data sent over the wire if both parties shared a lookup table that contained the entire English dictionary. In that case, you wouldn't transmit a book as a collection of characters, but rather as a collection of words.
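That idea can be sketched in a few lines; the four-word list below is a made-up stand-in for a full shared English lexicon, and a real scheme would need an escape code for out-of-dictionary words:

```c
#include <string.h>
#include <assert.h>

/* Toy sketch of a pre-shared word dictionary: both sides hold the same
   word list, so a message travels as indices rather than characters.
   The four-word list is a hypothetical stand-in for a full lexicon. */
static const char *shared_dict[] = { "the", "quick", "brown", "fox" };
#define DICT_LEN 4

/* Returns the dictionary index of a word, or -1 if absent
   (a real scheme needs an escape for out-of-dictionary words). */
int word_code(const char *word) {
    for (int i = 0; i < DICT_LEN; i++)
        if (strcmp(shared_dict[i], word) == 0)
            return i;
    return -1;
}
```

With a lexicon of, say, a million entries, each word costs about 20 bits to send, versus roughly 40 bits for an average five-letter word in 8-bit characters.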

Furthermore, it seems like you could apply traditional compression methods like the one described in the article to further reduce the amount of data being sent. Rather than identifying repeating patterns of letters, you would identify repeating patterns of words.

Of course, the obvious drawback is that an English lookup table is useless for transmitting anything other than English text. But again, disk storage being as cheap as it is, I wonder if it wouldn't be such a monumental problem to store many domain-specific dictionaries.

Of course, you'd always want to keep the ASCII table as a fallback. Much in the same way that a conversation in sign language (ASL, specifically) is largely composed of hand gestures referring to words or concepts, but the language still includes the complete alphabet as a fallback.

The thing I don't understand well enough is whether modern compression already incorporates concepts that are similar enough such that a shared-word dictionary would be useless. It seems like the "LZXX" style of compression essentially includes a similar but dynamically-generated sort of dictionary at the beginning of the transmission, and subsequently refers to that in order to express the message itself. Would the gains of having enormous shared dictionaries cancel out the advantage of that approach, only in a much more "wasteful" way?

[+] xenadu02|9 years ago|reply
The cost-benefit is just too low for this to be practical in most cases:

* The large data being transferred is almost never text; it's video, images, or audio, which already use advanced (sometimes lossy) compression

* Even if it is text, it's probably in some document format or HTML, which greatly complicates the shared-dictionary scenario (and would require a separate dictionary for each format)

In reality, even deflate is already doing what you propose, but it builds the dictionary dynamically and includes it with each message. The upside is that its dictionary can find common phrases and patterns that wouldn't be in your fixed dictionary. As the text becomes larger, the dictionary becomes a smaller and smaller percentage of the total.

[+] carey|9 years ago|reply
It sounds like what you're thinking of has been included in the Brotli compression algorithm, which uses a relatively large initial dictionary. There are similar ideas with a more limited scale in HTTP/2's HPACK algorithm for header compression.
[+] Erwin|9 years ago|reply
Google's SDCH is kind of that, but each website can in advance decide what the contents of the shared dictionary are. So when you download a SDCH compressed page you may also download a dictionary. Information and experiences with usage are sparse, here's one article from LinkedIn: https://engineering.linkedin.com/shared-dictionary-compressi... -- "Average additional compression across our static content was about 24%"
[+] Const-me|9 years ago|reply
Microsoft implemented large pre-shared dictionary for their binary XML format. When used in WCF, the XML is a SOAP envelope with some data-contract serialized messages. A lot of XML elements/attributes/values are known in advance. Works very well in practice, here's my blog post about that: http://const.me/articles/net-tcp/
[+] Lerc|9 years ago|reply
I considered a project loosely along those lines. My idea was a One-Meg-of-Data project. Define a megabyte of data that everyone would have in a bit-identical form.

The appeal of this to me would be in the code-golf style of approach of specifying the actual data. Apart from plain useful raw data, it would include a bare minimum VM that can be used to generate larger outputs from the data. More complex VMs would be permitted provided a reference implementation can be run on the minimal VM. The code for the reference implementation would, of course, be included in the one megabyte.

A program using the data could request

* raw bytes

* output from the VM given code and parameters (with perhaps a cycle limit to stop endless loop hangs)

* output from the VM given code and parameters specified at an index in the data.

* output from the VM given code and parameters specified at an index in the data plus user parameters.

The system would be stateless, the VM would only exist for the duration of the data generation call. The same parameters would always produce the same output.

I have doubts as to its fundamental usefulness, but it would certainly make a fun collaborative project at the very least.

[+] TD-Linux|9 years ago|reply
You exactly describe Brotli. It has a 100KB pre-shared table trained on a large corpus of web content.
[+] dalke|9 years ago|reply
The ASCII standard was from the 1960s, not 1970s. LBJ in 1968 mandated that all federal computers purchased by the government must support ASCII.

This historical correction does not alter the rest of your comment.

[+] rakoo|9 years ago|reply
> Would the gains of having enormous shared dictionaries cancel out the advantage of that approach, only in a much more "wasteful" way?

It does. The extreme of it is storing all information inside pi, as is done here: https://news.ycombinator.com/item?id=6698852. In the end the dictionary is so big that the indexes themselves start to be too big.

If you really had infinite space and immediate access to all digits, there actually could be some way to make it work, though.

[+] jasonwatkinspdx|9 years ago|reply
> The thing I've wondered about is whether there could be any advantage to creating new generations of shared dictionaries that are domain-specific, and much larger.

The HPACK header compression used by HTTP2 has a static dictionary.

> The thing I don't understand well enough is whether modern compression already incorporates concepts that are similar enough such that a shared-word dictionary would be useless.

I believe this is generally the case.

[+] sn41|9 years ago|reply
I think this is a good idea and could reduce file sizes for compressing text. Moreover, people's active vocabulary is quite small, so even more might be possible. Just as an aside, a cliche was originally a single metal slug used to cast a repeatedly used phrase. So phrase-wise compression of English text might be useful.

However, I don't know the practical improvement obtained over the currently existing methods. This is a good project to investigate, I think. Even if text compression might not be important in the larger picture of data transmission on the web, this sounds interesting on its own merit.

[+] Dylan16807|9 years ago|reply
It's a good system. Though the full DEFLATE spec has quirks, like storing literal blocks prefixed by their length and then the ones' complement of their length.

The next step in improving a simple compression algorithm is to replace Huffman encoding with arithmetic/range encoding. Huffman encoding proceeds one bit at a time, treating everything as having a probability that's a power of 1/2. Arithmetic/range encoding uses more accurate probabilities and then packs them together, letting you save a fraction of a bit with every symbol. As an analogy, consider how to encode 3 decimal digits. You could spend 4 bits on each digit, and it would take 12 bits total. Or you could combine them into a single 10 bit number.
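The decimal analogy made concrete (an illustrative sketch, not part of any real codec): three digits at 4 bits apiece cost 12 bits, but the combined value 0..999 fits in 10 bits, saving a fraction of a bit per digit:

```c
/* Three decimal digits packed as one number: 0..999 fits in 10 bits,
   versus 12 bits when each digit gets its own 4-bit field. */
unsigned pack3(unsigned d0, unsigned d1, unsigned d2) {
    return d0 * 100 + d1 * 10 + d2;
}

/* Recover the three digits from the packed value. */
void unpack3(unsigned v, unsigned *d0, unsigned *d1, unsigned *d2) {
    *d0 = v / 100;
    *d1 = (v / 10) % 10;
    *d2 = v % 10;
}
```

Arithmetic/range coding generalizes this: it packs symbols of arbitrary (not just power-of-two) probabilities into one long number.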

[+] brian-armstrong|9 years ago|reply
I really enjoyed this article. It's just enough detail to leave you thinking, "Hey, I should write my own DEFLATE library"
[+] xxs|9 years ago|reply
>>"Hey, I should write my own DEFLATE library"

Did that in Java (2011), and the impl. was faster than the native one (partly due to the lack of proper customization of zlib from Java code)

[+] xxs|9 years ago|reply
The article mentions LZW. In terms of algorithmic elegance, LZW should be head and shoulders above 'deflate'. LZW got patented, which (aside from GIFs) screwed up its popularity. 'deflate' allows custom dictionaries, which adds some extra complexity.

I'd strongly disagree with "In many people's eyes compression was still synonymous with run-length encoding", as arj was extremely popular, to the point it was used to cure full stealth viruses (v512, for example). Hard to provide references now, but I'm willing to search some old BBSes.

Other than that - fairly nice/nostalgic article.

[+] eps|9 years ago|reply
arj ... daaamn. That brings back some sweet memories to the soundtrack of a modem handshake whistle. Good thing we still have Norton Commander in the form of Far Manager to show some of the past glory to youngsters :)
[+] gpvos|9 years ago|reply
I very recently had to look at an LZW decoder, and it is indeed more elegant than Deflate as described here. However, LZW doesn't have a Huffman post-encoding step, so it does not compress as well as Deflate.
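For comparison, a toy LZW encoder fits in a few lines. This is a hedged illustration only: codes 0..255 stand for single bytes, each new (prefix, byte) pair is learned as the next free code, and the output is left as integer codes with no bit-packing, linear dictionary search included:

```c
#include <assert.h>

#define LZW_MAX 4096

/* Toy LZW encoder sketch. Codes 0..255 are single bytes; each unseen
   (prefix, byte) pair gets the next free code. Linear search and no
   bit-packing of the output -- purely for illustration. */
int lzw_encode(const unsigned char *in, int n, int *out) {
    static int pre[LZW_MAX];
    static unsigned char suf[LZW_MAX];
    int entries = 256, m = 0, w, i, j;
    if (n == 0) return 0;
    w = in[0];
    for (i = 1; i < n; i++) {
        for (j = 256; j < entries; j++)        /* is w + in[i] known? */
            if (pre[j] == w && suf[j] == in[i]) break;
        if (j < entries) { w = j; continue; }  /* yes: extend the match */
        out[m++] = w;                          /* no: emit code for w   */
        if (entries < LZW_MAX) {               /* ...and learn the pair */
            pre[entries] = w; suf[entries] = in[i]; entries++;
        }
        w = in[i];
    }
    out[m++] = w;
    return m;  /* number of codes written */
}
```

The elegance is that the decoder can rebuild the same dictionary from the code stream alone; Deflate instead sends explicit (distance, length) matches and then entropy-codes them.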
[+] jamesfisher|9 years ago|reply
> This paragraph has five-hundred and sixty-one lower-case letters in

It must have taken some trial-and-error to make this sentence correct, since the number itself adds some letters.

[+] syntheticnature|9 years ago|reply
Interestingly, it can work in-place if you change it to "This paragraph has five-hundred and sixty-three lower-case letters in"
[+] to3m|9 years ago|reply
Pick a value that's close enough, and tweak everything else until it's true :) This post has one hundred and thirty-two chars in it.
[+] shenberg|9 years ago|reply
The most clever part of deflate is actually how it stores the Huffman trees: not the symbol frequencies, but rather the length in bits of each symbol's code. Combining the length of each symbol with its lexicographical ordering is enough to uniquely specify the actual tree.

Ok, some research showed me that the idea can probably be traced back to Schwartz & Kallick 1964 (can't find the text), popularized by TAOCP vol. 1...

[+] userbinator|9 years ago|reply
The technique is known as "Canonical Huffman" and is quite elegant indeed. It is best understood by considering the codes as binary numbers, then the tree structure naturally appears. There's a very short algorithm to generate the appropriate encoding and decoding tables in the JPEG standard and it involves not much more than shifts and adds.
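A sketch of that algorithm as it appears in RFC 1951: count how many codes exist at each bit length, derive the numerically smallest code of each length, then hand out consecutive codes in symbol order:

```c
#include <assert.h>

#define MAX_BITS 15

/* Canonical Huffman code assignment (RFC 1951 style sketch): given only
   the bit length of each symbol's code, recover the codes themselves.
   Symbols of equal length get consecutive values in symbol order. */
void canonical_codes(const int *len, int nsym, int *code) {
    int bl_count[MAX_BITS + 1] = {0};
    int next_code[MAX_BITS + 1] = {0};
    int bits, i, c = 0;
    for (i = 0; i < nsym; i++)
        bl_count[len[i]]++;                 /* codes per bit length */
    bl_count[0] = 0;                        /* length 0 = unused symbol */
    for (bits = 1; bits <= MAX_BITS; bits++) {
        c = (c + bl_count[bits - 1]) << 1;  /* smallest code of this length */
        next_code[bits] = c;
    }
    for (i = 0; i < nsym; i++)
        if (len[i])
            code[i] = next_code[len[i]]++;  /* consecutive within a length */
}
```

Running it on RFC 1951's own example (lengths 3,3,3,3,3,2,4,4 for symbols A..H) reproduces the codes 010, 011, 100, 101, 110, 00, 1110, 1111.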
[+] Jasper_|9 years ago|reply
I wouldn't say that entropy coding is "separate" from dictionary / window compression. DEFLATE surely isn't the first algorithm to combine LZ77 and some form of entropy coding, and surely won't be the last.

I highly recommend that everybody go out and build an INFLATE implementation; it's a lot of fun.

Mine can be found at https://github.com/magcius/libt2/blob/master/t2_inflate.h

[+] dunham|9 years ago|reply
I did this in javascript years ago (I had done snappy first, but wanted smaller data). It's fun and not too difficult.

The deflate side seems a little more complex. I played with it while on vacation about a month ago, but stalled after doing the LZ77 and Huffman table generation, because I really just wanted to learn how the LZ77 bits worked, and I don't actually have a project that needs it.

[+] flohofwoe|9 years ago|reply
Anyone know what's the history behind that setjmp/longjmp as 'poor man's exception' error handling? Since C works perfectly fine without exceptions, why add such a hack? I've seen this in libpng and I think libjpeg has it too (one of many reasons I'm avoiding those libs).
[+] etatoby|9 years ago|reply
If you have ever done any non-trivial coding in C, you will have come across the need (or the advantages) of using goto for local error handling within a function.

You can write the error handler code for a given function (which usually just resets some resources and returns an appropriate error value) in a single place, usually at the end, but jump to it from within any loop or structure within the function, as soon as you detect that the computation cannot proceed.
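The single-exit goto pattern described looks roughly like this (the `copy_file` helper is a hypothetical example, not from any library mentioned in the thread):

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

/* One error-handling site at the end of the function, reached by goto
   from every failure point; resources are released in reverse order. */
int copy_file(const char *src, const char *dst) {
    int rc = -1;              /* pessimistic default */
    FILE *in = NULL, *out = NULL;
    char *buf = NULL;
    size_t n;

    in = fopen(src, "rb");
    if (!in) goto cleanup;
    out = fopen(dst, "wb");
    if (!out) goto cleanup;
    buf = malloc(4096);
    if (!buf) goto cleanup;

    while ((n = fread(buf, 1, 4096, in)) > 0)
        if (fwrite(buf, 1, n, out) != n) goto cleanup;
    rc = 0;                   /* success */

cleanup:                      /* safe for any partial initialization */
    free(buf);
    if (out) fclose(out);
    if (in) fclose(in);
    return rc;
}
```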

setjmp / longjmp is the natural extension of this mechanism to an entire library, as in a set of tightly coupled functions that are hopefully hidden behind a public API. I think it's a good design choice. If done properly it makes the code more readable, more easily maintainable, and more optimized for the most common execution path. I wouldn't avoid using those libraries just for that design choice.
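A minimal sketch of that library-wide shape (function names here are hypothetical, but the structure mirrors what libpng does): the public entry point calls setjmp once, and any helper deep in the call tree can longjmp straight back to it on error:

```c
#include <setjmp.h>
#include <assert.h>

/* Sketch of library-wide error handling with setjmp/longjmp.
   Names are hypothetical; only the shape matters. */
static jmp_buf decode_env;

static void read_chunk(int corrupt) {
    /* ...imagine this is deep inside the call tree... */
    if (corrupt)
        longjmp(decode_env, 1);  /* unwind straight to the handler */
}

/* Public API entry point: returns 0 on success, -1 on any error. */
int decode_image(int corrupt) {
    if (setjmp(decode_env))      /* nonzero only after a longjmp */
        return -1;               /* the single error-handling site */
    read_chunk(corrupt);
    return 0;
}
```

Intermediate functions never check or propagate error codes, which is exactly the readability win described above; the cost is that any resources they hold must be tracked somewhere the handler can reach.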

[+] johncolanduoni|9 years ago|reply
C lacks "zero-cost" exceptions, where the code should be just as fast as it would be if no exceptions were involved as long as one isn't thrown. Although they are much more expensive when thrown than checking a return value and using indirection for any logical return values, they are a bit cheaper when you don't expect the exception to happen very often, or don't care about performance if it does (e.g. a corrupt png or jpeg file).

My dream is something like the UX of Swift's exceptions (basically normal C error handling except for it does all the tedious pointer work for you) with the ability to tell the compiler to either implement the exceptions as return values or stack unwinds for a particular code path with the flick of a switch.

[+] lmm|9 years ago|reply
C doesn't work fine without exceptions. You have to write an enormous amount of boilerplate code to propagate error-returns correctly, or else an enormous amount of boilerplate duplication of your error-handling code. Or you use macros to reduce your boilerplate but it's difficult to do that in a way that handles different return types correctly, and once you get to the point of using a different return convention you're basically implementing your own language.
[+] searealist|9 years ago|reply
You don't have to use setjmp/longjmp in libpng, it's just the default behavior. You can instead set an error callback to do whatever you want.
[+] chubot|9 years ago|reply
I see setjmp/longjmp used a lot in recursive descent parsers in C.

I think people got tired of using up their return values for error codes. And it is certainly more efficient to use this style than manually breaking out of each stack frame.

C is full of things like this which break abstractions -- it doesn't seem particularly odd to me. Though I do remember that it was one of the things at the end of C books that I didn't understand for a long time.

What's the downside?

It doesn't play well with C++ because destructors have to be run at the end of every scope, but that doesn't apply to C.

And I guess because it mutates globals, it is a bit confusing in a threaded context.

[+] pjc50|9 years ago|reply
It's the only way to do a non-local return, otherwise you have to make sure you check every error return condition. That's part of why Rust forces the programmer to check all return values, I believe.
[+] banachtarski|9 years ago|reply
Using additional bits to encode lengths isn't really that different from having a marker indicating whether the payload is a match vector or not. In fact, it could be argued to be less efficient for certain types of data, depending on the Huffman distribution, since all "words" are larger than a byte; it also prevents you from encoding longer matches. I treat the decision as a compromise between the two extremes that "worked" for the majority of types of data "good enough" - for some definition of both quoted words. The entire idea works well enough, but I wouldn't go out of my way to declare it a marvel of engineering or anything.
[+] HisGraceTheDuck|9 years ago|reply
"it could be argued to be less efficient for certain types of data" This is true of all lossless compression.
[+] caf|9 years ago|reply
The single-bit method can actually be seen as an instance of the huffman-coded method, where the match lengths are all assigned equal entropy estimates and the probability of "match vector" versus "literal character" is hard-wired to 0.5.
[+] brian-armstrong|9 years ago|reply
This comment shows a baffling level of arrogance. And it can hardly be considered a happy accident; compression methods are tested against real corpuses.