top | item 6920822

Dissecting the GZIP format (2011)

102 points | siromoney | 12 years ago | infinitepartitions.com

19 comments

[+] vog|12 years ago|reply
I would love to see a Python (or Haskell) implementation of gzip (and others) that is optimized for:

* readability and

* introspection

rather than speed and memory usage. That way, you could step into every primitive, and see intermediate results in a human-readable representation (as provided by the programming language). Encoding that stuff to binary would always happen in a separate step.

That way, those compression algorithms, with all details, would be easier to learn and to understand. Also, development of new, customized compression schemes would be a lot easier than it is today, where you either have to take existing ones, or have to start from scratch.

A custom implementation might be as simple as a rearrangement and recombination of existing building blocks, or may introduce completely new (possibly domain specific) building blocks.

[+] joosters|12 years ago|reply
GNU gzip is surprisingly well commented (as in, I was surprised to find any multi-paragraph comments at all!)

For understanding the algorithm, surely it's better to look at sources like this document rather than a code interpretation of it?

[+] notlisted|12 years ago|reply
Nice write-up. Brings back good memories. (Pak, LHarc and above all ARJ). BiModem FTW! I feel ancient.
[+] tssva|12 years ago|reply
You feeling ancient thinking back about Pak, lharc, arj and bimodem makes me feel really ancient. I remember dialing the BBS on the phone and then sticking the handset in the acoustic coupler of my Novation CAT which was connected to the LNW System Expansion attached to my TRS-80 Model 1.
[+] koralatov|12 years ago|reply
You and me both --- I remember stumbling across .ARJ files on abandonware sites at the tail end of the '90s. Even then, they were something of a curio; ZIP was the dominant format, and RAR was gathering pace.
[+] pjmlp|12 years ago|reply
Me too. I still remember the joy of being able to use ARJ in MS-DOS to split archives across multiple floppies.

Which had that feature before PkZip, if memory serves me right.

Or my friends on the Amiga world exchanging LHA files.

[+] koralatov|12 years ago|reply
I hadn't seen this before, but it does a great job of explaining how compression works in a way that the less technically minded can understand. I think I finally understand how compression works, rather than treating it as a semi-magical process as I've done previously.
[+] dunham|12 years ago|reply
I have a pretty solid understanding of compression algorithms and the BW transform in bzip2 still seems semi-magical to me:

  ftp://apotheca.hpl.hp.com/gatekeeper/pub/dec/SRC/research-reports/SRC-124.pdf
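
For anyone who wants to demystify it a little, the forward and inverse transforms fit in a few lines of naive Python. This is a sketch only --- bzip2 actually uses suffix sorting and block-wise processing, not this slow rotation sort:

```python
def bwt(s: str, eos: str = "\0") -> str:
    """Return the last column of the sorted rotation matrix."""
    s += eos  # sentinel character so the transform is invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, eos: str = "\0") -> str:
    """Rebuild the original string by repeatedly sorting prefixed columns."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    recovered = next(row for row in table if row.endswith(eos))
    return recovered.rstrip(eos)
```

The point of the transform is that it groups similar characters together ("banana" becomes "annb\0aa"), which makes the block much friendlier to the move-to-front and Huffman stages that follow.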
[+] joosters|12 years ago|reply
A great article. I wonder if there are similarly thorough descriptions of the compression part of gzip, and how to write that well?

Decompression is perhaps the easy part of the process, and while you can understand the principles of compression from it, you can easily miss the massive complexity involved. A naive search for back-pointers is O(n^2) or worse, so a usable gzip implementation has to do cunning tricks to find repeating data efficiently.
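
The naive approach in question looks something like this in Python (an illustration of the complexity problem, not anyone's real implementation): for every input position, scan the entire window byte by byte.

```python
def naive_longest_match(data: bytes, pos: int, window: int = 32 * 1024):
    """Return (distance, length) of the longest match before `pos`."""
    best_len, best_dist = 0, 0
    start = max(0, pos - window)
    for cand in range(start, pos):          # O(window) candidate positions...
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1                     # ...each compared byte by byte
        if length > best_len:
            best_len, best_dist = length, pos - cand
    return best_dist, best_len
```

With a 32k window that is up to tens of thousands of byte comparisons per input position, which is exactly why real implementations need the hash tricks discussed below.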

[+] gizmo686|12 years ago|reply
I just peeked at the gzip source (actually, the algorithm.doc file in the root of the source). It looks like they limit back-pointers to the previous 32k bytes, immediately bounding even the naive search at O(n) overall (with a large constant factor of up to 32k comparisons per position).

From my brief reading, it looks like they use a hash table to speed up lookups. Each 3-byte string is entered into the hash table, with a pointer to the relevant location in the original text. Each hash index has a linked list of entries. To find what back-pointer to use for a given string, gzip finds the relevant linked list from the hash table and scans it looking for the longest match.

More recent entries in the list are preferred (as smaller distances make the pointers smaller because of the encoding). Additionally, the linked lists are arbitrarily truncated to a certain length (determined by the '-1' through '-9' run-time argument).

gzip also has a 'lazy evaluation' mechanism: after it matches a string, it checks for a match starting at the next byte anyway (even if that byte is contained in the previous match). If a longer match is found there, gzip discards the previous pointer and uses the new match. This is disabled in the faster speed modes.
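
Pieced together, the hash-chain and lazy-matching scheme described here can be sketched in Python. This is a simplified model, not gzip's actual deflate code (which uses a rolling hash, fixed-size chain arrays, and more heuristics); `MAX_CHAIN` and the output token format are made up for illustration:

```python
from collections import defaultdict

WINDOW = 32 * 1024
MAX_CHAIN = 128   # hypothetical stand-in for the -1..-9 chain truncation

def longest_match(data: bytes, pos: int, chains) -> tuple:
    """Scan the chain for data[pos:pos+3], most recent entries first."""
    best_dist = best_len = 0
    for cand in chains.get(data[pos:pos + 3], [])[:MAX_CHAIN]:
        if pos - cand > WINDOW:
            break                            # older entries are out of the window
        n = 0
        while pos + n < len(data) and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:                     # strict >: ties keep the nearer match
            best_dist, best_len = pos - cand, n
    return best_dist, best_len

def lz77(data: bytes) -> list:
    """Emit literal bytes and (distance, length) pairs, with lazy matching."""
    chains, out, pos = defaultdict(list), [], 0
    while pos < len(data):
        dist, length = longest_match(data, pos, chains)
        chains[data[pos:pos + 3]].insert(0, pos)    # newest entry goes first
        if length >= 3 and pos + 1 < len(data):
            # lazy evaluation: peek one byte ahead before committing
            d2, l2 = longest_match(data, pos + 1, chains)
            if l2 > length:
                out.append(data[pos])               # demote this byte to a literal
                pos += 1
                dist, length = d2, l2
                chains[data[pos:pos + 3]].insert(0, pos)
        if length >= 3:
            out.append((dist, length))
            for p in range(pos + 1, pos + length):  # index the matched bytes too
                chains[data[p:p + 3]].insert(0, p)
            pos += length
        else:
            out.append(data[pos])
            pos += 1
    return out
```

Prepending new positions to each chain gives the "prefer more recent entries" behavior for free, and truncating the scan at `MAX_CHAIN` models the speed/ratio trade-off of the command-line levels.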

If they are doing anything particularly clever, it is in the implementation details. The algorithm itself seems relatively straightforward.