top | item 6920822

Dissecting the GZIP format (2011)

102 points | siromoney | 12 years ago | infinitepartitions.com

19 comments

[+] vog|12 years ago|reply
I would love to see a Python (or Haskell) implementation of gzip (and others) that is optimized for:

* readability and

* introspection

rather than speed and memory usage. That way, you could step into every primitive, and see intermediate results in a human-readable representation (as provided by the programming language). Encoding that stuff to binary would always happen in a separate step.

That way, those compression algorithms, with all details, would be easier to learn and to understand. Also, development of new, customized compression schemes would be a lot easier than it is today, where you either have to take existing ones, or have to start from scratch.

A custom implementation might be as simple as a rearrangement and recombination of existing building blocks, or may introduce completely new (possibly domain specific) building blocks.

[+] joosters|12 years ago|reply
GNU gzip is surprisingly well commented (as in, I was surprised to find any multi-paragraph comments at all!)

For understanding the algorithm, surely it's better to look at sources like this document rather than a code interpretation of it?

[+] notlisted|12 years ago|reply
Nice write-up. Brings back good memories. (Pak, LHarc and above all ARJ). BiModem FTW! I feel ancient.
[+] tssva|12 years ago|reply
You feeling ancient thinking back about Pak, lharc, arj and bimodem makes me feel really ancient. I remember dialing the BBS on the phone and then sticking the handset in the acoustic coupler of my Novation CAT which was connected to the LNW System Expansion attached to my TRS-80 Model 1.
[+] koralatov|12 years ago|reply
You and me both --- I remember stumbling across .ARJ files on abandonware sites at the tail end of the '90s. Even then, they were something of a curio; ZIP was the dominant format, and RAR was gathering pace.
[+] pjmlp|12 years ago|reply
Me too. I still remember the joy of being able to use ARJ in MS-DOS to split archives across multiple floppies.

Which had that feature before PkZip, if memory serves me right.

Or my friends on the Amiga world exchanging LHA files.

[+] koralatov|12 years ago|reply
I hadn't seen this before, but it does a great job of explaining how compression works in a way that the less technically minded can understand. I think I finally understand how compression works, rather than treating it as a semi-magical process as I've done previously.
[+] dunham|12 years ago|reply
I have a pretty solid understanding of compression algorithms and the BW transform in bzip2 still seems semi-magical to me:

  ftp://apotheca.hpl.hp.com/gatekeeper/pub/dec/SRC/research-reports/SRC-124.pdf
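
For anyone who wants to demystify it a little, the forward and inverse transforms fit in a few lines of naive Python. This is a sketch only --- bzip2 actually uses suffix sorting and block-wise processing, not this slow rotation sort:

```python
def bwt(s: str, eos: str = "\0") -> str:
    """Return the last column of the sorted rotation matrix."""
    s += eos  # sentinel character so the transform is invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, eos: str = "\0") -> str:
    """Rebuild the original string by repeatedly sorting prefixed columns."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    recovered = next(row for row in table if row.endswith(eos))
    return recovered.rstrip(eos)
```

The point of the transform is that it groups similar characters together ("banana" becomes "annb\0aa"), which makes the block much friendlier to the move-to-front and Huffman stages that follow.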
[+] joosters|12 years ago|reply
A great article. I wonder if there are similarly thorough descriptions of the compression part of gzip, and how to write that well?

Decompression is perhaps the easy part of the process, and while you can understand the principles of compression from it, you can easily miss the massive complexity involved. A naive search for back-pointers is O(n^2) or worse, so a usable gzip implementation has to do cunning tricks to find repeating data efficiently.
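
The naive approach in question looks something like this in Python (an illustration of the complexity problem, not anyone's real implementation): for every input position, scan the entire window byte by byte.

```python
def naive_longest_match(data: bytes, pos: int, window: int = 32 * 1024):
    """Return (distance, length) of the longest match before `pos`."""
    best_len, best_dist = 0, 0
    start = max(0, pos - window)
    for cand in range(start, pos):          # O(window) candidate positions...
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1                     # ...each compared byte by byte
        if length > best_len:
            best_len, best_dist = length, pos - cand
    return best_dist, best_len
```

With a 32k window that is up to tens of thousands of byte comparisons per input position, which is exactly why real implementations need the hash tricks discussed below.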

[+] gizmo686|12 years ago|reply
I just peeked at the gzip source (actually, the algorithm.doc file in the root of the source). It looks like they limit back-pointers to the previous 32k bytes, immediately bounding even the naive search at O(n) overall (with a large constant factor of up to 32k comparisons per position).

From my brief reading, it looks like they use a hash table to speed up lookups. Each 3-byte string is entered into the hash table, with a pointer to the relevant location in the original text. Each hash index has a linked list of entries. To find what back-pointer to use for a given string, gzip finds the relevant linked list from the hash table and scans it looking for the longest match.

More recent entries in the list are preferred (as smaller distances make the pointers smaller because of the encoding). Additionally, the linked lists are arbitrarily truncated to a certain length (determined by the '-1' through '-9' run-time argument).

gzip also has a 'lazy evaluation' mechanism: after it matches a string, it checks for a match starting at the next byte anyway (even if that byte is contained in the previous match). If a longer match is found there, gzip discards the previous pointer and uses the new match. This is disabled in the faster speed modes.
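
Pieced together, the hash-chain and lazy-matching scheme described here can be sketched in Python. This is a simplified model, not gzip's actual deflate code (which uses a rolling hash, fixed-size chain arrays, and more heuristics); `MAX_CHAIN` and the output token format are made up for illustration:

```python
from collections import defaultdict

WINDOW = 32 * 1024
MAX_CHAIN = 128   # hypothetical stand-in for the -1..-9 chain truncation

def longest_match(data: bytes, pos: int, chains) -> tuple:
    """Scan the chain for data[pos:pos+3], most recent entries first."""
    best_dist = best_len = 0
    for cand in chains.get(data[pos:pos + 3], [])[:MAX_CHAIN]:
        if pos - cand > WINDOW:
            break                            # older entries are out of the window
        n = 0
        while pos + n < len(data) and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:                     # strict >: ties keep the nearer match
            best_dist, best_len = pos - cand, n
    return best_dist, best_len

def lz77(data: bytes) -> list:
    """Emit literal bytes and (distance, length) pairs, with lazy matching."""
    chains, out, pos = defaultdict(list), [], 0
    while pos < len(data):
        dist, length = longest_match(data, pos, chains)
        chains[data[pos:pos + 3]].insert(0, pos)    # newest entry goes first
        if length >= 3 and pos + 1 < len(data):
            # lazy evaluation: peek one byte ahead before committing
            d2, l2 = longest_match(data, pos + 1, chains)
            if l2 > length:
                out.append(data[pos])               # demote this byte to a literal
                pos += 1
                dist, length = d2, l2
                chains[data[pos:pos + 3]].insert(0, pos)
        if length >= 3:
            out.append((dist, length))
            for p in range(pos + 1, pos + length):  # index the matched bytes too
                chains[data[p:p + 3]].insert(0, p)
            pos += length
        else:
            out.append(data[pos])
            pos += 1
    return out
```

Prepending new positions to each chain gives the "prefer more recent entries" behavior for free, and truncating the scan at `MAX_CHAIN` models the speed/ratio trade-off of the command-line levels.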

If they are doing anything particularly clever, it is in the implementation details. The algorithm itself seems relatively straightforward.