I would love to see a Python (or Haskell) implementation of gzip (and others) that is optimized for:
* readability and
* introspection
rather than speed and memory usage. That way, you could step into every primitive, and see intermediate results in a human-readable representation (as provided by the programming language). Encoding that stuff to binary would always happen in a separate step.
That way, those compression algorithms, with all details, would be easier to learn and to understand. Also, development of new, customized compression schemes would be a lot easier than it is today, where you either have to take existing ones, or have to start from scratch.
A custom implementation might be as simple as a rearrangement and recombination of existing building blocks, or may introduce completely new (possibly domain specific) building blocks.
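For what it's worth, here is roughly what I mean: a toy LZ77-style codec (my own sketch, with toy window and match-length limits, nothing like gzip's real encoding) whose intermediate representation is a plain list of Python tuples you can print and poke at. Turning the tokens into bits would be a separate step.

```python
def lz77_tokens(data: bytes, window: int = 1024, max_len: int = 15):
    """Return a list of tokens: ('lit', byte) or ('ref', distance, length)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        start = max(0, i - window)
        for j in range(start, i):              # naive O(n^2) search, for readability
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= 3:                      # like gzip, refuse matches shorter than 3
            tokens.append(('ref', best_dist, best_len))
            i += best_len
        else:
            tokens.append(('lit', data[i]))
            i += 1
    return tokens

def lz77_decode(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):            # byte-at-a-time: handles overlapping copies
                out.append(out[-dist])
    return bytes(out)
```

Every intermediate result is inspectable in the REPL: `lz77_tokens(b"abcabcabc")` shows the literals and the back-reference directly, and `lz77_decode` round-trips them.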
You feeling ancient thinking back about Pak, lharc, arj and bimodem makes me feel really ancient. I remember dialing the BBS on the phone and then sticking the handset in the acoustic coupler of my Novation CAT which was connected to the LNW System Expansion attached to my TRS-80 Model 1.
You and me both --- I remember stumbling across .ARJ files on abandonware sites at the tail-end of the '90s. Even then, they were something of a curio; ZIP was the dominant format, and RAR was gathering pace.
I hadn't seen this before, but it does a great job of explaining how compression works in a way that the less technically minded can understand. I think I finally have an understanding of how compression works, rather than treating it as a semi-magical process as I've done previously.
A great article. I wonder if there are similarly thorough descriptions of the compression part of gzip, and how to write that well?
Decompression is perhaps the easy part of the process, and while you can understand the principles of compression from it, you can easily miss the massive complexity involved. A naive search for back-pointers is O(n^2) or worse, so a usable gzip implementation has to do cunning tricks to find repeating data efficiently.
I just peeked at the gzip source (actually, the algorithm.doc file in the root of the source). It looks like they limit back-pointers to the previous 32k bytes, immediately reducing the naive search to O(n) overall (a bounded amount of work per position).
From my brief reading, it looks like they use a hash table to speed up lookups. Each 3-byte string is entered into the hash table, with a pointer to the relevant location in the original text. Each hash index has a linked list of entries. To find which back-pointer to use for a given string, gzip finds the relevant linked list from the hash table and scans it looking for the longest match.
More recent entries in the list are preferred (as smaller distances make the pointers smaller because of the encoding). Additionally, the linked lists are arbitrarily truncated to a certain length (determined by the '-1' through '-9' run-time argument).
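A hand-written sketch of that hash-chain scheme (not gzip's actual code; the names and the `max_chain` default are mine). Every 3-byte string is recorded under its chain of positions; the chain is scanned newest-first and cut off after `max_chain` entries, which is roughly the knob behind the -1 through -9 levels.

```python
from collections import defaultdict

WINDOW = 32 * 1024   # gzip's back-pointer limit
MIN_MATCH = 3

def insert(data, i, chains):
    """Record position i under the 3-byte string starting there."""
    if i + MIN_MATCH <= len(data):
        chains[data[i:i + MIN_MATCH]].append(i)

def find_match(data, i, chains, max_chain=128):
    """Scan the chain for data[i:i+3], newest first; return (distance, length) or None."""
    if i + MIN_MATCH > len(data):
        return None
    best = None
    for j in reversed(chains[data[i:i + MIN_MATCH]][-max_chain:]):
        if i - j > WINDOW:
            break                # entries further down the chain are even older
        length = 0
        while i + length < len(data) and data[j + length] == data[i + length]:
            length += 1
        if best is None or length > best[1]:
            best = (i - j, length)
    return best
```

Scanning newest-first with a strict `>` comparison also gives the recency preference: among equal-length matches, the smallest distance wins.

```python
chains = defaultdict(list)
data = b"abcXabc"
for pos in range(4):
    insert(data, pos, chains)
find_match(data, 4, chains)   # (4, 3): "abc" again, four bytes back
```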
gzip also has a 'lazy evaluation mechanism'. After it finds a match, it tries matching from the next byte anyway (even though that byte is covered by the current match). If a longer match is found there, gzip discards the previous pointer, emits the first byte as a literal, and uses the new match. This is disabled in the fast speed modes.
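The lazy-matching idea, sketched in Python (illustrative only; `naive_find_match` is my stand-in for gzip's real hash-chain search):

```python
def naive_find_match(data, i, window=1024):
    """Best (distance, length) match of length >= 3 before position i, or None."""
    best = None
    for j in range(max(0, i - window), i):
        length = 0
        while i + length < len(data) and data[j + length] == data[i + length]:
            length += 1
        if length >= 3 and (best is None or length > best[1]):
            best = (i - j, length)
    return best

def lazy_deflate(data):
    """Greedy LZ77 with one-byte lookahead: prefer the match at i+1 if longer."""
    tokens, i = [], 0
    while i < len(data):
        match = naive_find_match(data, i)
        if match is None:
            tokens.append(('lit', data[i]))
            i += 1
            continue
        nxt = naive_find_match(data, i + 1)
        if nxt is not None and nxt[1] > match[1]:
            # The match starting one byte later is longer: emit this byte
            # as a literal and defer the decision to the next iteration.
            tokens.append(('lit', data[i]))
            i += 1
        else:
            dist, length = match
            tokens.append(('ref', dist, length))
            i += length
    return tokens

def decode(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):    # byte-by-byte copy handles overlaps
                out.append(out[-dist])
    return bytes(out)
```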
If they are doing anything particularly clever, it is in the implementation details. The algorithm itself seems relatively straightforward.
joosters | 12 years ago
For understanding the algorithm, surely it's better to look at sources like this document rather than a code interpretation of it?
pjmlp | 12 years ago
Which had the feature before PkZip, if memory serves me right. Or my friends in the Amiga world exchanging LHA files.