klauspost|2 years ago

As someone who has dabbled quite a lot with deflate, this was a very interesting find. It seems like the format that never wants to die.
First of all, I am surprised this is even possible. Given the extremely minimal and non-aligned nature of the deflate headers, I had actually dismissed this as infeasible, expecting the false positive rate to be too high.
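For context on why scanning is hard: per RFC 1951, a deflate block header is only three bits (BFINAL plus a 2-bit BTYPE) sitting at an arbitrary, unaligned bit offset, so almost any bit position "looks like" a potential header. A minimal illustrative sketch (not rapidgzip's actual block finder) of reading such a header in Python:

```python
import zlib

def read_bits(data: bytes, bit_offset: int, count: int) -> int:
    """Read `count` bits LSB-first starting at an absolute bit offset."""
    value = 0
    for i in range(count):
        byte_index, bit_index = divmod(bit_offset + i, 8)
        value |= ((data[byte_index] >> bit_index) & 1) << i
    return value

def parse_block_header(data: bytes, bit_offset: int):
    """Return (BFINAL, BTYPE) of the deflate block starting at bit_offset.

    BTYPE: 0 = stored, 1 = fixed Huffman, 2 = dynamic Huffman, 3 = invalid.
    Only BTYPE == 3 can be rejected from these bits alone; a real block
    finder must decode much more (e.g. the Huffman code lengths) to weed
    out false positives.
    """
    bfinal = read_bits(data, bit_offset, 1)
    btype = read_bits(data, bit_offset + 1, 2)
    return bfinal, btype

# Example on a real raw-deflate stream (wbits=-15 suppresses zlib headers):
comp = zlib.compressobj(9, zlib.DEFLATED, -15)
raw = comp.compress(b"hello world") + comp.flush()
bfinal, btype = parse_block_header(raw, 0)
```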
The paper and implementation are an amazing piece of engineering. Hats off to the author. With an index, you are in "easy" mode, so I consider the unindexed "hard mode" the big accomplishment.
I am still digesting it. If I read the paper correctly, the false positive rate is approximately one per 5 GB. Very reasonable.
mxmlnkn|2 years ago

You are right that the index is the easy mode. Over the years, there have been many implementations trying to add such an index to the gzip metadata itself or as a sidecar file, with bgzip probably being the best-known one. None of them really stuck, hence the need for a generic multi-threaded decompressor. A probably incomplete list of such implementations can be found in this issue: https://github.com/mxmlnkn/rapidgzip/issues/8
The index makes it so easy that I can simply delegate decompression to zlib. Since the paper's publication, I have actually improved upon this by delegating to ISA-L / igzip instead, which is twice as fast. This is already in the 0.8.0 release.
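The "easy mode" can be sketched in a few lines. This is not rapidgzip's implementation, just an illustrative toy: chunk boundaries are created here with Z_FULL_FLUSH, which byte-aligns the stream and resets the window, so each index entry needs no 32 KiB window dictionary. Real indexes (e.g. bgzip's, or gzip streams indexed after the fact) store bit offsets plus the window and would pass it via `zdict`.

```python
import zlib

def compress_with_index(data: bytes, chunk_size: int = 1 << 16):
    """Compress to raw deflate and record (compressed, decompressed)
    offset pairs at which independent decompression can start."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # -15: raw deflate
    out, index, pos = [], [], 0
    for i in range(0, len(data), chunk_size):
        index.append((pos, i))
        part = comp.compress(data[i:i + chunk_size])
        # Z_FULL_FLUSH byte-aligns and resets the window, making the
        # next chunk decompressable without any history.
        part += comp.flush(zlib.Z_FULL_FLUSH)
        out.append(part)
        pos += len(part)
    out.append(comp.flush())  # final block
    return b"".join(out), index

def decompress_from_checkpoint(blob: bytes, index, k: int) -> bytes:
    """Decompress everything from checkpoint k onward; a parallel
    decompressor would hand each such range to a separate worker."""
    start = index[k][0]
    return zlib.decompressobj(-15).decompress(blob[start:])
```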
As derived from Table 1, the false positive rate is 1 Tbit / 202 ≈ 5 Gbit, i.e., one per 625 MB, for deflate blocks with dynamic Huffman codes. For non-compressed blocks, the false positive rate is roughly one per 500 KB; however, non-compressed blocks can basically be memcpy'd or skipped over, and then the next deflate header can be checked without much latency. For dynamic blocks, on the other hand, the whole block needs to be decompressed first to find the next one. So the much higher false positive rate for non-compressed blocks does not introduce that much overhead. Furthermore, consider that the block finder is basically only applied to the first ~0-128 KiB of data in each 4 MiB chunk (until it finds a valid block). So, in terms of real input archive size, the effective false positive rate is smaller by a factor of 32.
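This arithmetic can be checked back-of-the-envelope (assuming Table 1 indeed reports 202 false positives per 1 Tbit of tested input for dynamic Huffman blocks):

```python
# False positive spacing for dynamic-Huffman deflate blocks.
TBIT = 1e12
false_positives = 202
bits_per_fp = TBIT / false_positives   # ~4.95e9 bits, i.e. ~5 Gbit
bytes_per_fp = bits_per_fp / 8         # ~6.2e8 bytes, roughly the 625 MB quoted

# The block finder only scans until it locates a valid block, i.e.
# roughly the first 128 KiB of each 4 MiB chunk in the worst case.
chunk = 4 * 1024 * 1024
scanned = 128 * 1024
reduction = chunk / scanned            # factor 32 in effective FP rate
```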
I have some profiling built into rapidgzip, which is printed with -v, e.g., rapidgzip -v -d -o /dev/null 20xsilesia.tar.gz:
    Time spent in block finder          : 0.227751 s
    Time spent decoding                 : 20.4015 s
    Time spent allocating and copying   : 2.71328 s
    Time spent applying the last window : 1.39377 s
    Replaced marker bytes               : 1 GiB 497 MiB 311 KiB 503 B
The block finder adds about 1% overhead and the marker replacement about 7% overhead. It probably would be nice to add a statistic for the false positive rate to the -v output.