In other compression news, Apple open sourced their implementation of lzfse yesterday: https://github.com/lzfse/lzfse. It's based on a relatively new type of coding: asymmetric numeral systems (ANS). Huffman coding is only optimal if you consider one bit as the smallest unit of information. ANS (and, more broadly, arithmetic coding) allows for fractional bits and gets closer to the Shannon limit. It's also simpler to implement than (real-world) Huffman.
Unfortunately, most open source implementations of ANS are not highly optimized and quite division heavy, so they lag on speed benchmarks. Apple's implementation looks pretty good (they're using it in OS X, err, macOS, and iOS) and there's some promising academic work being done on better implementations (optimizing Huffman for x86, ARM, and FPGA is a pretty well studied problem). The compression story is still being written.
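To make the "fractional bits" point concrete, here is a small sketch (standard library only) comparing the Shannon entropy of a skewed distribution with the average code length of a Huffman code built for it. Huffman must give even a 90%-probable symbol a whole bit, though it ideally deserves about 0.15 bits:

```python
import heapq, math

def huffman_lengths(freqs):
    """Return code lengths for a Huffman code over the given frequencies."""
    heap = [(f, i, [sym]) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    counter = len(heap)                      # tie-breaker for the heap
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:              # merging deepens every member
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, counter, syms1 + syms2))
        counter += 1
    return lengths

# A skewed distribution: one very common symbol.
probs = {"a": 0.9, "b": 0.05, "c": 0.03, "d": 0.02}
entropy = -sum(p * math.log2(p) for p in probs.values())
lengths = huffman_lengths(probs)
avg_huffman = sum(probs[s] * lengths[s] for s in probs)

print(f"Shannon entropy: {entropy:.3f} bits/symbol")
print(f"Huffman average: {avg_huffman:.3f} bits/symbol")
```

The gap between the two numbers is exactly what fractional-bit coders like ANS and arithmetic coding recover.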
More compression news: https://github.com/Cyan4973/zstd declared its format stable a couple of days ago, taking a step towards 1.0. In the example there, it's a little denser than gzip -1 and a lot faster, and it includes entropy encoding that uses finite state machines (which I don't truly understand; maybe there's some similarity to lzfse there). It can look at your data and construct a static dictionary, which sounds cool for tasks like packing a bunch of 4 KB database pages.
(https://github.com/google/gipfeli is another compressor aiming at that general space ('fast but not Snappy/LZ4 fast'), but I don't think it caught on much.)
I followed your link, which leads me to rant a bit:
LZFSE has been out for a year. Not one mention on Wikipedia. The repo lacks a good description of what LZFSE is. It also contains an LZVN encoder/decoder. The only information I found about it is a blog where someone seems to reverse engineer it for Hackintosh purposes.
I know documenting your software and presenting the case for why people should use your software/file format can be annoying to do, but it's really important.
Strange that Apple didn't publish it on https://github.com/apple (which holds all of their open-source Swift repos).
Also, to clarify, ANS is relatively new (2009) but arithmetic coding has been around for a long time. Historically it was avoided because of patents, many (all?) of which have now expired. Apparently ANS isn't going to be patented.
Apple's tANS implementation was suboptimal, both from the quantization and the symbol-spread point of view: http://encode.ru/threads/2221-LZFSE-New-Apple-Data-Compressi...
The author of that post (Eric Biggers) has made corrections, and one of them has already been merged into the lzfse GitHub repo (the second would make it incompatible).
There is a paper about an FPGA implementation of a tANS encoder ( ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7306068 ), but it uses a branch which is removed in more recent implementations (see the bottom of the first post of http://encode.ru/threads/2078-List-of-Asymmetric-Numeral-Sys... ).
It is worth mentioning that the rANS variant, which uses one multiplication per symbol (e.g. in Google's VP10), has recently exceeded the speed of tANS/FSE: https://github.com/jkbonfield/rans_static
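For reference, the range variant (rANS) is simple to write down, even if this toy is nowhere near the tuned implementations linked above. A naive sketch follows; it keeps the whole state in a Python big int, so the renormalization step of real coders is omitted, and it still divides where fast versions substitute a reciprocal multiplication:

```python
# Minimal rANS (range ANS) sketch. Frequencies sum to M = 8.
freqs = {"a": 5, "b": 2, "c": 1}
M = sum(freqs.values())
cdf, acc = {}, 0
for sym, f in freqs.items():        # cumulative frequencies
    cdf[sym] = acc
    acc += f

def encode(symbols):
    x = 1
    for s in reversed(symbols):     # rANS encodes in reverse order
        f = freqs[s]
        x = (x // f) * M + cdf[s] + (x % f)
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        slot = x % M                # which symbol owns this slot?
        s = next(t for t in freqs if cdf[t] <= slot < cdf[t] + freqs[t])
        out.append(s)
        x = freqs[s] * (x // M) + slot - cdf[s]
    return out

msg = list("abacaaba")
x = encode(msg)
print(x, x.bit_length())            # state grows ~ entropy of the message
assert decode(x, len(msg)) == msg
```

Rare symbols make the state jump by nearly log2(M/f) bits per step while common ones barely grow it, which is where the fractional-bit behavior comes from.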
Neat how involved he still is this long after the initial work.
When brotli came out he implemented his own decoder (https://github.com/madler/brotli/) to check the spec, and got some things clarified in RFC. Brotli is pretty DEFLATE-y in spirit, with changes that take advantage of today's hardware (larger history window, contexts in the entropy encoding) and a bunch of tweaks (different match encoding, static dictionary, etc.).
Reminds me of the time I took my friend to play D&D up in Lake Geneva with the Gygax family whom I had been gaming with for a bit.
He ended up getting into an argument with Gary about why dwarven women would have beards (of all things). I remember having to kick him under the table and quickly whisper "This is Gary Gygax, he made the game. If he says they have beards they have beards!"
It's annoyingly common: the OP doesn't mark this answer as accepted, or even acknowledge how amazing it is to get an answer from one of the technology's creators; instead he just goes on to ask a follow-up.
It seems like it wouldn't be that hard to create an indexed tar.gz format that's backwards compatible.
One way would be to use the last file in the tar as the index. As files are added, you remove the index, append the new file, append its basic metadata and compressed offset (maybe of the deflate chunk) to the index, update the index size in bytes in a small footer at the end of the index, and append the index back onto the compressed tar.
You can retrieve the index by starting at the end of the compressed archive and reading backwards until you find a deflate header (at most 65k plus a few more bytes, since that's the size of a deflate chunk). If it's an indexed tar, the last file will be the index, and the end of the index will be a footer with the index size (so you know the maximum you'll need to seek back from the end). This isn't extremely efficient, but it is limited in scope, and helped by knowing the index size.
You could verify the index by checking some or all of the reported file byte offsets. The worst case is many small files, one or more per deflate chunk, where you would have to visit each chunk. That makes the worst case equivalent to listing the files in an un-indexed tar.gz, plus the (relatively small) overhead of locating and reading the index.
Uncompressing the archive as a regular tar.gz would work normally, with one additional file (the index) included.
I imagine the reason this isn't popular is not that it hasn't been done, but that most people don't really need an index.
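Ignoring the gzip layer for brevity, the last-file-as-index idea can be sketched with stdlib tarfile. Offsets here are into the uncompressed tar stream (the scheme above would store deflate-chunk offsets instead), and the index member name `.index.json` is made up for the example:

```python
import io, json, tarfile

def build_indexed_tar(files):
    """Uncompressed-tar sketch: the last member is a JSON index mapping
    each file name to the byte offset of its tar header."""
    buf = io.BytesIO()
    index = {}
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            index[name] = buf.tell()           # offset of this member's header
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
        blob = json.dumps(index).encode()
        info = tarfile.TarInfo(".index.json")  # hypothetical index name
        info.size = len(blob)
        tar.addfile(info, io.BytesIO(blob))
    return buf.getvalue(), index

data, index = build_indexed_tar({"a.txt": b"hello", "b.txt": b"world"})

# Jump straight to b.txt via the index instead of scanning from the start.
view = io.BytesIO(data)
view.seek(index["b.txt"])
with tarfile.open(fileobj=view, mode="r|") as tar:  # forward-only stream read
    member = tar.next()
    assert member.name == "b.txt"
    assert tar.extractfile(member).read() == b"world"
```

A regular extraction still sees a normal tar, just with the extra index member at the end, which is the backwards-compatibility property the comment describes.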
The tar.gz format combines a collection of files into a single tar file first and then compresses that one file into a gz. It does not compress the individual files and then combine the compressed files into a tar. Accessing the last file requires decompressing the whole archive. What you described is the zip format. Both zip and gz use the DEFLATE algorithm, so there's no difference in compression; the difference is in how the files are packaged.
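The packaging difference is easy to see with the standard library: the zip reader can pull out one member directly, while the tar.gz reader has to decompress everything in front of it:

```python
import io, tarfile, zipfile

files = {"first.txt": b"A" * 1000, "last.txt": b"hello"}

# tar.gz: tar the files into one stream, then gzip that single stream.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tar:
    for name, content in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))

# zip: each member is deflated independently, with a central directory.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, content in files.items():
        zf.writestr(name, content)

# Reading one member of the zip touches only that member's bytes...
with zipfile.ZipFile(zip_buf) as zf:
    assert zf.read("last.txt") == b"hello"

# ...while reaching the last tar member decompresses everything before it.
with tarfile.open(fileobj=io.BytesIO(tar_buf.getvalue()), mode="r:gz") as tar:
    assert tar.extractfile("last.txt").read() == b"hello"
```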
Also, gzip supports on-the-fly stream-based compression and decompression, so anything that requires jumping to the end of the file to read or write during compression/decompression is a no-go.
After writing one myself, along with a guy on IRC who needed a "quick way to access huge .tar files", I found one written in Python: https://github.com/devsnd/tarindexer
He ended up doing the smart thing and changing his archiving program to not use .tar files, which solved all of his problems, but for the 2TB of data he already had this worked rather well.
If I recall correctly (and it's been a while): the initial header is some number of bytes from the beginning of the file, or you have to search for a known signature string ("PK" something). Then you have to go back and forth between that and the compressed data as you decompress. When compressing files, you have to go back to the beginning of the zip file (or disk 1 in a multi-disk archive) and update that table with the CRC info. It just wasn't practical for streaming, for multiple reasons. Remember the original zip file format was created by Phil Katz in about 1987. I believe they just thought it easier to start over with a better design for gzip, using the same compression algorithm.
One community that does use this technique (although the index is in a separate file) is the web archiving community's WARC file format. The Internet Archive Wayback Machine has ~ 13 petabytes of web archive stored that way, and you use it every time you look at a webpage in the Wayback. The file format is what you'd expect: each webpage is gzipped separately, the output is catted together, and the index stores the offset and length of each individual compressed webpage.
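The WARC trick is easy to reproduce with stdlib gzip, since concatenated gzip members are themselves a valid gzip stream:

```python
import gzip

records = [b"<html>page one</html>", b"<html>page two</html>"]

# Gzip each record separately and cat the outputs together, WARC-style.
archive = b""
index = []                  # (offset, length) per record
for rec in records:
    blob = gzip.compress(rec)
    index.append((len(archive), len(blob)))
    archive += blob

# Random access: slice one member out by offset/length, decompress it alone.
off, length = index[1]
assert gzip.decompress(archive[off:off + length]) == records[1]

# The concatenation is still a valid gzip stream end to end.
assert gzip.decompress(archive) == b"".join(records)
```

This is why the Wayback Machine only needs an offset and a length per page to serve any of its petabytes of captures.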
The last few days I find myself wondering if there needs to be some kind of org set up to preserve this sort of info.
Right now it seems to be strewn across a myriad of blogs, forums and whatnot that risk going poof. And even if the Internet Archive picks them up, it is anything but curated (unlike, say, Wikipedia, even with all its warts).
The Internet Archive has both curated and non-curated collections. Right now the Wayback Machine doesn't have any curation, but that doesn't mean it's going to stay that way.
However, in this situation, the collaboration between Wikipedia and the Wayback may be what you had in mind. We're working with Wikipedia to make sure all external links from Wikipedia articles are backed up in the Wayback Machine, and there's a Wikipedia bot going around adding these links to articles.
My father teaching me to type PKUNZIP on files that "ended with .zip" in the DOS shell (not long before the Norton Commander kind of GUI arrived on our computer) is one of my earliest memories as a toddler. I would ask him "What does it mean?" and he would simply not know. It was 1990 and I was 3 and a half, I think. When I learned what it stood for, it was kind of epic for me.
It is rare to be able to have a question answered so completely and from such a first-hand source. This post is gold and tickles me in all the right places.
StackOverflow is sitting on a veritable treasure trove of knowledge.
One important difference in practice is that zip files need to be saved to disk to be extracted. gzip files, on the other hand, can be stream-unzipped, i.e. curl http://example.com/foo.tar.gz | tar zxvf - is possible, but not with zip files. I am not sure if this is a limitation of the unzip tool. I would love to know if there is a workaround for this.
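tarfile's pipe modes show the same property in Python: "r|gz" reads strictly forward, so a non-seekable source works (the NonSeekable wrapper below is a stand-in for a socket or pipe):

```python
import io, tarfile

class NonSeekable(io.RawIOBase):
    """Forbids seeking, standing in for a socket or pipe."""
    def __init__(self, payload):
        self._inner = io.BytesIO(payload)
    def readable(self):
        return True
    def readinto(self, b):
        chunk = self._inner.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)
    def seekable(self):
        return False

# Build a small tar.gz in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo("hello.txt")
    body = b"streamed!"
    info.size = len(body)
    tar.addfile(info, io.BytesIO(body))

# "r|gz" never seeks, so the non-seekable source is fine; "r:gz"
# wants random access and would fail on it.
with tarfile.open(fileobj=NonSeekable(buf.getvalue()), mode="r|gz") as tar:
    for member in tar:
        print(member.name, tar.extractfile(member).read())
```

zip can't be consumed this way because the authoritative file table (the central directory) sits at the end of the archive.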
It even allows you to stream the zip while creating it, i.e. you do not need to know the compressed size of the files before starting to write the compressed file content.
You can't do that with a .zip file because the file header information is actually at the end of the file. You could work around that by sending the header first, though.
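The layout is easy to confirm with a few bytes of inspection: a zip starts with a local-file-header signature, and the end-of-central-directory record sits at the very end:

```python
import io, zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"hello zip")
data = buf.getvalue()

# Local file header first, central directory (and its end record) last.
assert data[:4] == b"PK\x03\x04"     # local file header signature
assert b"PK\x05\x06" in data[-100:]  # end-of-central-directory signature
```

That end record is what readers locate first, which is why a zip generally has to be fully on disk (or at least seekable) before it can be listed.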
That's funny; it was on hold when I saw it last, a few hours ago. Apparently whoever happens to be around with mod powers decided to hate fun a little less than usual. I'm sure it won't last.
I feel Stack Overflow mods have, to at least some degree, a different purpose on the site than the average user. Far too many times, I find that the answers judged bad by mods, or close-voted, are the ones that ended up helping me. Isn't that the point?
Stack Overflow's current objective is to answer precise technical questions related to programming. Open-ended questions are generally frowned upon because they tend to be argumentative and generate debate, which SO is not well equipped to deal with. In the very beginning of Stack Overflow, the goals weren't as clearly defined, and because of the overwhelming need for such a place, it quickly became flooded with any question related to computers. This is why Server Fault and Super User were created, with clearer rules regarding their respective content.
I am not sure why this question made it through, but "what do they have in common" generates a finite set of answers while "what are their differences" does not. The "real" place to ask such questions should be programmers.stackexchange.com. Unfortunately, it has 30 times fewer users and, frankly, no one really cares about it. Moderators should use "Migrate question to [...]" instead of simply closing it.
I think everyone will agree that a question with 500 upvotes and 150k page views is interesting. The real problem is that SO has not managed to find a home for it anywhere.
I wonder if that's a bit of snark about how, if he were to actually edit the Wikipedia entry on the topic, his edits would be removed because of Wikipedia's policies on primary sources contributing to an entry.
> This post is packed with so much history and information that I feel like some citations need be added incase people try to reference this post as an information source. Though if this information is reflected somewhere with citations like Wikipedia, a link to such similar cited work would be appreciated. - ThorSummoner
> I am the reference, having been part of all of that. This post could be cited in Wikipedia as an original source. – Mark Adler
Apple's FSE implementation is based on open-source work by an individual, which predates lzfse by about 20 months.
https://github.com/Cyan4973/FiniteStateEntropy
It's even less optimized than the original, due to a few errors that slipped through during lzfse's development.
http://encode.ru/threads/2221-LZFSE-New-Apple-Data-Compressi...
If this were reddit I'd post the hot fire gif. Eh, here it is anyway: http://i.imgur.com/VQLGJOL.gif
Also, as others mention, NASA mission engineer: https://en.wikipedia.org/wiki/Mark_Adler#Mars_exploration
Then I read the name. Adler, as in Adler-32 (the checksum function that zlib uses).
Then I knew it was real. The darn author of zlib answered the question. That's as close as you're gonna get to "primary source" folks!
https://users.cs.jmu.edu/buchhofp/forensics/formats/pkzip.ht...
https://en.wikipedia.org/wiki/Web_ARChive
http://stackoverflow.com/questions/6493270/why-is-tar-gz-sti...
https://www.youtube.com/watch?v=_zvFeHtcxuA
The whole "The BBS Documentary" is great and I recommend starting at the beginning if you're interested in it.
https://www.youtube.com/watch?v=dRap7uw9iWI
https://www.youtube.com/watch?v=3v_zlyHgazs