In other compression news, Apple open sourced their implementation of lzfse yesterday: https://github.com/lzfse/lzfse. It's based on a relatively new type of coding: asymmetric numeral systems (ANS). Huffman coding is only optimal if you consider one bit as the smallest unit of information. ANS (and, more broadly, arithmetic coding) allows for fractional bits and gets closer to the Shannon limit. It's also simpler to implement than (real-world) Huffman.
Unfortunately, most open source implementations of ANS are not highly optimized and quite division heavy, so they lag on speed benchmarks. Apple's implementation looks pretty good (they're using it in OS X, err, macOS, and iOS) and there's some promising academic work being done on better implementations (optimizing Huffman for x86, ARM, and FPGA is a pretty well studied problem). The compression story is still being written.
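To make the "fractional bits" point concrete, here is a small sketch (standard library only) comparing the Shannon entropy of a skewed distribution with the average code length of a Huffman code built for it. Huffman must give even a 90%-probable symbol a whole bit, though it ideally deserves about 0.15 bits:

```python
import heapq, math

def huffman_lengths(freqs):
    """Return code lengths for a Huffman code over the given frequencies."""
    heap = [(f, i, [sym]) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    counter = len(heap)                      # tie-breaker for the heap
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:              # merging deepens every member
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, counter, syms1 + syms2))
        counter += 1
    return lengths

# A skewed distribution: one very common symbol.
probs = {"a": 0.9, "b": 0.05, "c": 0.03, "d": 0.02}
entropy = -sum(p * math.log2(p) for p in probs.values())
lengths = huffman_lengths(probs)
avg_huffman = sum(probs[s] * lengths[s] for s in probs)

print(f"Shannon entropy: {entropy:.3f} bits/symbol")
print(f"Huffman average: {avg_huffman:.3f} bits/symbol")
```

The gap between the two numbers is exactly what fractional-bit coders like ANS and arithmetic coding recover.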
More compression news: https://github.com/Cyan4973/zstd declared its format stable a couple of days ago, taking a step towards 1.0. In the example there, it's a little denser than gzip -1 and a lot faster, and it includes entropy encoding that uses finite state machines (which I don't truly understand; maybe there's some similarity to lzfse there). It can look at your data and construct a static dictionary, which sounds cool for tasks like packing a bunch of 4 KB database pages.
(https://github.com/google/gipfeli is another compressor aiming at that general space ('fast but not Snappy/LZ4 fast'), but I don't think it caught on much.)
I followed your link, which leads me to rant a bit:
LZFSE has been out for a year. Not one mention on Wikipedia. The repo lacks a good description of what LZFSE is. It also contains an LZVN encoder/decoder. The only information I found about it is a blog where someone seems to reverse engineer it for Hackintosh purposes.
I know documenting your software and presenting the case for why people should use your software/file format can be annoying to do, but it's really important.
Strange that Apple didn't publish it on https://github.com/apple (which holds all of their open-source Swift repos).
Also, to clarify, ANS is relatively new (2009) but arithmetic coding has been around for a long time. Historically it was avoided because of patents, many (all?) of which have now expired. Apparently ANS isn't going to be patented.
Apple's tANS implementation was suboptimal, both from the quantization and the symbol-spread point of view: http://encode.ru/threads/2221-LZFSE-New-Apple-Data-Compressi...
The author of that post (Eric Biggers) has made corrections, and one of them has already been merged into the lzfse GitHub repo (the second would make it incompatible).
There is a paper about an FPGA implementation of a tANS encoder ( ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7306068 ), but it uses a branch which is removed in more recent implementations (see the bottom of the first post of http://encode.ru/threads/2078-List-of-Asymmetric-Numeral-Sys... ).
It is worth mentioning that the rANS variant, which uses one multiplication per symbol (e.g. in Google's VP10), has recently exceeded the speed of tANS/FSE: https://github.com/jkbonfield/rans_static
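For reference, the range variant (rANS) is simple to write down, even if this toy is nowhere near the tuned implementations linked above. A naive sketch follows; it keeps the whole state in a Python big int, so the renormalization step of real coders is omitted, and it still divides where fast versions substitute a reciprocal multiplication:

```python
# Minimal rANS (range ANS) sketch. Frequencies sum to M = 8.
freqs = {"a": 5, "b": 2, "c": 1}
M = sum(freqs.values())
cdf, acc = {}, 0
for sym, f in freqs.items():        # cumulative frequencies
    cdf[sym] = acc
    acc += f

def encode(symbols):
    x = 1
    for s in reversed(symbols):     # rANS encodes in reverse order
        f = freqs[s]
        x = (x // f) * M + cdf[s] + (x % f)
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        slot = x % M                # which symbol owns this slot?
        s = next(t for t in freqs if cdf[t] <= slot < cdf[t] + freqs[t])
        out.append(s)
        x = freqs[s] * (x // M) + slot - cdf[s]
    return out

msg = list("abacaaba")
x = encode(msg)
print(x, x.bit_length())            # state grows ~ entropy of the message
assert decode(x, len(msg)) == msg
```

Rare symbols make the state jump by nearly log2(M/f) bits per step while common ones barely grow it, which is where the fractional-bit behavior comes from.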
Neat how involved he still is this long after the initial work.
When brotli came out he implemented his own decoder (https://github.com/madler/brotli/) to check the spec, and got some things clarified in RFC. Brotli is pretty DEFLATE-y in spirit, with changes that take advantage of today's hardware (larger history window, contexts in the entropy encoding) and a bunch of tweaks (different match encoding, static dictionary, etc.).
Reminds me of the time I took my friend to play D&D up in Lake Geneva with the Gygax family whom I had been gaming with for a bit.
He ended up getting into an argument with Gary about why dwarven women would have beards (of all things). I remember having to kick him under the table and quickly whisper "This is Gary Gygax, he made the game. If he says they have beards they have beards!"
It's annoyingly common: the OP doesn't mark this answer as accepted, or even acknowledge how amazing it is to get an answer from one of the technology's creators; instead he just goes on to ask a follow-up.
It seems like it wouldn't be that hard to create an indexed tar.gz format that's backwards compatible.
One way would be to use the last file in the tar as the index. As files are added, you remove the index, append the new file, append its basic metadata and compressed offset (maybe of the deflate chunk) to the index, update the index size in bytes in a small footer at the end of the index, and append the index back onto the compressed tar.
You can retrieve the index by starting at the end of the compressed archive and reading backwards until you find a deflate header (at most 65k plus a few more bytes, since that's the size of a deflate chunk). If it's an indexed tar, the last file will be the index, and the end of the index will be a footer with the index size (so you know the maximum you'll need to seek back from the end). This isn't extremely efficient, but it is limited in scope, and helped by knowing the index size.
You could verify the index by checking some or all of the reported file byte offsets. The worst case is many small files, one or more per deflate chunk, where you would have to visit each chunk. That makes the worst case equivalent to listing the files in an un-indexed tar.gz, plus the (relatively small) overhead of locating and reading the index.
Uncompressing the archive as a regular tar.gz would work normally, with one additional file (the index) included.
I imagine the reason this isn't popular is not that it hasn't been done, but that most people don't really need an index.
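Ignoring the gzip layer for brevity, the last-file-as-index idea can be sketched with stdlib tarfile. Offsets here are into the uncompressed tar stream (the scheme above would store deflate-chunk offsets instead), and the index member name `.index.json` is made up for the example:

```python
import io, json, tarfile

def build_indexed_tar(files):
    """Uncompressed-tar sketch: the last member is a JSON index mapping
    each file name to the byte offset of its tar header."""
    buf = io.BytesIO()
    index = {}
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            index[name] = buf.tell()           # offset of this member's header
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
        blob = json.dumps(index).encode()
        info = tarfile.TarInfo(".index.json")  # hypothetical index name
        info.size = len(blob)
        tar.addfile(info, io.BytesIO(blob))
    return buf.getvalue(), index

data, index = build_indexed_tar({"a.txt": b"hello", "b.txt": b"world"})

# Jump straight to b.txt via the index instead of scanning from the start.
view = io.BytesIO(data)
view.seek(index["b.txt"])
with tarfile.open(fileobj=view, mode="r|") as tar:  # forward-only stream read
    member = tar.next()
    assert member.name == "b.txt"
    assert tar.extractfile(member).read() == b"world"
```

A regular extraction still sees a normal tar, just with the extra index member at the end, which is the backwards-compatibility property the comment describes.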
The tar.gz format combines a collection of files into a single tar file first and then compresses that one file into a gz. It does not compress the individual files and then combine the compressed files into a tar. Accessing the last file requires decompressing the whole archive. What you described is the zip format. Both zip and gz use the DEFLATE algorithm, so there's no difference in compression; the difference is in how the files are packaged.
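The packaging difference is easy to see with the standard library: the zip reader can pull out one member directly, while the tar.gz reader has to decompress everything in front of it:

```python
import io, tarfile, zipfile

files = {"first.txt": b"A" * 1000, "last.txt": b"hello"}

# tar.gz: tar the files into one stream, then gzip that single stream.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tar:
    for name, content in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))

# zip: each member is deflated independently, with a central directory.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, content in files.items():
        zf.writestr(name, content)

# Reading one member of the zip touches only that member's bytes...
with zipfile.ZipFile(zip_buf) as zf:
    assert zf.read("last.txt") == b"hello"

# ...while reaching the last tar member decompresses everything before it.
with tarfile.open(fileobj=io.BytesIO(tar_buf.getvalue()), mode="r:gz") as tar:
    assert tar.extractfile("last.txt").read() == b"hello"
```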
Also, gzip supports on-the-fly stream-based compression and decompression, so anything that requires jumping to the end of the file to read or write during compression/decompression is a no-go.
After writing one myself, along with a guy on IRC who needed a "quick way to access huge .tar files", I found one written in Python: https://github.com/devsnd/tarindexer
He ended up doing the smart thing and changing his archiving program to not use .tar files, which solved all of his problems, but for the 2TB of data he already had this worked rather well.
If I recall correctly (and it's been a while): the initial header is some number of bytes from the beginning of the file, or you have to search for a known signature string ("PK" something). Then you have to go back and forth between that and the compressed data as you decompress. When compressing files, you have to go back to the beginning of the zip file (or disk 1 in a multi-disk archive) and update that table with the CRC info. It just wasn't practical for streaming, for multiple reasons. Remember the original zip file format was created by Phil Katz in about 1987. I believe they just thought it easier to start over with a better design for gzip, using the same compression algorithm.
One community that does use this technique (although the index is in a separate file) is the web archiving community's WARC file format. The Internet Archive Wayback Machine has ~ 13 petabytes of web archive stored that way, and you use it every time you look at a webpage in the Wayback. The file format is what you'd expect: each webpage is gzipped separately, the output is catted together, and the index stores the offset and length of each individual compressed webpage.
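The WARC trick is easy to reproduce with stdlib gzip, since concatenated gzip members are themselves a valid gzip stream:

```python
import gzip

records = [b"<html>page one</html>", b"<html>page two</html>"]

# Gzip each record separately and cat the outputs together, WARC-style.
archive = b""
index = []                  # (offset, length) per record
for rec in records:
    blob = gzip.compress(rec)
    index.append((len(archive), len(blob)))
    archive += blob

# Random access: slice one member out by offset/length, decompress it alone.
off, length = index[1]
assert gzip.decompress(archive[off:off + length]) == records[1]

# The concatenation is still a valid gzip stream end to end.
assert gzip.decompress(archive) == b"".join(records)
```

This is why the Wayback Machine only needs an offset and a length per page to serve any of its petabytes of captures.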
The last few days I find myself wondering if there needs to be some kind of org set up to preserve this sort of info.
Right now it seems to be strewn across a myriad of blogs, forums and whatnot that risk going poof. And even if the Internet Archive picks them up, it is anything but curated (unlike, say, Wikipedia, even with all its warts).
The Internet Archive has both curated and non-curated collections. Right now the Wayback Machine doesn't have any curation, but that doesn't mean it's going to stay that way.
However, in this situation, the collaboration between Wikipedia and the Wayback may be what you had in mind. We're working with Wikipedia to make sure all external links from Wikipedia articles are backed up in the Wayback Machine, and there's a Wikipedia bot going around adding these links to articles.
My father teaching me to type PKUNZIP on files that "ended with .zip" in the DOS shell (not long before the Norton Commander kind of GUI arrived on our computer) is one of my earliest memories as a toddler. I would ask him "What does it mean?" and he would simply not know. It was 1990 and I was 3 and a half, I think. When I learned what it stood for, it was kind of epic for me.
It is rare to be able to have a question answered so completely and from such a first-hand source. This post is gold and tickles me in all the right places.
StackOverflow is sitting on a veritable treasure trove of knowledge.
One important difference in practice is that zip files need to be saved to disk to be extracted. gzip files, on the other hand, can be stream-unzipped, i.e. curl http://example.com/foo.tar.gz | tar zxvf - is possible, but not with zip files. I am not sure if this is a limitation of the unzip tool. I would love to know if there is a workaround for this.
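tarfile's pipe modes show the same property in Python: "r|gz" reads strictly forward, so a non-seekable source works (the NonSeekable wrapper below is a stand-in for a socket or pipe):

```python
import io, tarfile

class NonSeekable(io.RawIOBase):
    """Forbids seeking, standing in for a socket or pipe."""
    def __init__(self, payload):
        self._inner = io.BytesIO(payload)
    def readable(self):
        return True
    def readinto(self, b):
        chunk = self._inner.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)
    def seekable(self):
        return False

# Build a small tar.gz in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo("hello.txt")
    body = b"streamed!"
    info.size = len(body)
    tar.addfile(info, io.BytesIO(body))

# "r|gz" never seeks, so the non-seekable source is fine; "r:gz"
# wants random access and would fail on it.
with tarfile.open(fileobj=NonSeekable(buf.getvalue()), mode="r|gz") as tar:
    for member in tar:
        print(member.name, tar.extractfile(member).read())
```

zip can't be consumed this way because the authoritative file table (the central directory) sits at the end of the archive.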
It even allows you to stream the zip while creating it, i.e. you do not need to know the compressed size of the files before starting to write the compressed file content.
You can't do that with a .zip file because the file header information is actually at the end of the file. You could work around that by sending the header first, though.
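The layout is easy to confirm with a few bytes of inspection: a zip starts with a local-file-header signature, and the end-of-central-directory record sits at the very end:

```python
import io, zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"hello zip")
data = buf.getvalue()

# Local file header first, central directory (and its end record) last.
assert data[:4] == b"PK\x03\x04"     # local file header signature
assert b"PK\x05\x06" in data[-100:]  # end-of-central-directory signature
```

That end record is what readers locate first, which is why a zip generally has to be fully on disk (or at least seekable) before it can be listed.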
That's funny; it was on hold when I saw it last, a few hours ago. Apparently whoever happens to be around with mod powers decided to hate fun a little less than usual. I'm sure it won't last.
I feel Stack Overflow mods have, to at least some degree, a different purpose on the site than the average user. Far too many times, I find that the answers judged bad by mods, or close-voted, are the ones that ended up helping me. Isn't that the point?
Stack Overflow's current objective is to answer precise technical questions related to programming. Open-ended questions are generally frowned upon because they tend to be argumentative and generate debate, which SO is not well equipped to deal with. In the very beginning of Stack Overflow, the goals weren't as clearly defined, and because of the overwhelming need for such a place, it quickly became flooded with any question related to computers. This is why Server Fault and Super User were created, with clearer rules regarding their respective content.
I am not sure why this question made it through, but "what do they have in common" generates a finite set of answers while "what are their differences" does not. The "real" place to ask such questions should be programmers.stackexchange.com. Unfortunately, it has 30 times fewer users and, frankly, no one really cares about it. Moderators should use "Migrate question to [...]" instead of simply closing it.
I think everyone will agree that a question with 500 upvotes and 150k page views is interesting. The real problem is that SO has not managed to find a home for it anywhere.
I wonder if that's a bit of snark about how, if he were to actually edit the Wikipedia entry on the topic, his edits would be removed because of Wikipedia's policies on primary sources contributing to an entry.
> This post is packed with so much history and information that I feel like some citations need be added incase people try to reference this post as an information source. Though if this information is reflected somewhere with citations like Wikipedia, a link to such similar cited work would be appreciated. - ThorSummoner
> I am the reference, having been part of all of that. This post could be cited in Wikipedia as an original source. – Mark Adler
Apple's FSE implementation is based on open-source work by an individual, which predates lzfse by about 20 months.
https://github.com/Cyan4973/FiniteStateEntropy
It's even less optimized than the original, due to a few errors that slipped through during lzfse's development.
http://encode.ru/threads/2221-LZFSE-New-Apple-Data-Compressi...
If this were reddit I'd post the hot fire gif. Eh, here it is anyway: http://i.imgur.com/VQLGJOL.gif
Also, as others mention, NASA mission engineer: https://en.wikipedia.org/wiki/Mark_Adler#Mars_exploration
Then I read the name. Adler, as in Adler-32 (the checksum function that zlib uses).
Then I knew it was real. The darn author of zlib answered the question. That's as close as you're gonna get to "primary source" folks!
https://users.cs.jmu.edu/buchhofp/forensics/formats/pkzip.ht...
https://en.wikipedia.org/wiki/Web_ARChive
http://stackoverflow.com/questions/6493270/why-is-tar-gz-sti...
https://www.youtube.com/watch?v=_zvFeHtcxuA
The whole "The BBS Documentary" is great and I recommend starting at the beginning if you're interested in it.
https://www.youtube.com/watch?v=dRap7uw9iWI
https://www.youtube.com/watch?v=3v_zlyHgazs