top | item 26459184

Why are tar.xz files 15x smaller when using Python's tar compared to macOS tar?

505 points| personjerry | 5 years ago |superuser.com

150 comments

[+] gwern|5 years ago|reply
Sorting the files in an archive is a great trick to know for efficiency ( https://www.gwern.net/Archiving-URLs#sort-key-compression-tr... ); the more redundant your files are, the better it works and the savings can be enormous. Two near-duplicate files next to each other in a compressor's stream are basically free; space them by a few dozen megabytes, and you pay full-price...
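A minimal sketch of the sorted-tar trick, assuming GNU tar (macOS's bsdtar has no `--sort` flag) and the `xz` CLI; the file names and contents are made up:

```shell
# Sorting by name puts near-duplicate files next to each other in the
# compressor's input stream, where back-references are nearly free.
mkdir -p data
printf 'the same content\n' > data/a1.log
printf 'the same content\n' > data/a2.log
tar --sort=name -cf archive.tar data/
xz -f archive.tar        # replaces archive.tar with archive.tar.xz
```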
[+] LeoPanthera|5 years ago|reply
A good ugly hack is to reverse the filenames before sorting them, which has the side effect of grouping all the same file extensions together.
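The reversal hack is a one-liner with `rev` (util-linux); the file names here are illustrative:

```shell
# Reversing before sorting makes the extension the primary sort key,
# so files of the same type end up grouped together.
printf '%s\n' notes.txt main.c readme.txt util.c | rev | sort | rev
# → util.c main.c readme.txt notes.txt (extensions grouped)
```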
[+] tayo42|5 years ago|reply
Isn't the whole point of compression algorithms to find strings of bytes that are the same? Why do they struggle if those strings are out of order or not next to each other? It seems simple to detect and handle.
[+] jasode|5 years ago|reply
Fyi... not mentioned explicitly in that Q&A is the related concept : https://en.wikipedia.org/wiki/Solid_compression

[EDIT to reply] : >The difference was just the order in which the files were concatenated.

Yes, right. My link just emphasizes the file-concatenation aspect (tar being one example) for people who base their intuition on the ZIP format, in which file order doesn't matter because each file is compressed separately.

[+] simias|5 years ago|reply
Note that in both cases the archive was first put into a single tar file, then compressed. That's how all tar formats function. The difference was just the order in which the files were concatenated.
[+] kzrdude|5 years ago|reply
This reminds me of discovering that some of my backups had bad blocks on the disk. The backup is a contiguous .tar.lzo archive of around 50 GB; sometimes solid archives like that are not the most robust option... Fortunately I have other full snapshots.
[+] userbinator|5 years ago|reply
If I compare the two tar archives directly, they seem different

I like the fact that this user considered whether the filesystem was lying about the size.

It's interesting to see the problem-solving approach; my next step would be to hexdump both and see if one isn't actually compressed as much as it could be, then decompress and inspect the tars.

I'm not sure whether it's true here, but one situation that leads to seemingly randomly-ordered files is when the archiver tries to be faster by using multiple threads.

The XZ blocksize is another variable to consider.

[+] 0xdky|5 years ago|reply
AFAIK, compression works by de-duplication, and finding duplicate patterns is limited to a window.

If similar patterns are close to each other, there is a higher probability of finding such duplicates in the window leading to better compression.

When the files are not sorted, files with similar patterns end up randomly distributed beyond the compression window, leading to poor compression.

If there is an option to increase the window size, that would be a good experiment.

This is very similar to the `git repack` window and depth parameters: the larger the window and depth, the better compressed the resulting packs.

I wonder if a sort based on diffs (grouping similar files together) would help get the best compression. The cost of such sorting might outweigh the benefits.
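For reference, the `git repack` knobs mentioned above can be tried on a toy repository (the values 250/50 are just examples; larger values cost more CPU):

```shell
# Larger --window/--depth let delta compression consider more candidate
# objects and build longer delta chains, usually shrinking the pack.
git init -q repack-demo
git -C repack-demo config user.email you@example.com
git -C repack-demo config user.name demo
seq 1 2000 > repack-demo/data.txt
git -C repack-demo add data.txt
git -C repack-demo commit -qm v1
seq 1 2100 > repack-demo/data.txt
git -C repack-demo commit -qam v2
git -C repack-demo repack -a -d -f --window=250 --depth=50
ls repack-demo/.git/objects/pack/
```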

[+] fuzzy2|5 years ago|reply
That’s not the case here. What you describe is basically compression in “solid blocks”. `xz`, however, is (in practice virtually always) fully solid: it allows greater compression efficiency at the expense of random access, which becomes impossible.

The order matters because it “defines” what the algorithm’s dictionary will look like.

[+] TheGuyWhoCodes|5 years ago|reply
This isn't really specific to tar. For example if you save data to parquet, sorting the data before persisting it can make a huge difference in the size of the file AND the predicate pushdown performance (as you can skip more pages especially if you sort by the column you filter on)
[+] war1025|5 years ago|reply
Unrelated to the issue in the post, but a few years back we ran into an annoying oddity in the Python zipfile module [1].

By default, when you create a new zipfile object to create a .zip archive with, it stores the data uncompressed.

Only noticed the issue when a 100MB zip file suddenly became 3MB after I had edited one of the files and re-saved the archive.

[1] https://docs.python.org/3/library/zipfile.html
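A minimal sketch of the fix (the archive and member names are made up):

```python
import zipfile

# ZipFile defaults to ZIP_STORED (no compression); pass ZIP_DEFLATED
# explicitly to get the behavior most people expect.
with zipfile.ZipFile("out.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.txt", "hello " * 1000)

with zipfile.ZipFile("out.zip") as zf:
    info = zf.getinfo("data.txt")
    # deflate shrinks the highly repetitive text dramatically
    print(info.compress_size, "<", info.file_size)
```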

[+] masklinn|5 years ago|reply
Yes, it's clearly documented but very much unexpected the first time 'round.

The reason for the default is probably that python can be built with none of zlib, bzip2 or lzma support:

> If ZIP_DEFLATED, ZIP_BZIP2 or ZIP_LZMA is specified but the corresponding module (zlib, bz2 or lzma) is not available, RuntimeError is raised.

in which case only STORED is available, therefore that's the only possible default; any other would lead to zipfile raising errors out of the box for some people.

It would probably have been a better idea to just not use a default, and require the compression method to always be specified, at least when opening zipfiles with modes other than "r".

[+] Jiocus|5 years ago|reply
This isn't an oddity or issue with Python zip creation, it's expected behaviour as per ISO/IEC WD 21320-1[1].

> The compression method shall be either 0 (“stored”) or 8 (“deflated”).

The format does not prescribe a compression algorithm (many could be available for use). It's merely a container.

[1] ISO/IEC WD 21320-1, Document Container File - Part 1: Core - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39...

[+] nomy99|5 years ago|reply
Also a bit unrelated, but at my workplace we reimplemented the tar algorithm so we could run it in a lambda/state machine. We used S3 multipart upload to help with the file transfer. The performance was phenomenal: on average we were tar'ing 20 GB in about 15 seconds, a lot faster than on my PC.
[+] OskarS|5 years ago|reply
Tar (as opposed to gzip) is almost entirely I/O bound, it's just reading from the filesystem, adding some headers and then outputting it again. Whatever speed your I/O runs at, you will basically tar at that speed as well.
[+] gnulinux|5 years ago|reply
I don't understand this comment. You can run arbitrary code in AWS lambda, you can just run "tar". Why did you need to reimplement it?
[+] chrisdbanks|5 years ago|reply
Do any compression algorithms use an approximate matching technique like ssdeep to sort the files by similarity before compression starts? It might be slower, but if the compression were 15x better it might be worth it in some scenarios.
[+] thomasmg|5 years ago|reply
Sorting chunks by similarity: commonly used tools don't do that. Most archive tools only sort by file type.

I wrote a tool that chunks the data (into variable-sized blocks, to re-sync when multiple files have different length prefixes, but that's another story), and then sorts the chunks by LSH (locality-sensitive hash). LSH is used by search engines to detect similar text. It can compress directories that contain multiple versions of e.g. source code very well (e.g. trunk, branches). https://github.com/h2database/h2database/blob/master/h2/src/...

I discussed this approach with a researcher in this area in January 2020. AFAIK there is active research in this area, especially for compressing DNA sequences. But he also wasn't aware of papers or research on this for general-purpose data compression.

So, I think this area is largely uncharted. I would be interested (as a hobby side project) to help, if somebody is interested.

Summary:

* Regular compression algorithms use a certain "window size": this can be 32 KB, or 2 GB, or so. But sorting chunks can be faster and can improve the compression ratio.

* Regular compression tools can sort by file name and file type.

* Data de-duplication tools (e.g. enterprise backup tools) chunk the data in a smart way, but then only support exact duplicates (not near duplicates).

* Locality sensitive hashing: it is not used in common compression tools so far AFAIK.
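To make the LSH idea concrete, here's a toy sketch (my own illustration, not the h2 tool's actual code) that computes a minhash-style signature per chunk and sorts chunks by it:

```python
import hashlib

def minhash_signature(chunk: bytes, k: int = 8, num_hashes: int = 4) -> tuple:
    """Toy locality-sensitive signature: the minimum hash over the
    chunk's k-byte shingles, for a few independently seeded hashes."""
    shingles = {chunk[i:i + k] for i in range(max(1, len(chunk) - k + 1))}
    return tuple(
        min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=bytes([seed] * 8)).digest(),
                "big",
            )
            for s in shingles
        )
        for seed in range(num_hashes)
    )

chunks = [b"hello world, version 1", b"totally different data!!", b"hello world, version 2"]
# Sorting by signature tends to place near-duplicate chunks next to
# each other, which a solid compressor can then exploit.
ordered = sorted(chunks, key=minhash_signature)
```

Chunks whose shingle sets overlap heavily tend to get equal or nearby signatures, so near-duplicates usually land adjacent in the sorted order.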

[+] usr1106|5 years ago|reply
Nice trick with the sorting. We use xz-compressed Linux root filesystem images quite a bit. I wonder whether arranging the files in the filesystem image could achieve a significant saving. Unlikely to be a factor of 15, but 20-30% wouldn't be bad either.
[+] mxcrossb|5 years ago|reply
There’s a lot of research in the field of scientific computing aimed at exactly this kind of data. You can definitely gain a lot if your floating-point data is varying slowly.
[+] gspr|5 years ago|reply
Semi-related fun fact: I've experienced a large speedup doing MD5 checksumming with Python's hashlib instead of md5sum(1).
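For reference, the usual hashlib pattern reads the file in large blocks so the C hash code does most of the work (`md5_file` is a made-up name):

```python
import hashlib

def md5_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so memory use stays constant
    regardless of file size."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()
```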
[+] jaynetics|5 years ago|reply
tl;dr: the python lib adds files to the archive sorted by the last modification date by default. this happened to be better in this case, but macOS tar had the same results when using the appropriate sorting flag.

Makes me wonder: considering the speed of modern SSDs, would it make sense for compression tools to try various sortings by default, or compare files up front?

[+] js2|5 years ago|reply
> macOS tar had the same results when using the appropriate sorting flag

They had to install and use GNU tar to gain the `--sort` option. macOS (BSD) tar doesn't have it. (You could emulate the behavior by using `find` and passing the member names in on the command-line.)
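The `find` emulation looks roughly like this, assuming GNU tar's `-T -` to read member names from stdin (bsdtar has a similar `-T`/`--files-from` option); file names are made up:

```shell
# Feed an explicitly sorted file list to tar instead of letting it
# walk the directory itself.
mkdir -p data
echo one > data/b.txt
echo two > data/a.txt
find data -type f | sort | tar -cf sorted.tar -T -
tar -tf sorted.tar    # members listed in sorted order
```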

[+] masklinn|5 years ago|reply
> macOS tar had the same results when using the appropriate sorting flag.

No, macOS has the same results when using GNU tar, which provides a sorting option.

bsdtar (which macOS provides out of the box) has no such option. It just stores files in whichever order you provide them (on the command line) or finds them (for recursive tar, whichever order the directory returns them in).

[+] tomatocracy|5 years ago|reply
Another option is to use something like lrzip as the compressor - not sure if it would have helped in this instance but it incorporates a step which looks for long-range redundancy which can make a huge difference in these sorts of scenarios.
[+] JulianWasTaken|5 years ago|reply
Seems better to have some separate higher level metacompression tool which does the heuristic searching across many possible compression formats and parameters.
[+] formerly_proven|5 years ago|reply
Compression is generally CPU-bound for most of the typical algorithms (LZMA, zlib/deflate) unless your drive is very, very slow (<10 MB/s). There are some quite fast algorithms where this could make sense, basically using something like LZ4 as a proxy for compressibility. This is the approach some tools use when you tell them to "compress, but only if it makes sense".
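A sketch of that probe idea, using zlib level 1 as a stand-in for LZ4 (which isn't in Python's standard library); `worth_compressing` and the 0.9 threshold are made up:

```python
import os
import zlib

def worth_compressing(sample: bytes, threshold: float = 0.9) -> bool:
    """Probe with a fast, low-effort pass and skip real compression
    when the sample barely shrinks."""
    return len(zlib.compress(sample, 1)) < threshold * len(sample)

print(worth_compressing(b"abc" * 10_000))    # → True  (highly repetitive)
print(worth_compressing(os.urandom(10_000))) # → False (incompressible)
```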
[+] sgt|5 years ago|reply
Even better when you switch on the --lossy option in gzip.
[+] modderation|5 years ago|reply
You may be interested in Lossy Zip [1]. It's coming up on 22 years old, as of April 1st.

Admittedly, it's somewhat unclear why the format didn't take off. At its core, it uses the Lessis-Moore algorithm to perform a global bit-wise sort of its inputs, then performs a run-length encoding (plus some other magic) on the results to achieve up to 100% compression.

[1] http://web.archive.org/web/20050402203136/http://lzip.source...

[+] gruez|5 years ago|reply
At that point you might as well store your files in /dev/null
[+] LeoPanthera|5 years ago|reply
Although you are joking, gifsicle does have a "--lossy" switch which allows changes to a gif file in order to make the LZW compression more efficient.
[+] ObscureScience|5 years ago|reply
I assume you're joking... I've never heard of lossy compression in these kinds of formats, as there's no general way to determine what can be discarded without affecting the utility of the data. There's no such option in any implementation of gzip I've found...
[+] jdlyga|5 years ago|reply
I use gzip --lossy if I forget to submit a rough draft for school papers. It comes out decent quality, but slightly worse.
[+] Aachen|5 years ago|reply
Was hoping for an Easter egg, but it just says unrecognized option? At least on my phone it does, iirc that's a Debian Buster install that I haven't updated in three years.
[+] dheera|5 years ago|reply
What version of gzip has a lossy option?
[+] zinckiwi|5 years ago|reply
It was the best and worst of times.
[+] Theodores|5 years ago|reply
I want to know how well Brotli fares with this. Browser and server support for Brotli and Brotli with static compression is the future rather than these MS-DOS things from a long time ago.