item 39871914

Xz format inadequate for long-term archiving (2016)

101 points | oriettaxx | 1 year ago | nongnu.org

25 comments

[+] st_goliath|1 year ago|reply
This article is originally from 2016, not 2022, as was also pointed out when it was last posted circa 11 hours ago.

2024: https://news.ycombinator.com/item?id=39868810

2022: https://news.ycombinator.com/item?id=32210438

2019: https://news.ycombinator.com/item?id=20103255

2018: https://news.ycombinator.com/item?id=16884832

2016: https://news.ycombinator.com/item?id=12768425

It is utterly unrelated to the ongoing issue (which I'm 100% sure is the reason this is on the front page for a second time within 12 hours right now).

The author is also the developer of lzip; while the article does have some valid points, one should apply a necessary level of skepticism ("Competitor's product is really bad and unfit," says vendor).

[+] Lockal|1 year ago|reply
While you are saying that it is utterly unrelated, I'll respond with: not only is it related, it is a root cause of the current situation.

We have a single massively used implementation (also known as the reference), which does not even follow the xz format specification. It is autohell-based, which is what allowed the m4 injection. In other places it is CMake-based, but reuses autohell-style code, inserting hidden dots where they could just use https://cmake.org/cmake/help/latest/module/CheckSymbolExists...

The "poor design" is still there, at the core of the library. It is as cool as the 64-bit square roots and 128-bit divides in Bcachefs [1].

It is as cool as lbzip2[2] with their:

  (((0xffffaa50 >> ((as < 0x20 ? as : (as >> 4)) & 0x1e)) & 0x3) + (as < 0x20 ? 1 : 5))
Spoiler: this lbzip2 code produces corrupted files in some cases. Should we care whether it is a backdoor? Or, as usual, do we disable optimizations, disable Valgrind, disable fuzzers, and say that everything is ok?

[1] https://www.phoronix.com/news/Linux-6.9-Bcachefs-Attempt

[2] https://github.com/kjn/lbzip2/blob/b6dc48a7b9bfe6b340ed1f6d7...
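For what it's worth, the quoted expression can be transliterated and tabulated to see how opaque it is (a sketch only; `as` is renamed `as_` since it's a Python keyword, and no claim is made here about what the value means inside lbzip2):

```python
def lbzip2_expr(as_: int) -> int:
    """Direct transliteration of the lbzip2 expression quoted above."""
    shift = (as_ if as_ < 0x20 else as_ >> 4) & 0x1E
    return ((0xFFFFAA50 >> shift) & 0x3) + (1 if as_ < 0x20 else 5)

# Tabulate a few inputs to see what the bit-twiddling actually yields.
for as_ in (0, 4, 8, 16, 0x20, 0x40, 0x100):
    print(f"as = {as_:#5x} -> {lbzip2_expr(as_)}")
```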

[+] tex0|1 year ago|reply
Thank you for reposting. I don't want to start bashing XZ, but I honestly wonder why it's been picked up so much despite the valid criticism.

As if the compression-ratio increase at medium levels were so significant over bzip2 or gzip.
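For a rough feel for the ratios in question, all three codecs ship in the Python stdlib and are easy to compare (a sketch; the sample data and levels here are arbitrary, and real inputs will behave differently):

```python
import bz2
import lzma
import zlib

# Some mildly redundant sample data; real inputs will behave differently.
data = b"The quick brown fox jumps over the lazy dog. " * 2000

for name, compress in [
    ("gzip (zlib, level 6)", lambda d: zlib.compress(d, 6)),
    ("bzip2 (level 9)", lambda d: bz2.compress(d, 9)),
    ("xz (lzma, preset 6)", lambda d: lzma.compress(d, preset=6)),
]:
    out = compress(data)
    print(f"{name:22s} {len(out):6d} bytes ({len(out) / len(data):.2%})")
```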

[+] mort96|1 year ago|reply
I never really understood the point of this article. The way to have reliable long-term storage isn't to use compression formats that are resilient against bit flips; none of them really are. The way to have reliable long-term storage is to protect against bit flips with checksums and redundancy.

These are alright criticisms of the xz format, but nothing that would make me wary of using xz for long-term archival.
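That "checksums and redundancy" approach is simple enough to sketch (a toy illustration with hypothetical `protect`/`recover` helpers; real archival setups would reach for par2, ZFS scrubbing, or similar rather than whole-file copies):

```python
import hashlib

def protect(data: bytes, copies: int = 3):
    """Store several copies of the data plus a checksum of the original."""
    return [data] * copies, hashlib.sha256(data).hexdigest()

def recover(stored, digest):
    """Return the first copy whose checksum still matches, surviving
    bit flips in any minority of the stored copies."""
    for copy in stored:
        if hashlib.sha256(copy).hexdigest() == digest:
            return copy
    raise ValueError("all copies corrupted")

copies, digest = protect(b"precious archive bytes")
copies[0] = b"precious archive byteX"  # simulate a bit flip in one copy
assert recover(copies, digest) == b"precious archive bytes"
```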

[+] livrem|1 year ago|reply
I liked that the article begins with the Hoare quote, "One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies", which was also the exact first thing I thought of when reading about the backdoor mess yesterday.
[+] lifthrasiir|1 year ago|reply
That's irrelevant to the file formats themselves, however. A simple file format doesn't automatically guarantee a simple implementation, especially for compression algorithms.
[+] cess11|1 year ago|reply
Over the past two or three decades of computing I've only noticed xz in deb packages.

Who's using xz as a long-term archiving or general-purpose compression format? TFA doesn't give any examples of this, only that the Debian project picked it for packages, presumably because it creates smaller files than gzip and bzip2. And apt packages are arguably very ephemeral; if a compressed file turned out bad or bitrotted before being culled from the repo, it's likely to be discovered and fixed easily.

I work in e-archiving, with public sector clients, where we actually do long-term storage of whatever data format. Never seen xz used there.

[+] lifthrasiir|1 year ago|reply
Does e-archiving use bzip2? I think xz was mostly advertised (and successfully adopted) as a substitute for bzip2, because bzip2 was even slower than xz.
[+] tutfbhuf|1 year ago|reply
I wonder why no one has mentioned Zstandard yet. It is 10x faster to decompress than most other formats and offers comparable compression ratios.
[+] user20180120|1 year ago|reply
Why is the Long Range Zip (lrzip) compression format not used? It gives better compression than xz when using the correct switches.
[+] JoshTriplett|1 year ago|reply
zstd has a long-range mode, and is far more widely used.
[+] m463|1 year ago|reply
Took me a while to realize the tar format doesn't have any CRC/checksum over file data (each tar header carries only a simple checksum of the header itself).

(Usually the compression format wrapped around it has one, though.)
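For instance, gzip stores a CRC-32 of the uncompressed data in its trailer, so corruption is caught at decompression time; a minimal sketch with Python's stdlib, corrupting the stored CRC-32 to trigger the check:

```python
import gzip

payload = b"file contents that we care about" * 100
blob = bytearray(gzip.compress(payload))

blob[-8] ^= 0x01  # flip a bit in the stored CRC-32 in the gzip trailer

try:
    gzip.decompress(bytes(blob))
except OSError as exc:  # gzip.BadGzipFile: CRC check failed
    print("corruption detected:", type(exc).__name__)
```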

[+] lifthrasiir|1 year ago|reply
I can see why the lzip author is frustrated with the xz format, especially given how many checksums and paddings there are in it, but the lzip format goes to the opposite extreme.

7-zip already had a concept of multiple filters, which contributes to its efficiency, and the underlying design of xz captures them without much complication. For example, filters in the original 7-zip format (or "codecs") can have both multiple input and output streams [1]. This makes less sense for a single-file compressor, and xz carefully avoided them. The main problem with the xz format is not its concept but its concrete implementation: you don't need extensibility, you only need agility.

In comparison, lzip is too minimal. It might be technically agile thanks to its version field, but it isn't if you do nothing and merely claim that you are open to any addition. It is not hard to pick some filters and mandate only the most useful combinations of them. The stream could also have been periodically interrupted to give an early chance to detect errors before the member footer. (Unless lzip natively produces a multimember file even for a single input, which is AFAIK not the case.) The lzip author claims that a corruption in the compressed data can be detected from the decompression process itself, but that would require too much redundancy in the compressed data, so this claim is clearly misguided. And what the heck is that dictionary size coding? Compressed formats frequently make use of exponent-mantissa encodings, but I have never seen one where the mantissa is subtracted.
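For the curious, my reading of the lzip manual is that the dictionary-size byte encodes a power-of-two base from which up to seven sixteenths are subtracted; a sketch of that decoding (treat the field layout as an assumption based on that reading, not as authoritative):

```python
def lzip_dict_size(ds: int) -> int:
    """Decode lzip's dictionary-size byte, assuming bits 4-0 give a base-2
    exponent and bits 7-5 give how many 1/16 'wedges' to subtract."""
    base = 1 << (ds & 0x1F)
    wedges = ds >> 5
    return base - wedges * (base // 16)

print(lzip_dict_size(0x0C))  # base 4096, nothing subtracted
print(lzip_dict_size(0xAD))  # base 8192 minus five sixteenths
```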

Of course, both should be avoided at this point because zstd is fast and efficient enough. Also, the file format for zstd is better than both in my opinion.

(I've posted the same comment in the older thread, and I also posted my summary of all three file formats so that you can feel what I'm talking about: https://news.ycombinator.com/item?id=39873112)

[1] https://py7zr.readthedocs.io/en/latest/archive_format.html#c...

[+] Bulat_Ziganshin|1 year ago|reply
AFAIK, 7-zip filters can't have multiple inputs (at the encoding stage).

Multiple outputs are necessary for filters that emit multiple independent data streams, such as BCJ2, and they are equally useful for archivers and compressors.

(I'm the author of FreeArc, another archiver, and of multiple compression algorithms.)

PS: thank you for the format comparison; it would be great to put the xz format description onto its Wikipedia page. I already used your description to understand why the attackers added 8 "random" bytes to one of their scripts: probably to "fix" the CRC-64 value.

[+] tgz|1 year ago|reply

[deleted]