st_goliath | 1 year ago
2024: https://news.ycombinator.com/item?id=39868810
2022: https://news.ycombinator.com/item?id=32210438
2019: https://news.ycombinator.com/item?id=20103255
2018: https://news.ycombinator.com/item?id=16884832
2016: https://news.ycombinator.com/item?id=12768425
It is utterly unrelated to the ongoing issue (which I'm 100% sure is the reason this is on the front page for the second time within 12 hours right now).
The author is also the developer of lzip. While the article does have some valid points, one should apply a necessary level of skepticism ("Competitor's product is really bad and unfit," says the vendor).
Lockal | 1 year ago
While you are saying that it is utterly unrelated, I'll respond: it is not only related, it is the root cause of the current situation.
We have a single massively used implementation (also known as the reference), which does not even follow the xz format specification. It is autohell-based, which is what allowed the m4 injection. In other places it is CMake-based, but reuses autotools-derived code, inserting hidden dots where it could just use https://cmake.org/cmake/help/latest/module/CheckSymbolExists...
"Poor design" is still there, at the core of the library. It is as cool as the 64-bit square roots and 128-bit divides in Bcachefs[1], and as cool as this lbzip2 code[2].
Spoiler: this lbzip2 code produces corrupted files in some cases. Should we care whether it is a backdoor? Or, as usual, disable optimizations, disable Valgrind, disable the fuzzers, and say that everything is OK?
[1] https://www.phoronix.com/news/Linux-6.9-Bcachefs-Attempt
[2] https://github.com/kjn/lbzip2/blob/b6dc48a7b9bfe6b340ed1f6d7...
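For what it's worth, the CheckSymbolExists module mentioned above replaces the pasted autoconf shell with a one-line probe. A minimal sketch (the symbol and header chosen here are just examples, not anything from xz's build):

```cmake
include(CheckSymbolExists)

# Probe the toolchain directly instead of running generated shell code.
check_symbol_exists(strlcpy "string.h" HAVE_STRLCPY)

if(HAVE_STRLCPY)
  add_compile_definitions(HAVE_STRLCPY=1)
endif()
```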
tex0 | 1 year ago
As if the compression ratio increase at medium levels were so significant over bzip2 or gzip.
mort96 | 1 year ago
I never really understood the point of this article. The way to get reliable long-term storage isn't to use compression formats that are resilient against bit flips; none of them really are. The way to get reliable long-term storage is to protect against bit flips with checksums and redundancy.
These are alright criticisms of the xz format, but nothing that would make me wary of using xz for long-term archival.
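The checksums-and-redundancy point can be sketched in a few lines. This is a minimal illustration, not any real archiving tool: it keeps a SHA-256 digest plus three replicas and recovers by returning any replica whose checksum still matches.

```python
import hashlib


def store(data: bytes):
    """Keep a checksum and several replicas (names here are illustrative)."""
    digest = hashlib.sha256(data).hexdigest()
    replicas = [bytearray(data) for _ in range(3)]
    return digest, replicas


def recover(digest: str, replicas) -> bytes:
    """Return the first replica whose checksum still matches the digest."""
    for replica in replicas:
        if hashlib.sha256(bytes(replica)).hexdigest() == digest:
            return bytes(replica)
    raise IOError("all replicas corrupted")


digest, replicas = store(b"long-term archive payload")
replicas[0][5] ^= 0x01  # simulate a bit flip in one replica
assert recover(digest, replicas) == b"long-term archive payload"
```

With this layer underneath, the compressed payload itself does not need to be bit-flip tolerant; a flipped bit is detected by the digest and repaired from an intact replica.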
livrem | 1 year ago
I liked that the article begins with the Hoare quote: "One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies." That was also the exact first thing I thought of when reading about the mess with the backdoor yesterday.
lifthrasiir | 1 year ago
That's irrelevant to the file formats themselves, however. A simple file format doesn't automatically guarantee a simple implementation, especially for compression algorithms.
cess11 | 1 year ago
Over the past two or three decades of computing I've only noticed xz in deb packages.
Who's using xz as a long-term archiving or general-purpose compression format? TFA doesn't give any examples of this, only that the Debian project picked it for packages, presumably because it creates smaller files than gzip and bzip2. And apt packages are arguably very ephemeral: if a compressed file turned out bad or bit-rotted before being culled from the repo, it's likely to be discovered and fixed easily.
I work in e-archiving, with public-sector clients, where we actually do long-term storage of whatever data format. I have never seen xz used there.
lifthrasiir | 1 year ago
Does e-archiving use bzip2? I think xz was mostly advertised (and successfully adopted) as a substitute for bzip2, because bzip2 was even slower than xz.
tutfbhuf | 1 year ago
user20180120 | 1 year ago
JoshTriplett | 1 year ago
m463 | 1 year ago
(Usually a compression format has it, though.)
lifthrasiir | 1 year ago
I can see why the lzip author is frustrated with the xz format, especially given that there are so many checksums and paddings around, but the lzip format is the opposite extreme.
7-zip already had a concept of multiple filters, which contributes to its efficiency, and the underlying design of xz captures them without much complication. For example, filters in the original 7-zip format (or "codecs") can have both multiple input and multiple output streams [1]. This makes less sense for a single-file compressor, and xz carefully avoided it. The main problem with the xz format is not its concept but its concrete realization: you don't need extensibility, you only need agility.
In comparison, lzip is too minimal. It might be technically agile thanks to its version field, but that means little if you do nothing with it and merely claim to be open to any addition. It is not hard to pick some filters and mandate only the most useful combinations. The stream could also have been periodically interrupted to give an early chance to detect errors before the member footer. (Unless lzip natively produces a multimember file even for a single input, which AFAIK is not the case.) The lzip author claims that corruption in the compressed data can be detected by the decompression process itself, but that would require far more redundancy in the compressed data than is actually there, so this claim is clearly misguided. And what the heck is that dictionary-size coding? Compressed formats frequently use exponent-mantissa encodings, but I have never seen one where the mantissa is subtracted.
Of course, both should be avoided at this point, because zstd is fast and efficient enough. The file format for zstd is also better than both, in my opinion.
(I've posted the same comment in the older thread, along with my summary of all three file formats so that you can see what I'm talking about: https://news.ycombinator.com/item?id=39873112)
[1] https://py7zr.readthedocs.io/en/latest/archive_format.html#c...
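The dictionary-size coding being complained about can be sketched from the lzip manual's description, as I read it: the low 5 bits give a power-of-two base size, and the top 3 bits give how many sixteenths ("wedges") to subtract from that base. This is a hedged reconstruction, not authoritative:

```python
def lzip_dict_size(ds_byte: int) -> int:
    """Decode lzip's one-byte coded dictionary size.

    Bits 4-0: base-2 logarithm of the base size.
    Bits 7-5: number of 1/16 fractions of the base to subtract.
    """
    base = 1 << (ds_byte & 0x1F)
    wedges = ds_byte >> 5
    return base - wedges * (base // 16)


assert lzip_dict_size(0x17) == 8 << 20   # 2^23 = 8 MiB, nothing subtracted
assert lzip_dict_size(0x57) == 7 << 20   # 8 MiB minus 2/16 of it = 7 MiB
```

The subtraction is exactly the oddity noted above: a conventional exponent-mantissa scheme would add the fractional part to the base, not take it away.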
Bulat_Ziganshin | 1 year ago
AFAIK, 7-zip filters can't have multiple inputs (at the encoding stage).
Multiple outputs are necessary for filters that emit multiple independent data streams, such as BCJ2, and they are equally useful for archivers and compressors.
(I'm the author of FreeArc, another archiver, and of multiple compression algorithms.)
PS: Thank you for the format comparison; it would be great to put the xz format description onto its Wikipedia page. I already used your description to understand why the attackers added 8 "random" bytes to one of their scripts: probably to "fix" the CRC-64 value.
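For reference, the CRC-64 that xz stores is easy to reproduce, which is part of why appending chosen bytes to "fix" a checksum is feasible at all. A bit-at-a-time sketch using the reflected ECMA-182 polynomial (slow but clear; real implementations are table-driven):

```python
POLY = 0xC96C5795D7870F42  # reflected ECMA-182 polynomial


def crc64_xz(data: bytes) -> int:
    """Bitwise CRC-64 as used in xz: reflected, init and xorout all-ones."""
    crc = 0xFFFFFFFFFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFFFFFFFFFF


# Catalogued check value for the CRC-64/XZ variant:
assert crc64_xz(b"123456789") == 0x995DC9BBDF1939FA
```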
tgz | 1 year ago
[deleted]