
Zstandard RFC 8878

102 points | itroot | 4 years ago | datatracker.ietf.org

43 comments


felixhandte|4 years ago

As someone on the Zstd team, I'm always happy to see it on HN! I'm curious though what motivates the submission?

kzrdude|4 years ago

Zstd is always interesting.

For many applications (file formats), ubiquity is important, so it would be fun if zstd becomes ubiquitous and can be relied on to be available. Let's say for example in future versions of HDF (HDF5 or later).

thriftwy|4 years ago

Zstandard has a very cool dictionary-training feature, which allows you to keep a separate dictionary and get roughly 50% compression on very small (~100 B) but repetitive data such as database records.
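A minimal sketch of the idea, using the preset-dictionary support in Python's stdlib zlib as a stand-in for zstd (zstd's `--train` builds a far better dictionary from many sample records; the record contents here are made up):

```python
import zlib

# Hypothetical shared dictionary: the boilerplate common to all records,
# stored once, out of band.
dictionary = b'{"user_id": , "status": "active", "region": "us-east-1"}'

record = b'{"user_id": 1004, "status": "active", "region": "us-east-1"}'

# Without a dictionary, a ~60-byte record barely compresses.
plain = zlib.compress(record)

# With the preset dictionary, most of the record becomes a
# back-reference into the dictionary.
c = zlib.compressobj(zdict=dictionary)
with_dict = c.compress(record) + c.flush()

# The same dictionary is needed to decompress.
d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(with_dict) == record

print(len(record), len(plain), len(with_dict))
```

The dictionary-backed output is substantially smaller than the plain one, and the savings grow with how much boilerplate the records share.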

Taywee|4 years ago

I've always thought it could be pretty cool to leverage that for transparent filesystem compression.

For context, filesystem compression usually compresses blocks of data individually (for instance, every 64K block of a file will be individually compressed, and when you modify a file in the middle, that block needs to be recompressed entirely). This is usually good enough, and it has some pretty cool properties, like being able to have only the compressible parts of a file compressed, or turning on compression on a file and having only new and rewritten blocks get compressed. Because of Zstd's separated dictionary, it seems like it could be feasible to instead store the dictionary in the file's inode and compress the blocks with that dictionary (recomputing the dictionary and recompressing existing blocks when the file allocates 10 4K blocks and then again at 100 blocks, perhaps).

I wonder what different properties such a compression scheme would have. I imagine it would be able to achieve a much smaller size due to not having to build up context from scratch in each compressed block. A downside would be that a corrupted or overwritten inode would render the file completely unrecoverable, where current compression schemes allow blocks to be individually decompressed. Another downside is that files couldn't be partially compressed, only entirely.
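The scheme described above can be mocked up in a few lines, again using zlib's `zdict` as a stand-in for a per-file zstd dictionary (the block contents and dictionary choice are made up for illustration):

```python
import zlib

# Fake file contents: blocks share structure but differ in detail.
blocks = [
    b'{"seq": %d, "level": "INFO", "msg": "request handled", "host": "db-%02d"}'
    % (i, i % 4) * 64
    for i in range(10)
]

# Per-file "dictionary", as if stored once in the inode.
shared_dict = blocks[0][:2048]

def compress_block(block):
    c = zlib.compressobj(zdict=shared_dict)
    return c.compress(block) + c.flush()

def decompress_block(blob):
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob)

compressed = [compress_block(b) for b in blocks]

# Each block is still independently decompressible -- but only as long
# as the shared dictionary survives, which is exactly the trade-off
# described above: lose the inode, lose every block.
assert all(decompress_block(c) == b for c, b in zip(compressed, blocks))
```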

vlovich123|4 years ago

Is there a reason zstd isn’t popular for HTTP and only brotli and gzip see adoption?

zinekeller|4 years ago

Because Facebook doesn't have a browser.

(But seriously, Mozilla engineers warned the Chrome team that they were rushing the inclusion of Brotli while the compression wars were heating up. They proceeded anyway, which is unsurprising.)

lifthrasiir|4 years ago

Brotli was arguably designed specifically for the web: it was originally used in the WOFF2 font format, and it ships a large preset dictionary collected from the web (including HTML, CSS and JS fragments). Zstandard had no such consideration, and while it could be as efficient as Brotli with the right dictionary, it has less merit than Brotli in the web context.

duskwuff|4 years ago

Brotli has some pretty wild optimizations for web content, including a gigantic (~120 KB) predefined dictionary packed full of sequences commonly used in HTML/JS/CSS content. This gives it a huge advantage on small text files.

jhgb|4 years ago

I assume it's because it's very new? That would seem like an obvious explanation.

jeffbee|4 years ago

Does this mean the Zstd magic number is now cast in stone?

lifthrasiir|4 years ago

You may have confused Brotli (whose file format has no magic number, which prevents easy identification) with Zstandard (whose file format does have defined magic numbers: 28 B5 2F FD, or [50-5F] 2A 4D 18 for skippable frames).
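For what it's worth, those magics make identifying a Zstandard stream trivial (a sketch; per RFC 8878, 28 B5 2F FD opens a normal frame and [50-5F] 2A 4D 18 a skippable frame, both shown in byte order on the wire):

```python
ZSTD_FRAME_MAGIC = b"\x28\xb5\x2f\xfd"

def is_zstd_frame(header: bytes) -> bool:
    """Check the first four bytes against the RFC 8878 frame magics."""
    if len(header) < 4:
        return False
    if header[:4] == ZSTD_FRAME_MAGIC:
        return True
    # Skippable frames: first byte in 0x50..0x5F, then 2A 4D 18.
    return 0x50 <= header[0] <= 0x5F and header[1:4] == b"\x2a\x4d\x18"

assert is_zstd_frame(b"\x28\xb5\x2f\xfd\x00\x00")
assert is_zstd_frame(b"\x5a\x2a\x4d\x18\x00\x00")   # skippable frame
assert not is_zstd_frame(b"\x1f\x8b\x08\x00")       # gzip, not zstd
```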

stouset|4 years ago

Can you shed some light on why this might be something of concern?

wmf|4 years ago

The file format was finalized years ago, so yes.

ggm|4 years ago

It's said to be a good fit for ZFS. I tend to use lz4 because it's baked into the older systems I use, but it may be at a point where my default should be zstd.

bz2/gz still predominates for compressed objects in filestore from what I can see.

xoa|4 years ago

>It's said to be a good fit for ZFS. I tend to use lz4 because it's baked into the older systems I use, but it may be at a point where my default should be zstd.

It is, and you should definitely at least give it a look. I posted a comment mentioning it the other day in the OpenZFS 2.0 thread [0], and it also came up recently on HN in a thread linked there, but there are some interesting performance graphs comparing different standards in the GitHub PR for zstd in ZFS [1]. LZ4 still has its place IMO; ZFS isn't run on (nor only good for) heavy metal exclusively, and people use it to good effect on the likes of RPis as well. Sometimes CPU cycles are still the limiter, or every last one is needed elsewhere. I also think it matters a lot less on spinning rust, where $/TB tends to be so much lower. How much one gets out of it is also influenced by the application: a general NAS with a larger record size is going to see different gains vs a database. But with even vaguely modern desktop CPUs (and their surfeit of cores) and SSDs, particularly in dedicated network storage devices, even an extra 10-30% is worth a lot, and there's usually plenty of CPU to throw at it. Even more so if primary usage is limited to only a 10-50 Gbps connection.

As always though probably best if you can benchmark it with your own stuff and play around a bit pulling different levers. ZFS is nice that way too since it's so easy to create a bunch of different test FS at the same time.
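Since compression is a per-dataset property in ZFS, side-by-side benchmarking is a few one-liners; a sketch assuming an OpenZFS 2.0+ system (the pool name "tank" is a placeholder):

```shell
# Create parallel test filesystems, one per compressor.
zfs create -o compression=lz4     tank/bench-lz4
zfs create -o compression=zstd    tank/bench-zstd
zfs create -o compression=zstd-19 tank/bench-zstd-19

# Copy the same workload into each, then compare achieved ratios.
zfs get compressratio tank/bench-lz4 tank/bench-zstd tank/bench-zstd-19
```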

----

0: https://news.ycombinator.com/item?id=29268907

1: https://github.com/openzfs/zfs/pull/9735#issuecomment-570082...

buryat|4 years ago

I forgive Facebook all their abuses just because they gave us zstd.

oofbey|4 years ago

I think you should read more about Facebook. Try e.g. the Damian Collins email dump, and read about how their Android app tricked people into letting it record all phone-call and text-message records, knowing full well users would hate it if they found out.

Clearly they produce good technology. But the company is morally bankrupt.

metafex|4 years ago

They didn't, though. Zstd has been around since before the main dev joined FB; I distinctly remember it being hosted under the person's personal GitHub account.

prirun|4 years ago

Yann Collet developed Zstandard first, on his own, then Facebook hired him and Zstandard went along with him.

m0zg|4 years ago

Zstd is an amazing bit of work and all I ever use for data compression nowadays (or LZ4 when speed is even more critical). Several times the compression/decompression speed of gzip, approximately the same compression ratio with default settings.

It's also supported by tar in recent Linux distros, if zstd is installed, so "tar acf blah.tar.zst *" works fine, and "tar xf blah.tar.zst" works automatically as well. Give it a try, folks, and retire gzip shortly afterwards.

nigeltao|4 years ago

> Several times the compression/decompression speed of gzip

Just be careful that you're comparing against the best implementation of gzip. One recent re-implementation of zcat was 3.1x faster than /bin/zcat (and the CRC-32 implementation within was 7.3x faster than /bin/crc32). Both programs decode exactly the same file format. They're just different implementations. For details, see: https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...

mjevans|4 years ago

I get why someone might want to avoid .zstd, but that is the short name offered for humans.

Was .zs not sufficient if a file format ending in 'std' is so abhorrent?

erichocean|4 years ago

Does Zstandard still have the junk Facebook license attached to it?

lifthrasiir|4 years ago

Not since 1.3.1.