It was really hard to resist spilling the beans about OpenZL on this recent HN post about compressing genomic sequence data [0]. It's a great example of the really simple transformations you can perform on data that can unlock significant compression improvements. OpenZL can perform that transformation internally (quite easily with SDDL!).
That post immediately came to my mind too! Do you maybe have a comparison to share with respect to the specialized compressor mentioned in the OP there?
> Grace Blackwell’s 2.6Tbp 661k dataset is a classic choice for benchmarking methods in microbial genomics. (...) Karel Břinda’s specialist MiniPhy approach takes this dataset from 2.46TiB to just 27GiB (CR: 91) by clustering and compressing similar genomes together.
I'd love to see some benchmarks for this on some common genomic formats (fa, fq, sam, vcf). Will be doubly interesting to see its applicability to nanopore data - lots of useful data is lost because storing FAST5/POD5 is a pain.
On a semi-related note, there was recently a discussion[1] on the F3 file format, which also allows for format-aware compression by embedding the decompressor code as WASM. Though the main motivation for F3 was future compatibility, it does allow for bespoke compression algorithms.
This takes a very different approach, and wouldn't require a full WASM runtime. It does have the SDDL compiler and runtime, though I assume that's a lighter dependency.
As someone seriously trying to develop a compressed archive format with WebAssembly: sandboxing is actually easy, and that's indeed why WebAssembly was chosen. The real problem is determinism, which WebAssembly does technically support but where actual implementations may vary significantly. And even when WebAssembly can be made fully deterministic, function calls made to those WebAssembly modules may still be nondeterministic! I tried very hard to avoid such pitfalls in my design, and it is entirely reasonable to avoid WebAssembly due to these issues.
Specialization for file formats is not novel (e.g. 7-Zip uses BCJ2 prefiltering to convert the relative targets of x86 CALL/JMP instructions into absolute addresses), nor is embedding specialized decoder bytecode in the archive (e.g. ZPAQ did this and won a lot of Matt Mahoney's benchmarks), but I think OpenZL's execution here, along with the data description and training system, is really fantastic.
So, as I understand it, you describe the structure of your data in SDDL and then the compressor can plan a strategy for how best to compress the various parts of the data?
Honestly looks incredible. Could be amazing to provide a general framework for compressing custom formats.
Exactly! SDDL [0] provides a toolkit to do this all with no code, but today it is pretty limited. We will be expanding its feature set, but in the meantime you can also write code in C++ or Python to parse your format. And this code is compression-side only, so the decompressor is agnostic to your format.
Yeah, backend compression in columnar data formats is a natural fit for OpenZL. Knowing the data it is compressing is numeric, e.g. a column of i64 or float, allows for immediate wins over Zstandard.
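To make the "immediate wins" point concrete, here is a minimal sketch of one type-aware transform (delta encoding) applied before a generic backend. It is not OpenZL's implementation: it uses Python's standard `zlib` as a stand-in backend, and the column contents and helper names are invented for the illustration.

```python
import random
import struct
import zlib


def pack_i64(values):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)


def delta_encode(values):
    """Keep the first value, then store each value minus its predecessor."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical column: increasing timestamps with small, irregular gaps.
rng = random.Random(0)
column = []
t = 1_700_000_000_000
for _ in range(10_000):
    t += rng.randint(1, 16)
    column.append(t)

raw_size = len(zlib.compress(pack_i64(column), 9))
delta_size = len(zlib.compress(pack_i64(delta_encode(column)), 9))

print("raw bytes:       ", len(pack_i64(column)))
print("zlib(raw):       ", raw_size)
print("zlib(delta(raw)):", delta_size)
```

A byte-oriented compressor cannot discover on its own that consecutive 8-byte groups are numbers that differ by a small amount; once the deltas are made explicit, the same backend has far less to encode.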
One of the mentioned examples sounds like the compressor is taking advantage of the SDDL by treating row-oriented data as stripes of column-oriented data, and then compressing that. This makes me curious - for data that’s already column-oriented like Parquet, what’s the advantage of OpenZL over zstd?
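The row-to-column reshaping described here can be sketched in a few lines. This is only an illustration under assumed conditions: the record layout (two u32 fields and one u16) and the function names are hypothetical, and `zlib` stands in for whatever backend is used. Whether the split helps, and by how much, depends on the data.

```python
import struct
import zlib

# Hypothetical record layout: u32 id, u32 timestamp, u16 status (10 bytes).
FIELD_WIDTHS = (4, 4, 2)
RECORD_SIZE = sum(FIELD_WIDTHS)


def rows_to_columns(blob):
    """Split packed fixed-width records into one byte stream per field."""
    if len(blob) % RECORD_SIZE:
        raise ValueError("input is not a whole number of records")
    columns, offset = [], 0
    for width in FIELD_WIDTHS:
        columns.append(b"".join(
            blob[start + offset:start + offset + width]
            for start in range(0, len(blob), RECORD_SIZE)
        ))
        offset += width
    return columns


def columns_to_rows(columns):
    """Inverse of rows_to_columns: interleave the field streams again."""
    count = len(columns[0]) // FIELD_WIDTHS[0]
    parts = []
    for row in range(count):
        for column, width in zip(columns, FIELD_WIDTHS):
            parts.append(column[row * width:(row + 1) * width])
    return b"".join(parts)


rows = b"".join(
    struct.pack("<IIH", 1000 + i, 1_700_000_000 + 3 * i, 200 if i % 7 else 404)
    for i in range(5000)
)
columns = rows_to_columns(rows)

print("zlib(rows):   ", len(zlib.compress(rows, 9)))
print("zlib(columns):", sum(len(zlib.compress(c, 9)) for c in columns))
```

Each resulting stream holds values of a single type, so a different transform can then be chosen per stream.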
SDDL (and the front-end task of reshaping data in general) is only one component of OpenZL. Once you have the streams, you can do all sorts of transformations to them that Zstd doesn't.
Ooh, thanks for mentioning these! I wasn't aware of the existence of these tools but yes it seems very possible that you could transform these other spec formats into SDDL descriptions. I'll check them out.
You'd have to tell OpenZL what your format looks like by writing a tokenizer for it, and annotating which parts are which. We aim to make this easier with SDDL [0], but today it is not powerful enough to parse JSON. However, you can do that in C++ or Python.
Additionally, it works well on numeric data in native format. But JSON stores it in ASCII. We can transform ASCII integers into int64 data losslessly, but it is very hard to transform ASCII floats into doubles losslessly and reliably.
However, given the work to parse the data (and/or massage it to a more friendly format), I would expect that OpenZL would work very well. Highly repetitive, numeric data with a lot of structure is where OpenZL excels.
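The point about ASCII numbers can be illustrated with a round-trip check. This is a sketch, not how OpenZL decides anything: the helper names are made up, and the float check assumes the writer would have used the shortest repr, which is exactly the kind of assumption that makes floats unreliable. A token only qualifies for the binary form if converting back yields the identical characters.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def int_is_lossless(text):
    """True if text -> int64 -> text would reproduce the exact characters."""
    try:
        value = int(text)
    except ValueError:
        return False
    return INT64_MIN <= value <= INT64_MAX and str(value) == text


def float_is_lossless(text):
    """True if text -> double -> shortest repr reproduces the exact characters."""
    try:
        value = float(text)
    except ValueError:
        return False
    return repr(value) == text


for token in ["42", "-17", "007", "+5", "0.5", "0.50", "1e3", "100.0"]:
    print(f"{token!r:>8}  int: {int_is_lossless(token)!s:<5}  float: {float_is_lossless(token)}")
```

Integers have few spellings to worry about (leading zeros, an explicit plus sign, negative zero), so a fallback to the text form is rarely needed. Floats have many equivalent spellings (`0.5`, `0.50`, `5e-1`), and the serializer's formatting rules are usually unknown.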
We developed OpenZL initially for our own consumption at Meta. More recently we've been putting a lot of effort into making this a usable tool for people who, you know, didn't develop OpenZL. Your feedback is welcome!
On the other hand, the default CSV profile didn't seem that great either: the CSV file was 349 MB and it compressed down to 119 MB, while a ZIP file of the same CSV is 105 MB.
Any plans to make it so one format can reference another format? Sometimes data of one type occurs within another format, especially with archive files, media container files, and disk images.
So, for example, suppose someone adds a JSON format to OpenZL. Then someone else adds a tar format. While parsing a tar file, if it contains foo.json, there could be some way of saying to OpenZL, "The next 1234 bytes are in the JSON format." (Maybe OpenZL's frames would allow making context shifts like this?)
A related thing that would also be nice is non-contiguous data. Some formats include another format but break up the inner data into blocks. For example, a network capture of a TCP stream would include TCP/IP headers, but the payloads of all the packets together constitute another stream of data in a certain format. (This might get memory intensive, though, since there's multiplexing, so you may need to maintain many streams/contexts.)
The OpenZL core supports arbitrary composition of graphs. So you can do this now via the compressor construction APIs. We just have to figure out how to make it easy to do.
It reminds me of EXI compression for XML, which can be heavily optimized with an XSD schema: the schema-aware mode also uses the schema graph for optimal compression:
https://www.w3.org/TR/exi-primer/
This method reminds me of how deep learning models get compressed for deployment on accelerators. You take advantage of different redundancies of different data structures and compress each of them using a unique method.
Specifically, the dictionary + delta-encoded + Huffman'd index lists method mentioned in TFA is commonly used for compressing weights. Weights tend to be sparse but clustered, meaning most offsets are small numbers with the occasional jump, which is great for Huffman.
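A small sketch of why clustered index lists suit delta + Huffman coding. The index pattern below is invented, and the bit count is simplified: it ignores the cost of storing the code table, which a real format would have to pay.

```python
import heapq
from collections import Counter
from itertools import accumulate


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical positions of nonzero weights: five dense clusters, far apart.
pattern = [1, 1, 2, 1, 1, 3]
indices = []
for start in (10, 900, 4_000, 20_000, 60_000):
    position = start
    for k in range(60):
        indices.append(position)
        position += pattern[k % len(pattern)]

# Delta-encode: first index, then the gap to each following index.
deltas = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]
assert list(accumulate(deltas)) == indices  # the transform is reversible

lengths = huffman_code_lengths(deltas)
huffman_bits = sum(lengths[d] for d in deltas)
fixed_bits = len(indices) * max(indices).bit_length()

print("fixed-width bits:", fixed_bits)
print("huffman bits:    ", huffman_bits)
```

The raw indices are all distinct, so an entropy coder gets nothing from them directly. After delta encoding, a handful of small gaps dominate and get short codes, while the rare jumps between clusters get long ones.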
We actually worked on a demo WAV compressor a while back. We are currently missing codecs to run the types of predictors that FLAC runs. We expect to add this kind of functionality in the future, in a generic way that isn't specific to audio, and can be used across a variety of domains.
But in general we wouldn't expect to beat FLAC. Rather, we expect to be able to offer specialized compressors for many types of data that previously weren't important enough to spawn a whole field of specialized compressors, by significantly lowering the bar for entry.
However, OpenZL is different in that you need to tell the compressor how to compress your data. The CLI tool has a few builtin "profiles" which you can specify with the `--profile` argument. E.g. csv, parquet, or le-u64. They can be listed with `./zli list-profiles`.
You can always use the `serial` profile, but because you haven't told OpenZL anything about your data, it will just use Zstandard under the hood. Training can learn a compressor, but it won't be able to learn a format like `.tar` today.
If you have raw numeric data you want to throw at it, or Parquet files or large CSV files, that's where I would expect OpenZL to perform really well.
Are you thinking about adding stream support? I.e., something along the lines of i) build up an efficient vocabulary up front for the whole data and then ii) compress by chunks, so it can be decompressed by chunks as well. This is important for seeking in data and stream processing.
Yes, definitely! Chunking support is currently in development. Streaming and seeking and so on are features we will certainly pursue as we mature towards an eventual v1.0.0.
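For readers unfamiliar with the pattern being asked about, here is a minimal sketch of "shared vocabulary plus independently decodable chunks". It uses Python's standard `zlib` with a preset dictionary, not OpenZL, whose chunking support is still in development per the reply above. The sample data, chunk size, dictionary choice, and function names are all assumptions made for the example.

```python
import zlib


def compress_chunks(data, chunk_size, zdict):
    """Compress fixed-size chunks independently, each primed with one shared dictionary."""
    frames = []
    for start in range(0, len(data), chunk_size):
        comp = zlib.compressobj(level=9, zdict=zdict)
        frames.append(comp.compress(data[start:start + chunk_size]) + comp.flush())
    return frames


def decompress_chunk(frame, zdict):
    """Decode one chunk without touching any other chunk."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(frame) + decomp.flush()


def read_range(frames, chunk_size, zdict, offset, length):
    """Return data[offset:offset + length], decoding only the chunks that overlap it."""
    if length <= 0:
        return b""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    joined = b"".join(decompress_chunk(f, zdict) for f in frames[first:last + 1])
    skip = offset - first * chunk_size
    return joined[skip:skip + length]


# Hypothetical input: repetitive log lines. The "vocabulary" here is just the
# first line; a real trainer would pick dictionary content far more carefully.
lines = [
    f'{{"level":"info","service":"api","request_id":{i},"status":200}}\n'.encode()
    for i in range(2000)
]
data = b"".join(lines)
zdict = lines[0]
CHUNK = 4096

frames = compress_chunks(data, CHUNK, zdict)
print("chunks:", len(frames), "compressed bytes:", sum(len(f) for f in frames))
print(read_range(frames, CHUNK, zdict, 5000, 60))
```

The trade-off is that matches cannot span chunk boundaries, so smaller chunks give finer-grained seeking but a worse ratio; the shared dictionary recovers part of that loss.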
I am trying to compress a file much larger than 2 GB, but I am getting this error:
Unhandled Exception:
Chunking support is required for compressing inputs larger than 2 GiB.
Can't we compress big files with OpenZL? I can't find anything about this error in the documentation.
You could have an LLM generate the SDDL description [0] for you, or even have it write a C++ or Python tokenizer. If compression succeeds, then it is guaranteed to round trip, as the LLM-generated logic lives only on the compression side, and the decompressor is agnostic to it.
It could be a problem that is well-suited to machine learning, as there is a clear objective function: Did compression succeed, and if so what is the compressed size.
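The objective function described here can be sketched as a tiny search loop. This is not OpenZL's trainer: the candidate "parsers" are just guesses at a fixed record width, the transform is a plain byte transposition, `zlib` is the stand-in backend, and all names are hypothetical. A candidate that fails, or that does not round-trip, scores `None`; otherwise its score is the compressed size.

```python
import struct
import zlib


def transpose(data, width):
    """Group byte 0 of every record together, then byte 1, and so on."""
    if width <= 0 or len(data) % width:
        raise ValueError("record width does not divide the input length")
    return b"".join(data[i::width] for i in range(width))


def untranspose(data, width):
    """Inverse of transpose for the same width."""
    count = len(data) // width
    return bytes(data[(i % width) * count + i // width] for i in range(len(data)))


def objective(data, width):
    """Compressed size if the candidate succeeds and round-trips, else None."""
    try:
        packed = zlib.compress(transpose(data, width), 9)
        restored = untranspose(zlib.decompress(packed), width)
    except ValueError:
        return None
    return len(packed) if restored == data else None


def search(data, widths):
    """Score every candidate width and return (best_width, all_scores)."""
    scores = {w: objective(data, w) for w in widths}
    valid = {w: s for w, s in scores.items() if s is not None}
    return min(valid, key=valid.get), scores


# Hypothetical input: 1001 records of (u32 id, u64 timestamp) = 12 bytes each.
data = b"".join(
    struct.pack("<IQ", i, 1_700_000_000_000 + 250 * i) for i in range(1001)
)
best, scores = search(data, [1, 4, 5, 8, 12, 16])
print("scores:", scores)
print("best width:", best)
```

Because a wrong guess can only fail or produce a larger output, never a corrupt one, the search is free to try arbitrary candidates, which is the property that makes generated parsers safe to experiment with.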
We left it out of the paper because it is an implementation detail that is absolutely going to change as we evolve the format. This is the function that actually does it [0], but there really isn't anything special here. There are some bit-packing tricks to save some bits, but nothing crazy.
Down the line, we expect to improve this representation to shrink it further, which is important for small data. We also want to allow moving this representation, or parts of it, into a dictionary for tiny data.
I've recently been wondering: could you re-compress gzip to a better compression format, while keeping all instructions that would let you recover a byte-exact copy of the original file? I often work with huge gzip files and they're a pain to work with, because decompression is slow even with zlib-ng.
precomp/antix/... are tools that can bruteforce the original gzip parameters and let you recreate the byte-identical gzip archive.
The output is something like {precomp header}{gzip parameters}{original uncompressed data} which you can then feed to a stronger compressor.
A major use case is if you have a lot of individually gzipped archives with similar internal content, you can precomp them and then use long-range solid compression over all your archives together for massive space savings.
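The brute-force idea can be sketched with Python's standard `zlib`. This is a toy version under stated assumptions: it only searches the compression level of a zlib stream produced by the same zlib build, whereas real tools like precomp also have to deal with gzip headers, window and memory settings, and different encoder versions. The function name is made up.

```python
import zlib


def find_zlib_level(stream):
    """Return (level, payload) such that zlib.compress(payload, level) == stream, or None."""
    payload = zlib.decompress(stream)
    for level in range(10):
        if zlib.compress(payload, level) == stream:
            return level, payload
    return None


# Hypothetical "archive" whose original compression level we pretend not to know.
original_data = b"".join(b"sample record %d, status ok\n" % i for i in range(5000))
archive = zlib.compress(original_data, 6)

found = find_zlib_level(archive)
if found is None:
    print("no level reproduces the stream; keep the original bytes as-is")
else:
    level, payload = found
    # Store (level, payload) instead of the archive; the payload can now be
    # fed to a stronger compressor, and the archive rebuilt byte-for-byte.
    print("level:", level, "payload bytes:", len(payload), "archive bytes:", len(archive))
```

Several levels can produce identical output on some inputs, so the recovered level is not necessarily the one originally used; all that matters is that it regenerates the exact bytes.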
I may be misunderstanding the question but that should be just decompressing gzip & compressing with something better like zstd (and saving the gzip options to compress it back), however it won't avoid compressing and decompressing gzip.
Is it beneficial for log compression, assuming you log to JSON but you don't know the schema up front?
I'm working on a log compression tool and I'm wondering whether OpenZL fits there.
I used to see as magic that the old original compression algorithms worked so well with generic text, without worrying about format, file type, structure or other things that could give hints of additional redundancy.
No, not really. They are both cool but solve different problems. The problem Basis solves is that GPUs don't agree on which compressed texture formats to support in hardware. Basis is a single compressed format that can be transcoded to almost any of the formats GPUs support, which is faster and higher quality than e.g. decoding a JPEG and then re-encoding to a GPU format.
The charts in the "Results With OpenZL" section compare against all levels of zstd, xz, and zlib.
On highly structured data where OpenZL is able to understand the format, it blows Zstandard and Xz out of the water. However, not all data fits this bill.
Congrats on the release. I was wondering what the zstd team is up to lately.
You mentioned something about grid structured data being in the plans - can you give more details?
Have you done experiments with compressing BCn GPU texture formats? They have a peculiar branched structure, with multiple sub formats packed tightly in bitfields of 64- or 128-bit blocks; due to the requirement of fixed ratio and random access by the GPU they still leave some potential compression on the table.
felixhandte|4 months ago
[0] https://news.ycombinator.com/item?id=45223827
perching_aix|4 months ago
bede|4 months ago
Edit: Have you any specific advice for training a fasta compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)
Gethsemane|4 months ago
felixhandte|4 months ago
magicalhippo|4 months ago
[1]: https://news.ycombinator.com/item?id=45437759 F3: Open-source data file format for the future [pdf] (125 comments)
lifthrasiir|4 months ago
TiredOfLife|4 months ago
snapplebobapple|4 months ago
nunobrito|4 months ago
When the data container is understood, the deduplication is far more efficient because now it is targeted.
Licensed as BSD-3-Clause, solid C++ implementation, well documented.
Looking forward to seeing new developments as more file formats are contributed.
mappu|4 months ago
maeln|4 months ago
terrelln|4 months ago
[0] https://openzl.org/api/c/graphs/sddl/
zzulus|4 months ago
terrelln|4 months ago
squirrellous|4 months ago
felixhandte|4 months ago
unknown|4 months ago
[deleted]
viraptor|4 months ago
felixhandte|4 months ago
dist-epoch|4 months ago
Unclear if this has enough "structure" for OpenZL.
terrelln|4 months ago
[0] https://openzl.org/api/c/graphs/sddl/
wmf|4 months ago
kingstnap|4 months ago
felixhandte|4 months ago
ionelaipatioaei|4 months ago
```
src/openzl/codecs/dispatch_string/encode_dispatch_string_binding.c:74: EI_dispatch_string: splitting 48000001 strings into 14 outputs
OpenZL Library Exception:
OpenZL error code: 55
OpenZL error string: Input does not respect conditions for this node
OpenZL error context:
Code: Input does not respect conditions for this node
Message: Check `eltWidth != 2' failed where:
lhs = (unsigned long) 4
rhs = (unsigned long) 2
Graph ID: 5
Stack Trace:
#0 doEntropyConversion (src/openzl/codecs/entropy/encode_entropy_binding.c:788): Check `eltWidth != 2' failed where:
lhs = (unsigned long) 4
rhs = (unsigned long) 2
#1 EI_entropyDynamicGraph (src/openzl/codecs/entropy/encode_entropy_binding.c:860): Forwarding error:
#2 CCTX_runGraph_internal (src/openzl/compress/cctx.c:770): Forwarding error:
#3 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
#4 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
#5 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
#6 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
#7 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
#8 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
#9 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
#10 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
#11 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
#12 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
#13 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
#14 CCTX_startCompression (src/openzl/compress/cctx.c:1276): Forwarding error:
#15 CCTX_compressInputs_withGraphSet_stage2 (src/openzl/compress/compress2.c:116): Forwarding error:
```
adrianmonk|4 months ago
felixhandte|4 months ago
hokkos|4 months ago
ohnoesjmr|4 months ago
porridgeraisin|4 months ago
xyzzy3000|4 months ago
p1mrx|4 months ago
So OpenZL is significantly better than zstd, but worse than flac.
terrelln|4 months ago
altcognito|4 months ago
fitzn|4 months ago
I am pumped to see this. Thanks for sharing.
bigwheels|4 months ago
Edit: @terrelln Got it, thank you!
terrelln|4 months ago
https://openzl.org/getting-started/quick-start/
michalsustr|4 months ago
felixhandte|4 months ago
yinnovator|4 months ago
jmakov|4 months ago
terrelln|4 months ago
[0] https://openzl.org/api/c/graphs/sddl/
yubblegum|4 months ago
terrelln|4 months ago
[0] https://github.com/facebook/openzl/blob/d1f05d0aa7b8d80627e5...
d33|4 months ago
mappu|4 months ago
o11c|4 months ago
artemisart|4 months ago
piterrro|4 months ago
[0] https://logdy.dev/logdy-pro
waustin|4 months ago
gmuslera|4 months ago
wmf|4 months ago
ttoinou|4 months ago
modeless|4 months ago
jmakov|4 months ago
terrelln|4 months ago
eyegor|4 months ago
telendram|4 months ago
TheMode|4 months ago
Havoc|4 months ago
Are the compression speed charts all like-for-like in terms of what is hardware-accelerated vs. not?
felixhandte|4 months ago
stepanhruda|4 months ago
felixhandte|4 months ago
Code: https://github.com/facebook/openzl
Documentation: https://openzl.org/
White Paper: https://arxiv.org/abs/2510.03203
unsigner|4 months ago
dang|4 months ago
goldforever|4 months ago
[deleted]
fnands|4 months ago
fnands|4 months ago