EdSchouten | 5 months ago

I’ve also been doing lots of experimenting with Content Defined Chunking since last year (for https://bonanza.build/). One of the things I discovered is that the most commonly used algorithm, FastCDC (also used by this project), can be improved significantly by looking ahead. An implementation of that can be found here:

https://github.com/buildbarn/go-cdc
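For context, here is a minimal, hypothetical Go sketch of the greedy FastCDC-style cut-point test built on a GEAR rolling hash. All constants (minimum, maximum, and average chunk sizes) and the table seed are made up for illustration; this is the baseline that a lookahead variant improves on, not the go-cdc algorithm itself.

```go
// Minimal sketch of GEAR-based content-defined chunking (FastCDC family).
// All constants are illustrative; this is the greedy baseline, not the
// lookahead variant implemented in buildbarn/go-cdc.
package main

import (
	"fmt"
	"math/rand/v2"
)

const (
	minSize = 2 << 10       // 2 KiB minimum chunk size (illustrative)
	maxSize = 64 << 10      // 64 KiB maximum chunk size (illustrative)
	mask    = (1 << 13) - 1 // a cut every ~8 KiB on average (past the minimum)
)

// gear maps each possible byte value to a fixed 64-bit constant.
// Any fixed seed works, as long as every producer uses the same table.
var gear = func() (t [256]uint64) {
	r := rand.New(rand.NewPCG(1, 2))
	for i := range t {
		t[i] = r.Uint64()
	}
	return
}()

// nextCut returns the length of the next chunk at the start of data.
// It rolls a GEAR hash over the bytes past the minimum chunk size and
// greedily cuts at the first position where the low bits are all zero.
func nextCut(data []byte) int {
	n := len(data)
	if n <= minSize {
		return n
	}
	if n > maxSize {
		n = maxSize
	}
	var h uint64
	for i := minSize; i < n; i++ {
		h = (h << 1) + gear[data[i]]
		if h&mask == 0 {
			return i + 1
		}
	}
	return n
}

func main() {
	// Chunk 1 MiB of pseudo-random data and print the resulting cuts.
	r := rand.New(rand.NewPCG(3, 4))
	data := make([]byte, 1<<20)
	for i := range data {
		data[i] = byte(r.Uint64())
	}
	for off := 0; off < len(data); {
		n := nextCut(data[off:])
		fmt.Printf("chunk at offset %7d, size %6d\n", off, n)
		off += n
	}
}
```

A lookahead variant would, roughly speaking, not commit to the first qualifying position but compare several candidate cut points before choosing; the exact criterion go-cdc uses is described in the repository.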

Scaevolus | 5 months ago

This lookahead is very similar to the "lazy matching" used in Lempel-Ziv compressors! https://fastcompression.blogspot.com/2010/12/parsing-level-1...

Did you compare it to Buzhash? I assume gearhash is faster given the simpler per-iteration structure. (Also, rand/v2's seeded generators might be better for gear table init than MT19937.)

EdSchouten | 5 months ago

Yeah, GEAR hashing is simple enough that I haven't considered using anything else.

Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 16,384 binary digits of π) would work just as well.
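To make that concrete, here is a hypothetical sketch of generating those 256 constants with math/rand/v2's seeded PCG generator, as suggested above. The seed values are arbitrary; any fixed, agreed-upon bit source (including digits of π) would serve equally well.

```go
// Sketch of GEAR table initialization: 256 x 64-bit constants (2 KiB),
// derived once from a fixed seed so every producer ends up with the
// same table. The PCG seed values below are arbitrary.
package main

import (
	"fmt"
	"math/rand/v2"
)

func seedGearTable() [256]uint64 {
	r := rand.New(rand.NewPCG(0x9e3779b97f4a7c15, 0x2545f4914f6cdd1d))
	var t [256]uint64
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}

func main() {
	t := seedGearTable()
	fmt.Printf("gear[0] = %#x\ngear[255] = %#x\n", t[0], t[255])
}
```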

rokkamokka | 5 months ago

What would you estimate the performance implications would be of using go-cdc instead of FastCDC in their cdc_rsync?

EdSchouten | 5 months ago

In my case I observed a ~2% reduction in data storage when storing and deduplicating various versions of the Linux kernel source tree (see the link above). But that figure also includes the space needed to store the original version.

If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.

quotemstr | 5 months ago

I wonder whether there's a role for AI here.

(Please don't hurt me.)

AI turns out to be useful for data compression (https://statusneo.com/creating-lossless-compression-algorith...) and RF modulation optimization (https://www.arxiv.org/abs/2509.04805).

Maybe it'd be useful to train a small model (probably of the SSM variety) to find optimal chunking boundaries.

EdSchouten | 5 months ago

Yeah, that's true. Having some kind of chunking algorithm that's aware of the content or file format could make it work even better. For example, it makes a lot of sense to chunk source files at function/scope boundaries.

In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on having algorithms that are not only easy to implement, but also easy to implement consistently. I think that MaxCDC implementation that I shared strikes a good balance in that regard.
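To make the function-boundary idea above concrete, here is a hypothetical Go sketch that derives candidate chunk boundaries from a source file's top-level declarations using go/parser. It is purely illustrative and not part of go-cdc.

```go
// Hypothetical illustration of format-aware chunking: derive candidate
// cut points from the top-level declarations of a Go source file, so
// that whole functions/types tend to stay within a single chunk.
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"os"
)

// declBoundaries returns the byte offsets at which each top-level
// declaration in src begins.
func declBoundaries(filename string, src []byte) ([]int, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, parser.SkipObjectResolution)
	if err != nil {
		return nil, err
	}
	var cuts []int
	for _, d := range f.Decls {
		cuts = append(cuts, fset.Position(d.Pos()).Offset)
	}
	return cuts, nil
}

func main() {
	// "example.go" is a placeholder input file for the sketch.
	src, err := os.ReadFile("example.go")
	if err != nil {
		panic(err)
	}
	cuts, err := declBoundaries("example.go", src)
	if err != nil {
		panic(err)
	}
	fmt.Println("candidate chunk boundaries:", cuts)
}
```

In practice such format-aware cuts would still have to satisfy the consistency requirement above, since every producer would need to parse the file identically to arrive at the same Merkle tree.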