top | item 24405250

(no title)

aurelian15 | 5 years ago

There are ways around this. See "content-aware chunking", e.g. implemented using rolling hashes [1]. This is for example what rsync does.

The idea is to make blocks (slightly) variable in size. Block boundaries are determined based on a limited window of preceding bytes. This way a change in one location will only have a limited impact on the following blocks.

[1] https://en.wikipedia.org/wiki/Rolling_hash

discuss

order

octoberfranklin|5 years ago

Rolling hashing is really only useful for finding nonaligned duplicates.

There isn't a way to advertise some "rolling hash value" in a way that allows other people with a differently-aligned copy to notice that you and them have some duplicated byte ranges.

Rolling hashes only work when one person (or two people engaged in a conversation, like rsync) already has both copies.

btschaegg|5 years ago

I think you misunderstood how the rolling hash is used in this context. It's not used to address a chunk; you'd use a plain old cryptographic hash function for that.

The rolling hash is used to find the chunk boundary: Hash a window before every byte (which is cheap with a rolling hash) and compare it against a defined bit mask. For example: Check if the first 20 bytes are zero. If so, you'd get chunks with about 2^20 bytes (1 MiB) average length.

As a good explanation, I'd encourage you to look at borgbackup's internals documentation: https://borgbackup.readthedocs.io/en/stable/internals.html