top | item 43174955

amzin | 1 year ago

Is there an FS that keeps only diffs in cloned files? It would be neat

rappatic | 1 year ago

I wondered that too.

If we only have two files, A and its duplicate B with some changes as a diff, this works pretty well. Even if the user deletes A, the OS could just apply the diff to the file on disk, unlink A, and assign B to that file.

But if we have A and two different diffs B1 and B2, then try to delete A, it gets a little murkier. Either you do the above process and recalculate the diff for B2 to make it a diff of B1; or you keep the original A floating around on disk, not linked to any file.

Similarly, if you try to modify A, you'd need to recalculate the diffs for all of its duplicates. Alternatively, you could do version tracking and have each duplicate's diff be against a specific version of A. Then every file would have a chain of diffs stretching back to its original content. Complex, but could be useful.

It's certainly an interesting concept but might be more trouble than it's worth.
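
A toy sketch of that rebasing step, in pure Python on strings rather than disk blocks (all names and the storage layout are hypothetical, just to illustrate the idea):

```python
import difflib

def make_diff(base, new):
    """Record SequenceMatcher opcodes plus the replacement text from `new`."""
    sm = difflib.SequenceMatcher(a=base, b=new)
    return [(tag, i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes()]

def apply_diff(base, diff):
    """Rebuild the derived content from the base content plus a diff."""
    return ''.join(base[i1:i2] if tag == 'equal' else repl
                   for tag, i1, i2, repl in diff)

# A is stored whole; B1 and B2 are stored as diffs against A.
A  = "the quick brown fox jumps over the lazy dog"
B1 = "the quick red fox jumps over the lazy dog"
B2 = "the quick brown fox leaps over the lazy dog"
store = {"A": A, "B1": make_diff(A, B1), "B2": make_diff(A, B2)}

# Deleting A: materialize B1, then rebase B2 so it becomes a diff of B1.
b1_full = apply_diff(store.pop("A"), store["B1"])
store["B2"] = make_diff(b1_full, apply_diff(A, store["B2"]))
store["B1"] = b1_full
```

The rebase cost is the murky part: deleting one file forces a read-and-rediff of every duplicate downstream of it.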

abrookewood | 1 year ago

ZFS does this by de-duplicating at the block level, not the file level. It means you can do what you want without needing to keep track of a chain of differences between files. Note that de-duplication on ZFS has had issues in the past, so there is definitely a trade-off. A newer version of de-duplication sounds interesting, but I don't have any experience with it: https://www.truenas.com/docs/references/zfsdeduplication/
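
A minimal sketch of the block-level approach (fixed-size blocks keyed by hash; the class and block size are made up for illustration, not ZFS's actual on-disk layout):

```python
import hashlib

BLOCK = 4096  # hypothetical block size

class DedupStore:
    """Toy block-level dedup: identical blocks are stored once, keyed by hash."""
    def __init__(self):
        self.blocks = {}   # sha256 digest -> block bytes
        self.files = {}    # name -> ordered list of digests

    def write(self, name, data):
        digests = []
        for off in range(0, len(data), BLOCK):
            block = data[off:off + BLOCK]
            d = hashlib.sha256(block).digest()
            self.blocks.setdefault(d, block)  # store each unique block once
            digests.append(d)
        self.files[name] = digests

    def read(self, name):
        return b''.join(self.blocks[d] for d in self.files[name])

store = DedupStore()
store.write("a", b"x" * BLOCK * 3)
store.write("b", b"x" * BLOCK * 3)  # identical content: no new blocks stored
```

Because files are just lists of block references, no file needs to know about any other file, which is exactly why no diff chain is needed.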

UltraSane | 1 year ago

VAST storage does something like this. Unlike most storage arrays, which identify identical blocks by hash and store each one only once, VAST uses a content-aware hash, so hashes of similar blocks are also similar. It stores a reference block for each unique hash; when new data comes in and is hashed, the most similar reference block is used as the base for byte-level deltas. In practice this works extremely well.

https://www.vastdata.com/blog/breaking-data-reduction-trade-...
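
VAST's actual scheme is proprietary, but the idea can be sketched with a toy locality-sensitive signature (a coarse byte histogram) and an XOR delta; everything here, including the similarity threshold, is an invented stand-in:

```python
def signature(block, buckets=16):
    """Toy content-aware hash: a coarse byte-value histogram.
    Similar blocks get numerically close signatures (unlike SHA-256)."""
    hist = [0] * buckets
    for b in block:
        hist[b * buckets // 256] += 1
    return tuple(hist)

def distance(s1, s2):
    return sum(abs(x - y) for x, y in zip(s1, s2))

def delta(ref, block):
    """Byte-level delta against the chosen reference block (here just XOR)."""
    return bytes(a ^ b for a, b in zip(ref, block))

refs = {}  # signature -> reference block

def store(block):
    """Store a block as a delta against the most similar reference, if any."""
    sig = signature(block)
    if refs:
        best = min(refs, key=lambda s: distance(s, sig))
        if distance(best, sig) < len(block) // 4:  # arbitrary threshold
            return ("delta", best, delta(refs[best], block))
    refs[sig] = block
    return ("ref", sig, None)
```

The point of the similar-hashes-are-similar property is that finding a good delta base becomes a nearest-neighbor lookup instead of an exact-match lookup.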

OnlyMortal | 1 year ago

That’s very interesting. Typically a Rabin fingerprint is used to identify identical chunks of data.

Identifying similar blocks, and maybe re-chunking within them, isn't something I've ever considered.

abrookewood | 1 year ago

ZFS: "The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space." (emphasis added)

https://www.truenas.com/docs/references/zfsdeduplication/

jonhohle | 1 year ago

APFS shares blocks, so only the blocks that change become unshared. Since a block is the smallest atomic unit in a FS (except maybe an inode), that's the best granularity to expect.

the8472 | 1 year ago

With extent-based filesystems you can clone a file's extents and then overwrite one extent; only that extent becomes unshared.
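
A tiny in-memory model of that copy-on-write behavior (the names are made up; real filesystems like Btrfs and XFS track this in their extent trees):

```python
# A "file" is just a list of references to shared extents (here, bytearrays).
def clone(file):
    """Cloning copies only the extent references, not the data."""
    return list(file)

def write_extent(file, idx, data):
    """Overwriting an extent: copy-on-write swaps in a new private extent."""
    file[idx] = bytearray(data)  # only this reference changes; others stay shared

e0, e1, e2 = bytearray(b"aaaa"), bytearray(b"bbbb"), bytearray(b"cccc")
original = [e0, e1, e2]
copy = clone(original)

write_extent(copy, 1, b"BBBB")
```

On Linux, `cp --reflink=always` requests exactly this kind of extent clone on filesystems that support it.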