rappatic|1 year ago
If we only have two files, A and its duplicate B with some changes as a diff, this works pretty well. Even if the user deletes A, the OS could just apply the diff to the file on disk, unlink A, and assign B to that file.
But if we have A and two different diffs B1 and B2, then try to delete A, it gets a little murkier. Either you do the above process and recalculate the diff for B2 to make it a diff of B1; or you keep the original A floating around on disk, not linked to any file.
Similarly, if you try to modify A, you'd need to recalculate the diffs for all the duplicates. Alternatively, you could do version tracking and pin each duplicate's diff to a specific version of A. Then every file would have a chain of diffs stretching back to its original content. Complex, but could be useful.
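The delete-A case can be sketched in a few lines of Python (the opcode-based diff format and the dict-as-filesystem are purely illustrative, not any real FS mechanism):

```python
import difflib

def make_diff(base, target):
    """Record only the operations needed to rebuild target from base."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=base, b=target).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse bytes from the base
        else:
            ops.append(("insert", target[j1:j2]))  # store only the new bytes
    return ops

def apply_diff(base, ops):
    out = []
    for op in ops:
        if op[0] == "copy":
            out.append(base[op[1]:op[2]])
        else:
            out.append(op[1])
    return "".join(out)

# A plus two independent diffs B1 and B2, as in the scenario above.
A = "the quick brown fox\n" * 4
B1 = A.replace("quick", "slow", 1)
B2 = A.replace("fox", "dog")
store = {"A": A, "B1": make_diff(A, B1), "B2": make_diff(A, B2)}

def delete_A(store):
    """Materialize B1 as the new base file and rebase B2's diff onto it."""
    base = store.pop("A")
    b1 = apply_diff(base, store["B1"])
    b2 = apply_diff(base, store["B2"])
    store["B1"] = b1                  # B1 now owns the full content
    store["B2"] = make_diff(b1, b2)   # B2 becomes a diff of B1

delete_A(store)
```

The rebase is where the cost hides: deleting A forces a read and re-diff of every dependent file.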
It's certainly an interesting concept but might be more trouble than it's worth.
abrookewood|1 year ago
ZFS does this by de-duplicating at the block level, not the file level, which gets you what you want without keeping track of a chain of differences between files. Note that de-duplication on ZFS has had issues in the past, so there is definitely a trade-off. A newer version of de-duplication sounds interesting, but I don't have any experience with it: https://www.truenas.com/docs/references/zfsdeduplication/
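Conceptually, block-level dedup is just a content-addressed block store with refcounts. A toy sketch (the fixed 4 KiB block size and SHA-256 are simplifying assumptions, not ZFS's on-disk format):

```python
import hashlib

BLOCK = 4096
blocks = {}      # content hash -> block bytes, stored exactly once
refcounts = {}   # content hash -> number of references

def write_file(data):
    """Split data into blocks; return the hash list that reconstructs it."""
    ptrs = []
    for off in range(0, len(data), BLOCK):
        chunk = data[off:off + BLOCK]
        h = hashlib.sha256(chunk).hexdigest()
        if h not in blocks:
            blocks[h] = chunk        # only the first copy pays for storage
        refcounts[h] = refcounts.get(h, 0) + 1
        ptrs.append(h)
    return ptrs

def read_file(ptrs):
    return b"".join(blocks[h] for h in ptrs)

a = write_file(b"x" * 8192)
b = write_file(b"x" * 8192)   # identical content: no new blocks stored
```

Because files only hold hash pointers, deleting one file is just decrementing refcounts; no diff chains to rewrite.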
UltraSane|1 year ago
VAST storage does something like this. Unlike most storage arrays, which identify identical blocks by hash and store them only once, VAST uses a content-aware hash, so the hashes of similar blocks are also similar. It stores a reference block for each unique hash; when new data comes in and is hashed, the most similar reference block is used to create byte-level deltas against. In practice this works extremely well.
https://www.vastdata.com/blog/breaking-data-reduction-trade-...
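A crude sketch of the idea (the min-hash "content-aware" sketch below is an illustrative stand-in, not VAST's actual algorithm):

```python
import difflib, hashlib

def sketch(block, k=8):
    """Min-hash over overlapping shingles: similar blocks tend to share it."""
    digests = [hashlib.blake2b(block[i:i + k], digest_size=4).digest()
               for i in range(len(block) - k + 1)]
    return min(digests)

refs = {}   # sketch -> reference block stored in full

def store(block):
    s = sketch(block)
    if s not in refs:
        refs[s] = block      # first of its kind: becomes the reference
        return ("ref", s)
    # A similar reference exists: keep only the byte-level delta against it.
    ops = difflib.SequenceMatcher(a=refs[s], b=block).get_opcodes()
    delta = [(tag, i1, i2, b"" if tag == "equal" else block[j1:j2])
             for tag, i1, i2, j1, j2 in ops]
    return ("delta", s, delta)

def load(rec):
    if rec[0] == "ref":
        return refs[rec[1]]
    _, s, delta = rec
    base = refs[s]
    return b"".join(base[i1:i2] if tag == "equal" else data
                    for tag, i1, i2, data in delta)

base = bytes(range(256)) * 16
rec1 = store(base)
rec2 = store(base[:100] + b"CHANGED" + base[107:])  # near-duplicate block
```

The point of the similarity hash is that near-duplicate blocks, which exact-match dedup would store in full, collapse to a reference plus a small delta.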
ZFS: "The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space." (emphasis added)
APFS shares blocks between clones, so only the blocks that changed stop being shared. Since a block is the smallest atomic unit in a FS (except maybe an inode), that's the best level of granularity to expect.
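A tiny model of that copy-on-write block sharing (block lists standing in for file extents; nothing here is APFS's real data structure):

```python
blocks = []   # the block store; list indices stand in for block addresses

def write_block(data):
    blocks.append(data)
    return len(blocks) - 1

# A file is just a list of block addresses.
original = [write_block(b"A" * 4096), write_block(b"B" * 4096)]
clone = list(original)            # cloning copies pointers, not data

# Writing to the clone allocates a new block for the changed data only;
# the unchanged block stays shared with the original.
clone[1] = write_block(b"C" * 4096)
```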
OnlyMortal|1 year ago
Identifying similar blocks and maybe sub-rechunking isn't something I've ever considered.
p_ing|1 year ago
https://www.microsoft.com/en-us/download/details.aspx?id=397...