I used a similar trick to stuff 14 Linux distros onto a 6 GiB SD card image using a btrfs filesystem. The distros share a lot of common files, so this works well to save space. (btrfs also supports the same data-block-sharing CoW feature as APFS.)
That's really cool. There are a bunch of tools that will let you symlink or hard link deduplicate files, but being able to do block-level dedupes simply by leaning on the filesystem is nice.
It sometimes feels like games are made to thwart this type of thing. They often use packfiles: basically filesystems within files, optimized to look up assets quickly. Perhaps they also allowed optimized data layouts back when consoles had slow spinning hard drives. The upshot is that a tiny patch inserting a line of code in a script may offset hundreds of megabytes of other data in the packfiles, causing the block hashes to no longer match up. Do any filesystems model inserts in some way? I'm pretty sure Steam updates can handle situations like that: I frequently see updates which download a tiny amount (kilobytes) but write a huge amount to disk (gigabytes), and I can't think of any other cause (assuming developers aren't using hilariously uncompressed assets).
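The offset problem described here is easy to demonstrate. Below is a small sketch (my own illustration, not anything from the article): with fixed-size block hashing, inserting a single byte near the front of a file shifts every subsequent block, so essentially no block hashes survive the edit.

```python
import hashlib
import os

def block_hashes(data: bytes, size: int = 4096) -> list[str]:
    """Hash fixed-size blocks; any byte insertion shifts all later blocks."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

old = os.urandom(4096 * 64)                  # a 256 KiB "packfile"
new = old[:1000] + b"!" + old[1000:]         # a one-byte insert near the start

old_set = set(block_hashes(old))
new_h = block_hashes(new)
shared = sum(h in old_set for h in new_h)
print(f"{shared}/{len(new_h)} blocks of the patched file still match")
```

Content-defined chunking (as used by tools like casync and restic) avoids this by picking chunk boundaries from the data itself rather than from fixed offsets.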
Yes, Steamworks encourages developers to align assets to certain boundaries to help ensure efficient patching later, see the documentation here: https://partner.steamgames.com/doc/sdk/uploading
Some games just add an update pak that overrides the changed files, without touching the original pak file. The installation size of such games can grow a lot with updates, wasting space by keeping the old files, but at least the files are immutable and rollbacks of updates are theoretically possible.
Steam resisted any patching mechanism for updates for a long time. Even when they revamped the update system [0], they instead opted to split updates into separate files in their own pack format [1]. They seem to have relented, though - I guess other developers were not as motivated to optimize their asset layout for Steam when Valve is the one paying for the bandwidth.
Coincidentally, I just saw that the 2023 "USENIX Test of Time Award" went to an analysis of real-world files for deduplication purposes. I found Figure 1 particularly interesting: partial deduplication didn't save much over whole-file deduplication in practice.
Very cool and probably a fun exercise, but I would probably put the data on a ZFS volume with dedup instead, which, from reading this implementation, is pretty much the same thing to my layman's eye. I could also add compression to the same dataset.
My recent copy-on-write footgun was using ZFS on Ubuntu. There is some automatic config that takes a ZFS snapshot each time you apt install something. This led to wondering why on earth my disk usage was not dropping from 100% no matter how many files I deleted: all those deleted files still existed in the snapshots. That was coupled with some bug that made the snapshots report their size as much smaller than it really was.
That's neat! In retrospect, did the database make the problem more tractable from an algorithmic or practical standpoint, or was it mostly just using the tools you're familiar with? If I were to approach the same problem, I likely would have tried to keep all the data in RAM and serialize it out to a file for incremental runs.
IME a database does not improve anything algorithmically, but it does put at your disposal some very efficient data structures that are definitely nontrivial to replicate. This is especially the case if you have to work with datasets that don't fit in memory!
There are many other products which implement "keep track of partial file contents to deduplicate", for example casync, borg, and restic.
None of them use SQLite for metadata storage, which kinda makes sense: in the author's schema, each block uses a (file_id, block_index, hash) tuple, which is 2/3 wasted space, as the file_ids are mostly the same (if files are big) and block_index is always increasing. A custom data structure would be both faster and more space-efficient.
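For concreteness, a minimal sketch of the kind of schema being discussed (my reconstruction from the description above, not the article's actual DDL - table and column names are assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, path TEXT UNIQUE);
    -- One row per block: the (file_id, block_index, hash) layout.
    -- file_id repeats for every block of a file and block_index just
    -- counts up, which is the redundancy theamk is pointing at.
    CREATE TABLE blocks (
        file_id     INTEGER REFERENCES files(file_id),
        block_index INTEGER,
        hash        INTEGER,    -- 64-bit truncated hash stored as integer
        PRIMARY KEY (file_id, block_index)
    ) WITHOUT ROWID;
""")
con.execute("INSERT INTO files(path) VALUES ('a.bin')")
con.execute("INSERT INTO files(path) VALUES ('a_copy.bin')")
# Two files with identical block hashes (fake hash values for the demo):
rows = [(fid, i, 1000 + i) for fid in (1, 2) for i in range(4)]
con.executemany("INSERT INTO blocks VALUES (?, ?, ?)", rows)

# Shared blocks fall out of a simple GROUP BY:
dupes = con.execute(
    "SELECT hash, COUNT(*) FROM blocks GROUP BY hash HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # each block hash appears in both files
```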
I have used SQLite to deduplicate csvs when it couldn’t all fit into memory.
I imagine Dask would be a better tool for this job but I wasn’t familiar with it at the time and it did allow me to deduplicate using Python and no external dependencies.
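A sketch of how that can work, assuming the goal is exact-duplicate rows and a CSV too big for RAM: a UNIQUE constraint on an on-disk SQLite table acts as the "seen" set, so memory use stays flat regardless of input size. (This is my illustration of the approach, not the commenter's actual code.)

```python
import csv
import os
import sqlite3
import tempfile

def dedupe_csv(src: str, dst: str) -> None:
    """Stream rows through an on-disk SQLite table with a PRIMARY KEY,
    writing each row to dst only the first time it is seen."""
    with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as tmp:
        db = tmp.name
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE seen (row TEXT PRIMARY KEY)")
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            key = "\x1f".join(row)  # unit separator joins fields safely
            # INSERT OR IGNORE modifies 0 rows when the key already exists
            cur = con.execute("INSERT OR IGNORE INTO seen VALUES (?)", (key,))
            if cur.rowcount:
                writer.writerow(row)
    con.close()
    os.unlink(db)
```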
"There is no file path, because macOS lets you look up files by id. The hash values are cryptographic hashes truncated to 64 bits and reinterpreted as integers."
Is the author implying that APFS or HFS uses this method to calculate the file ID? I am unable to find any information regarding this. From what I understand, w.r.t APFS, the file ID is a combination of the inode OID and genID.
The hash values are part of his implementation and don’t have anything to do with APFS or HFS. The sentence after your quote says, “I picked the BLAKE3 hash function…” indicating it was something he chose, not a FS implementation detail.
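The truncate-and-reinterpret step is straightforward; here is a sketch of what it might look like. The article uses BLAKE3, which isn't in the Python standard library, so stdlib BLAKE2b stands in below - the truncation idea is the same. Storing the result as a signed 64-bit integer matches SQLite's native INTEGER column type.

```python
import hashlib

def hash64(data: bytes) -> int:
    """Cryptographic hash truncated to 64 bits, reinterpreted as a
    signed integer (fits SQLite's 8-byte INTEGER storage class).
    BLAKE2b stands in for the article's BLAKE3 here."""
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little", signed=True)
```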
With a b-tree you can do even better: instead of hashing entire files ahead of time, you can avoid hashing more than the minimum prefix required to conclude a file is unique, by building the b-tree lazily. I don't think it can be done with SQLite though, as it requires hashing the files while traversing the b-tree.
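The lazy-prefix idea can be sketched without a literal b-tree: repeatedly bucket candidate files by the hash of their next chunk, and stop hashing any file as soon as its bucket has size one. Files that stay together until EOF are duplicates. (This is my illustration of the principle, not the commenter's design; `CHUNK` is an arbitrary choice.)

```python
import hashlib

CHUNK = 1 << 16  # 64 KiB per probe; an assumed, tunable value

def find_unique(paths):
    """Yield files that are unique, hashing only as much prefix of each
    file as is needed to tell it apart from the others."""
    def split(group, offset):
        if len(group) == 1:
            yield group[0]           # alone in its bucket: provably unique
            return
        buckets = {}
        for path in group:
            with open(path, "rb") as f:
                f.seek(offset)
                chunk = f.read(CHUNK)
            if not chunk:
                continue             # hit EOF together: duplicates, drop
            key = hashlib.sha256(chunk).digest()
            buckets.setdefault(key, []).append(path)
        for bucket in buckets.values():
            yield from split(bucket, offset + CHUNK)
    yield from split(list(paths), 0)
```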
The other day I found out that not only do ZFS and Btrfs have a dedup feature, but XFS (which I use on my home partition) also supports a feature called reflinks, aka block sharing. So I threw duperemove at ~/.virtualenvs and ~/.cache/bazel and freed up a good 15 GB of disk space. Good stuff.
PostgreSQL is Turing-complete, as proven by implementations of a cyclic tag system[0] and Turing machine[1]. The Mandelbrot set[2] and travelling-salesman problem[3] have also been implemented in it.
Transact-SQL is also Turing-complete, as proven by a Brainfuck implementation[4].
With that, you can theoretically compute anything :).
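You don't need the full Turing-completeness machinery to compute things in SQL, of course - a recursive CTE already gives you iteration, and SQLite supports them too. A small taste (my own example):

```python
import sqlite3

# Recursive CTEs express general iteration in plain SQL.
# Here: the first ten Fibonacci numbers, computed entirely by SQLite.
con = sqlite3.connect(":memory:")
rows = con.execute("""
    WITH RECURSIVE fib(n, a, b) AS (
        SELECT 1, 0, 1
        UNION ALL
        SELECT n + 1, b, a + b FROM fib WHERE n < 10
    )
    SELECT a FROM fib ORDER BY n
""").fetchall()
print([r[0] for r in rows])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```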
Oh sweet. I've definitely used SQLite (in a zsh script) for file deduplication. Very simple, mostly just rows consisting of paths and file content hashes.
But partial file deduplication is something else...
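The whole-file version is indeed simple enough to sketch in a few lines - one table of (path, hash), and duplicate groups fall out of a GROUP BY. (A Python rendering of the zsh-script idea, with assumed table and file names.)

```python
import hashlib
import os
import sqlite3

def index_tree(root: str, db: str) -> list:
    """Hash every file under root into a SQLite table, then report
    groups of paths whose whole-file hashes collide."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS f (path TEXT PRIMARY KEY, hash TEXT)")
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            with open(p, "rb") as fh:
                h = hashlib.sha256(fh.read()).hexdigest()
            con.execute("INSERT OR REPLACE INTO f VALUES (?, ?)", (p, h))
    con.commit()
    return con.execute("""
        SELECT hash, GROUP_CONCAT(path) FROM f
        GROUP BY hash HAVING COUNT(*) > 1
    """).fetchall()
```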
I appreciate seeing this article that I wrote 4 years ago resurface :) Happy to answer any questions about it or the tech behind it.
We also use a SQLite database, and in a manner similar to the OP’s article. We use it to track which content-defined chunks are at which offsets in which files on disk. We deduplicate those chunks on the CDN so you have to download less data when updating the game, but on disk we need to recreate the files as originally uploaded because that’s how games load them.
We’ve expanded the use of this tech quite a bit since the article. For example, my team has started using the patcher internally to manage our 2-million-file vendored code and binaries repo, as a replacement for Git LFS of sorts. Cloning the repo from scratch became 10 times faster, and so did switching branches, especially for remote devs, since the files are served through CloudFront.
Some of the more interesting work we’ve done recently has been optimizing the core content-defined chunking algorithm. The GearHash rolling hash algorithm from FastCDC is decently fast, but we were able to improve chunking speeds by 5x with vectorized implementations and using AES-NI instructions in a clever way to avoid the GearHash table lookups. It can do 11GB/sec per core now using AVX-512 instructions.
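For readers unfamiliar with gear hashing, here is a toy scalar chunker in the FastCDC family - my own illustration of the mechanism the vectorized/AES-NI versions accelerate, with arbitrary MIN/MAX/mask parameters, nothing like production-tuned:

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # the per-byte lookup table
MASK = (1 << 13) - 1        # 13-bit mask -> ~8 KiB average chunks
MIN, MAX = 2048, 65536      # chunk size bounds

def chunks(data: bytes):
    """Yield content-defined chunks: a boundary is declared wherever the
    rolling gear hash's low bits are all zero. Because the hash depends
    only on the last ~64 bytes, boundaries resynchronize after an edit."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i + 1 - start
        if (length >= MIN and (h & MASK) == 0) or length >= MAX:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]
```

Unlike the fixed-block scheme, a one-byte insertion only disturbs the chunk it lands in; later chunk boundaries line back up, which is what makes the CDN-side dedup effective across game versions.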
Another thing we did recently was build a tool for repacking Unreal Engine Pak files so game assets maintain a consistent layout across versions, similar to what garaetjjte mentioned in a comment above about UnityFS files. This reduced the disk I/O needed to update VALORANT by over 90% and cut update times by half for players. The combination of content-defined chunking with tooling to make game data files friendlier to differential patching can make a huge difference.
This is quite amazing. I actually built something even crazier with SQLite, where I broke the files up into parts and then made hashes of the parts, so in total it would use less space than the sum of the files' sizes. I used this for a similarity engine where I would try to see how similar different files were.
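The similarity part of that idea can be sketched with block-hash sets and a Jaccard index (my illustration of the general approach, not the commenter's implementation):

```python
import hashlib

def block_signature(data: bytes, size: int = 4096) -> set:
    """Represent a file as the set of its block hashes; identical blocks
    across files collapse to one entry, which is where the space saving
    comes from."""
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}

def similarity(a: set, b: set) -> float:
    """Jaccard index over block hashes: 1.0 = same blocks, 0.0 = disjoint."""
    return len(a & b) / len(a | b) if a | b else 1.0
```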
Slightly off topic: there was a project I saw on GitHub that claimed to support system administration using relational tables - something like "everything in SQL". I thought it might be a cool idea.
https://xnux.eu/p-boot-demo/
I had to write a custom tool to do it efficiently during image creation.
[0] https://developer.valvesoftware.com/wiki/SteamPipe
[1] https://developer.valvesoftware.com/wiki/VPK
https://www.usenix.org/conferences/test-of-time-awards
But doesn't ZFS dedupe eat RAM for breakfast ? Double-digit GB RAM per TB data IIRC ?
An added bonus is that you can use SQL.
https://lib.rs/crates/dupe-krill#readme-nerding-out-about-th...
(Not (purely) a reference to the “oh hey Mark” meme, I’m vaguely acquainted with Mark from LUG days)
[0]: https://wiki.postgresql.org/wiki/Cyclic_Tag_System
[1]: https://blog.coelho.net/database/2013/08/17/turing-sql-1.htm...
[2]: https://wiki.postgresql.org/wiki/Mandelbrot_set
[3]: https://web.archive.org/web/20201111224603/http://assets.en....
[4]: https://stackoverflow.com/questions/900055/is-sql-or-even-ts...