I used a similar trick to stuff 14 Linux distros onto a 6 GiB SD card image using a btrfs filesystem. The distros share a lot of common files, so this works well to save space. (btrfs also supports the same data-block-sharing CoW feature as APFS.)
That's really cool. There are a bunch of tools that will let you symlink or hard link deduplicate files, but being able to do block-level dedupes simply by leaning on the filesystem is nice.
It sometimes feels like games are made to thwart this type of thing. They often use packfiles: basically filesystems within files, optimized to look up assets quickly. Perhaps they also allowed optimized data layouts back when consoles had slow spinning hard drives. The upshot is that a tiny patch inserting a line of code in a script may offset hundreds of megabytes of other data in the packfiles, causing the block hashes to no longer match up. Do any filesystems model inserts in some way? I'm pretty sure Steam updates can handle situations like that: I frequently see updates which download a tiny amount (kilobytes) but write a huge amount to disk (gigabytes), and I can't think of any other cause (assuming developers aren't using hilariously uncompressed assets).
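The offset problem described here is easy to demonstrate. Below is a small sketch (my own illustration, not anything from the article): with fixed-size block hashing, inserting a single byte near the front of a file shifts every subsequent block, so essentially no block hashes survive the edit.

```python
import hashlib
import os

def block_hashes(data: bytes, size: int = 4096) -> list[str]:
    """Hash fixed-size blocks; any byte insertion shifts all later blocks."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

old = os.urandom(4096 * 64)                  # a 256 KiB "packfile"
new = old[:1000] + b"!" + old[1000:]         # a one-byte insert near the start

old_set = set(block_hashes(old))
new_h = block_hashes(new)
shared = sum(h in old_set for h in new_h)
print(f"{shared}/{len(new_h)} blocks of the patched file still match")
```

Content-defined chunking (as used by tools like casync and restic) avoids this by picking chunk boundaries from the data itself rather than from fixed offsets.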
Yes, Steamworks encourages developers to align assets to certain boundaries to help ensure efficient patching later, see the documentation here: https://partner.steamgames.com/doc/sdk/uploading
Some games just add an update pak that overrides the changed files, without touching the original pak file. The installation size of such games can grow a lot with updates, wasting space by keeping the old files, but at least the files are immutable and rollbacks of updates are theoretically possible.
Steam resisted any patching mechanism for updates for a long time. Even when they revamped the update system [0], they instead opted to split updates into separate files in their own pack format [1]. They seem to have relented, though - I guess other developers were not as motivated to optimize their asset layout for Steam when Valve is the one paying for the bandwidth.
Coincidentally, I just saw that the 2023 "USENIX Test of Time Award" went to an analysis of real-world files for deduplication purposes. I found Figure 1 particularly interesting: partial deduplication didn't save much over whole-file deduplication in practice.
Very cool and probably a fun exercise, but I would probably put the data on a ZFS volume with dedup instead, which, from reading this implementation, is pretty much the same thing to my layman's eye. I could also add compression to the same dataset.
My recent copy-on-write footgun was using ZFS on Ubuntu. There is some automatic config that takes a ZFS snapshot each time you apt install something. This led to wondering why on earth my disk usage was not dropping from 100% no matter how many files I deleted: all those deleted files still existed in the snapshots. That was coupled with some bug that made the snapshots report their size as much smaller than it really was.
That's neat! In retrospect, did the database make the problem more tractable from an algorithmic or practical standpoint, or was it mostly just using the tools you're familiar with? If I were to approach the same problem, I likely would have tried to keep all the data in RAM and serialize it out to a file for incremental runs.
IME a database does not improve anything algorithmically, but it does put at your disposal some very efficient data structures that are definitely nontrivial to replicate. This is especially the case if you have to work with datasets that don't fit in memory!
There are many other products which implement "keep track of partial file contents to deduplicate", for example casync, borg, and restic.
None of them use SQLite for metadata storage, which kinda makes sense: in the author's schema, each block uses a (file_id, block_index, hash) tuple, which is 2/3 wasted space, as the file_ids are mostly the same (if files are big) and block_index is always increasing. A custom data structure would be both faster and more space-efficient.
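For concreteness, a minimal sketch of the kind of schema being discussed (my reconstruction from the description above, not the article's actual DDL - table and column names are assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, path TEXT UNIQUE);
    -- One row per block: the (file_id, block_index, hash) layout.
    -- file_id repeats for every block of a file and block_index just
    -- counts up, which is the redundancy theamk is pointing at.
    CREATE TABLE blocks (
        file_id     INTEGER REFERENCES files(file_id),
        block_index INTEGER,
        hash        INTEGER,    -- 64-bit truncated hash stored as integer
        PRIMARY KEY (file_id, block_index)
    ) WITHOUT ROWID;
""")
con.execute("INSERT INTO files(path) VALUES ('a.bin')")
con.execute("INSERT INTO files(path) VALUES ('a_copy.bin')")
# Two files with identical block hashes (fake hash values for the demo):
rows = [(fid, i, 1000 + i) for fid in (1, 2) for i in range(4)]
con.executemany("INSERT INTO blocks VALUES (?, ?, ?)", rows)

# Shared blocks fall out of a simple GROUP BY:
dupes = con.execute(
    "SELECT hash, COUNT(*) FROM blocks GROUP BY hash HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # each block hash appears in both files
```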
I have used SQLite to deduplicate csvs when it couldn’t all fit into memory.
I imagine Dask would be a better tool for this job but I wasn’t familiar with it at the time and it did allow me to deduplicate using Python and no external dependencies.
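A sketch of how that can work, assuming the goal is exact-duplicate rows and a CSV too big for RAM: a UNIQUE constraint on an on-disk SQLite table acts as the "seen" set, so memory use stays flat regardless of input size. (This is my illustration of the approach, not the commenter's actual code.)

```python
import csv
import os
import sqlite3
import tempfile

def dedupe_csv(src: str, dst: str) -> None:
    """Stream rows through an on-disk SQLite table with a PRIMARY KEY,
    writing each row to dst only the first time it is seen."""
    with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as tmp:
        db = tmp.name
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE seen (row TEXT PRIMARY KEY)")
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            key = "\x1f".join(row)  # unit separator joins fields safely
            # INSERT OR IGNORE modifies 0 rows when the key already exists
            cur = con.execute("INSERT OR IGNORE INTO seen VALUES (?)", (key,))
            if cur.rowcount:
                writer.writerow(row)
    con.close()
    os.unlink(db)
```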
"There is no file path, because macOS lets you look up files by id. The hash values are cryptographic hashes truncated to 64 bits and reinterpreted as integers."
Is the author implying that APFS or HFS uses this method to calculate the file ID? I am unable to find any information regarding this. From what I understand, w.r.t APFS, the file ID is a combination of the inode OID and genID.
The hash values are part of his implementation and don’t have anything to do with APFS or HFS. The sentence after your quote says, “I picked the BLAKE3 hash function…” indicating it was something he chose, not a FS implementation detail.
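The truncate-and-reinterpret step is straightforward; here is a sketch of what it might look like. The article uses BLAKE3, which isn't in the Python standard library, so stdlib BLAKE2b stands in below - the truncation idea is the same. Storing the result as a signed 64-bit integer matches SQLite's native INTEGER column type.

```python
import hashlib

def hash64(data: bytes) -> int:
    """Cryptographic hash truncated to 64 bits, reinterpreted as a
    signed integer (fits SQLite's 8-byte INTEGER storage class).
    BLAKE2b stands in for the article's BLAKE3 here."""
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little", signed=True)
```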
With a b-tree you can do even better: instead of hashing entire files ahead of time, you can avoid hashing more than the minimum prefix required to conclude a file is unique, by building the b-tree lazily. I don't think it can be done with SQLite though, as it requires hashing the files while traversing the b-tree.
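The lazy-prefix idea can be sketched without a literal b-tree: repeatedly bucket candidate files by the hash of their next chunk, and stop hashing any file as soon as its bucket has size one. Files that stay together until EOF are duplicates. (This is my illustration of the principle, not the commenter's design; `CHUNK` is an arbitrary choice.)

```python
import hashlib

CHUNK = 1 << 16  # 64 KiB per probe; an assumed, tunable value

def find_unique(paths):
    """Yield files that are unique, hashing only as much prefix of each
    file as is needed to tell it apart from the others."""
    def split(group, offset):
        if len(group) == 1:
            yield group[0]           # alone in its bucket: provably unique
            return
        buckets = {}
        for path in group:
            with open(path, "rb") as f:
                f.seek(offset)
                chunk = f.read(CHUNK)
            if not chunk:
                continue             # hit EOF together: duplicates, drop
            key = hashlib.sha256(chunk).digest()
            buckets.setdefault(key, []).append(path)
        for bucket in buckets.values():
            yield from split(bucket, offset + CHUNK)
    yield from split(list(paths), 0)
```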
The other day I found out that not only do ZFS and Btrfs have a dedup feature, but XFS (which I use on my home partition) also supports a feature called reflinks, aka block sharing. So I threw duperemove at ~/.virtualenvs and ~/.cache/bazel and freed up a good 15 GB of disk space. Good stuff.
PostgreSQL is Turing-complete, as proven by implementations of a cyclic tag system[0] and Turing machine[1]. The Mandelbrot set[2] and travelling-salesman problem[3] have also been implemented in it.
Transact-SQL is also Turing-complete, as proven by a Brainfuck implementation[4].
With that, you can theoretically compute anything :).
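You don't need the full Turing-completeness machinery to compute things in SQL, of course - a recursive CTE already gives you iteration, and SQLite supports them too. A small taste (my own example):

```python
import sqlite3

# Recursive CTEs express general iteration in plain SQL.
# Here: the first ten Fibonacci numbers, computed entirely by SQLite.
con = sqlite3.connect(":memory:")
rows = con.execute("""
    WITH RECURSIVE fib(n, a, b) AS (
        SELECT 1, 0, 1
        UNION ALL
        SELECT n + 1, b, a + b FROM fib WHERE n < 10
    )
    SELECT a FROM fib ORDER BY n
""").fetchall()
print([r[0] for r in rows])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```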
Oh sweet. I've definitely used SQLite (in a zsh script) for file deduplication. Very simple, mostly just rows consisting of paths and file content hashes.
But partial file deduplication is something else...
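The whole-file version is indeed simple enough to sketch in a few lines - one table of (path, hash), and duplicate groups fall out of a GROUP BY. (A Python rendering of the zsh-script idea, with assumed table and file names.)

```python
import hashlib
import os
import sqlite3

def index_tree(root: str, db: str) -> list:
    """Hash every file under root into a SQLite table, then report
    groups of paths whose whole-file hashes collide."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS f (path TEXT PRIMARY KEY, hash TEXT)")
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            with open(p, "rb") as fh:
                h = hashlib.sha256(fh.read()).hexdigest()
            con.execute("INSERT OR REPLACE INTO f VALUES (?, ?)", (p, h))
    con.commit()
    return con.execute("""
        SELECT hash, GROUP_CONCAT(path) FROM f
        GROUP BY hash HAVING COUNT(*) > 1
    """).fetchall()
```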
I appreciate seeing this article that I wrote 4 years ago resurface :) Happy to answer any questions about it or the tech behind it.
We also use a SQLite database, and in a manner similar to the OP’s article. We use it to track which content-defined chunks are at which offsets in which files on disk. We deduplicate those chunks on the CDN so you have to download less data when updating the game, but on disk we need to recreate the files as originally uploaded because that’s how games load them.
We’ve expanded the use of this tech quite a bit since the article. For example, my team has started using the patcher internally to manage our 2-million-file vendored code and binaries repo, as a replacement for Git LFS of sorts. Cloning the repo from scratch became 10 times faster, and so did switching branches, especially for remote devs, since the files are served through CloudFront.
Some of the more interesting work we’ve done recently has been optimizing the core content-defined chunking algorithm. The GearHash rolling hash algorithm from FastCDC is decently fast, but we were able to improve chunking speeds by 5x with vectorized implementations and using AES-NI instructions in a clever way to avoid the GearHash table lookups. It can do 11GB/sec per core now using AVX-512 instructions.
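For readers unfamiliar with gear hashing, here is a toy scalar chunker in the FastCDC family - my own illustration of the mechanism the vectorized/AES-NI versions accelerate, with arbitrary MIN/MAX/mask parameters, nothing like production-tuned:

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # the per-byte lookup table
MASK = (1 << 13) - 1        # 13-bit mask -> ~8 KiB average chunks
MIN, MAX = 2048, 65536      # chunk size bounds

def chunks(data: bytes):
    """Yield content-defined chunks: a boundary is declared wherever the
    rolling gear hash's low bits are all zero. Because the hash depends
    only on the last ~64 bytes, boundaries resynchronize after an edit."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i + 1 - start
        if (length >= MIN and (h & MASK) == 0) or length >= MAX:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]
```

Unlike the fixed-block scheme, a one-byte insertion only disturbs the chunk it lands in; later chunk boundaries line back up, which is what makes the CDN-side dedup effective across game versions.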
Another thing we did recently was build a tool for repacking Unreal Engine Pak files so game assets maintain a consistent layout across versions, similar to what garaetjjte mentioned in a comment above about UnityFS files. This reduced the disk I/O needed to update VALORANT by over 90% and cut update times by half for players. The combination of content-defined chunking with tooling to make game data files friendlier to differential patching can make a huge difference.
This is quite amazing. I actually built something even crazier with SQLite, where I broke the files up into parts and then made hashes of the parts, so in total it would use less space than the sum of the files' sizes. I used this for a similarity engine where I would try to see how similar different files were.
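The similarity part of that idea can be sketched with block-hash sets and a Jaccard index (my illustration of the general approach, not the commenter's implementation):

```python
import hashlib

def block_signature(data: bytes, size: int = 4096) -> set:
    """Represent a file as the set of its block hashes; identical blocks
    across files collapse to one entry, which is where the space saving
    comes from."""
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}

def similarity(a: set, b: set) -> float:
    """Jaccard index over block hashes: 1.0 = same blocks, 0.0 = disjoint."""
    return len(a & b) / len(a | b) if a | b else 1.0
```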
Slightly off topic: there was a project I saw on GitHub that claimed to support system administration using relational tables - something like "everything in SQL". I thought it might be a cool idea.
https://xnux.eu/p-boot-demo/
I had to write a custom tool to do it efficiently during image creation.
[0] https://developer.valvesoftware.com/wiki/SteamPipe
[1] https://developer.valvesoftware.com/wiki/VPK
https://www.usenix.org/conferences/test-of-time-awards
But doesn't ZFS dedupe eat RAM for breakfast ? Double-digit GB RAM per TB data IIRC ?
An added bonus is that you can use SQL.
https://lib.rs/crates/dupe-krill#readme-nerding-out-about-th...
(Not (purely) a reference to the “oh hey Mark” meme, I’m vaguely acquainted with Mark from LUG days)
[0]: https://wiki.postgresql.org/wiki/Cyclic_Tag_System
[1]: https://blog.coelho.net/database/2013/08/17/turing-sql-1.htm...
[2]: https://wiki.postgresql.org/wiki/Mandelbrot_set
[3]: https://web.archive.org/web/20201111224603/http://assets.en....
[4]: https://stackoverflow.com/questions/900055/is-sql-or-even-ts...