rajatarya|3 years ago
Imagine you have a 500MB file (lastmonth.csv) where every day 1MB is changed.
With file-based deduplication every day 500MB will be uploaded, and all clones of the repo will need to download 500MB.
With block-based deduplication, only around the 1MB that changed is uploaded and downloaded.
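The savings described above can be sketched with a toy fixed-block deduplicator: hash each block of the file, and re-upload only blocks whose hash changed. (This is an illustrative sketch, not the actual implementation being discussed; the 1 MiB block size and helper names are my own choices.)

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # 1 MiB blocks (an illustrative choice)

def block_hashes(data: bytes) -> list:
    """Hash each fixed-size block of the file."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def changed_blocks(old: bytes, new: bytes) -> list:
    """Return indices of blocks whose content differs between versions."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or old_h[i] != h]

# Simulate an 8 MiB file where 1 MiB changes in the middle.
old = bytes(8 * BLOCK_SIZE)
new = old[:3 * BLOCK_SIZE] + b"\x01" * BLOCK_SIZE + old[4 * BLOCK_SIZE:]

dirty = changed_blocks(old, new)
print(dirty)                      # only block 3 differs
print(len(dirty) * BLOCK_SIZE)    # bytes to upload: 1 MiB, not 8 MiB
```

With file-based dedup the same edit would force the whole file to be re-uploaded, since the file-level hash changes no matter how small the edit is.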
unqueued|3 years ago
I actually wrote a script which I'm happy to share, that makes this much easier, and even lets you mount your bup repo over .git/annex/objects for direct access.
[1]: https://git-annex.branchable.com/walkthrough/using_bup/
[2]: https://github.com/bup/bup
AustinDev|3 years ago
I have a couple ~1TB repositories I've had the misfortune of working with using perforce in the past.
vvanders|3 years ago
I keep expecting someone to come along and dethrone it but as far as I can tell it hasn't been done yet. The combination of specific filetree views, drop-in proxies, UI-forward and checkout based workflow that works well with unmergeable binary assets still left Git LFS and other solutions in the dust.
+1 on testing this against a moderate size gamedev repo, that usually has some of the harder constraints where code + assets can be coupled and the art portion of a sync can easily top a couple hundred GB.
rajatarya|3 years ago
Do you have a repo you could try us out with?
We have tried a couple Unity projects (41% smaller due to deduplication) but not much from Unreal projects yet.
civilized|3 years ago
prirun|3 years ago
Block-based dedup can be done either with fixed block sizes or variable block sizes. For a database with fixed page sizes, a fixed block size matching the page size is most efficient. For a database with variable page sizes, a variable block size will work better, assuming the dedup "chunking" algorithm is fine-grained enough to detect the database page size. For example, if the db used a 4-6K variable page size and the dedup algo used a 1M variable block size, it could not isolate a single modified db page; it would instead store roughly 200 of the 4-6K pages surrounding the modified one, since that is about how many fit in a 1M block.
Your column vs row question depends on how the db stores data, whether key fields are changed, etc. The main dedup efficiency criteria are whether the changes are physically clustered together in the file or whether they are dispersed throughout the file, and how fine-grained the dedup block detection algorithm is.
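The "chunking" idea mentioned above can be sketched with a toy content-defined chunker: instead of cutting at fixed offsets, cut wherever a rolling hash of the recent bytes hits a target pattern, so boundaries follow the content. This is a simplified stand-in for real rolling-hash schemes (Rabin fingerprinting, buzhash); the hash, mask, and window values here are arbitrary toy choices, tuned only to produce small chunks for demonstration.

```python
MASK = 0x3F        # cut when low 6 hash bits are zero: ~64-byte average chunks (toy value)
MIN_CHUNK = 8      # minimum chunk length before a cut is allowed

def chunks(data: bytes) -> list:
    """Split data at content-defined boundaries using a toy rolling hash."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) & 0xFFFFFFFF   # simple multiplicative rolling hash
        if i - start >= MIN_CHUNK and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0          # reset at the boundary
    if start < len(data):
        out.append(data[start:])         # trailing partial chunk
    return out

# Deterministic pseudo-random test data.
data = bytes((i * i) % 251 for i in range(2000))
parts = chunks(data)
assert b"".join(parts) == data  # chunking is lossless
```

The point of content-defined boundaries is that an insertion or deletion only disturbs the chunks near the edit: once the rolling hash resynchronizes on the unchanged bytes, the remaining chunk boundaries (and hence their hashes) line up again, whereas fixed-offset blocks would all shift and look "changed" from the edit point onward. The finer the average chunk size, the closer the stored delta gets to the size of the actual edit, which is the granularity question raised above for db pages.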
rajatarya|3 years ago