rajatarya|3 years ago
Imagine you have a 500MB file (lastmonth.csv) where every day 1MB is changed.
With file-based deduplication every day 500MB will be uploaded, and all clones of the repo will need to download 500MB.
With block-based deduplication, only around the 1MB that changed is uploaded and downloaded.
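The savings described above can be sketched with a toy fixed-block deduplicator: hash each block of the file, and re-upload only blocks whose hash changed. (This is an illustrative sketch, not the actual implementation being discussed; the 1 MiB block size and helper names are my own choices.)

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # 1 MiB blocks (an illustrative choice)

def block_hashes(data: bytes) -> list:
    """Hash each fixed-size block of the file."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def changed_blocks(old: bytes, new: bytes) -> list:
    """Return indices of blocks whose content differs between versions."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or old_h[i] != h]

# Simulate an 8 MiB file where 1 MiB changes in the middle.
old = bytes(8 * BLOCK_SIZE)
new = old[:3 * BLOCK_SIZE] + b"\x01" * BLOCK_SIZE + old[4 * BLOCK_SIZE:]

dirty = changed_blocks(old, new)
print(dirty)                      # only block 3 differs
print(len(dirty) * BLOCK_SIZE)    # bytes to upload: 1 MiB, not 8 MiB
```

With file-based dedup the same edit would force the whole file to be re-uploaded, since the file-level hash changes no matter how small the edit is.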
unqueued|3 years ago
I actually wrote a script which I'm happy to share, that makes this much easier, and even lets you mount your bup repo over .git/annex/objects for direct access.
[1]: https://git-annex.branchable.com/walkthrough/using_bup/
[2]: https://github.com/bup/bup
AustinDev|3 years ago
I have a couple ~1TB repositories I've had the misfortune of working with using perforce in the past.
vvanders|3 years ago
I keep expecting someone to come along and dethrone it but as far as I can tell it hasn't been done yet. The combination of specific filetree views, drop-in proxies, UI-forward and checkout based workflow that works well with unmergeable binary assets still left Git LFS and other solutions in the dust.
+1 on testing this against a moderate size gamedev repo, that usually has some of the harder constraints where code + assets can be coupled and the art portion of a sync can easily top a couple hundred GB.
rajatarya|3 years ago
Do you have a repo you could try us out with?
We have tried a couple Unity projects (41% smaller due to deduplication) but not much from Unreal projects yet.
civilized|3 years ago
prirun|3 years ago
Block-based dedup can be done either with fixed block sizes or variable block sizes. For a database with fixed page sizes, a fixed block size matching the page size is most efficient. For a database with variable page sizes, a variable block size will work better, assuming the dedup "chunking" algorithm is fine-grained enough to detect the database page size. For example, if the db used a 4-6K variable page size and the dedup algo used a 1M variable block size, it could not isolate a single modified db page; it would instead store roughly 200 of the 4-6K pages surrounding the modified one, since that is about how many fit in a 1M block.
Your column vs row question depends on how the db stores data, whether key fields are changed, etc. The main dedup efficiency criteria are whether the changes are physically clustered together in the file or whether they are dispersed throughout the file, and how fine-grained the dedup block detection algorithm is.
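The "chunking" idea mentioned above can be sketched with a toy content-defined chunker: instead of cutting at fixed offsets, cut wherever a rolling hash of the recent bytes hits a target pattern, so boundaries follow the content. This is a simplified stand-in for real rolling-hash schemes (Rabin fingerprinting, buzhash); the hash, mask, and window values here are arbitrary toy choices, tuned only to produce small chunks for demonstration.

```python
MASK = 0x3F        # cut when low 6 hash bits are zero: ~64-byte average chunks (toy value)
MIN_CHUNK = 8      # minimum chunk length before a cut is allowed

def chunks(data: bytes) -> list:
    """Split data at content-defined boundaries using a toy rolling hash."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) & 0xFFFFFFFF   # simple multiplicative rolling hash
        if i - start >= MIN_CHUNK and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0          # reset at the boundary
    if start < len(data):
        out.append(data[start:])         # trailing partial chunk
    return out

# Deterministic pseudo-random test data.
data = bytes((i * i) % 251 for i in range(2000))
parts = chunks(data)
assert b"".join(parts) == data  # chunking is lossless
```

The point of content-defined boundaries is that an insertion or deletion only disturbs the chunks near the edit: once the rolling hash resynchronizes on the unchanged bytes, the remaining chunk boundaries (and hence their hashes) line up again, whereas fixed-offset blocks would all shift and look "changed" from the edit point onward. The finer the average chunk size, the closer the stored delta gets to the size of the actual edit, which is the granularity question raised above for db pages.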
rajatarya|3 years ago