Show HN: We scaled Git to support 1 TB repos
279 points | reverius42 | 3 years ago | xethub.com
Unlike Git LFS, we don’t just store the files. We use content-defined chunking and Merkle Trees to dedupe against everything in history. This allows small changes in large files to be stored compactly. Read more here: https://xethub.com/assets/docs/how-xet-deduplication-works
Today, XetHub works for 1 TB repositories, and we plan to scale to 100 TB in the next year. Our implementation is in Rust (client & cache + storage) and our web application is written in Go. XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. Even at 1 TB, we know downloading an entire repository is painful, so we built git-xet mount - which, in seconds, provides a user-mode filesystem view over the repo.
XetHub is available today (Linux & Mac today, Windows coming soon) and we would love your feedback!
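For readers curious what content-defined chunking looks like in practice, here is a minimal generic sketch in Python. The rolling hash, window, and mask here are arbitrary toy choices for illustration, not XetHub's actual algorithm or parameters:

```python
# Toy content-defined chunking (CDC): chunk boundaries are chosen by a
# rolling hash over the content, not by fixed offsets, so an edit only
# perturbs nearby chunks. Generic sketch, not XetHub's implementation.
import hashlib

def chunk_boundaries(data: bytes, mask: int = 0x3F, window: int = 16):
    """Return chunk end offsets wherever the rolling hash matches the mask."""
    h = 0
    boundaries = []
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF   # toy rolling hash
        if i >= window and (h & mask) == mask:
            boundaries.append(i + 1)
            h = 0                          # reset at each boundary
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries

def dedup_store(data: bytes, store: dict):
    """Split data into chunks; store each chunk once, keyed by content hash."""
    chunk_ids = []
    start = 0
    for end in chunk_boundaries(data):
        chunk = data[start:end]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)    # identical chunks stored once
        chunk_ids.append(digest)
        start = end
    return chunk_ids
```

Because boundaries depend on local content rather than fixed offsets, an insertion near the start of a file only changes the chunks around the edit; once the rolling window passes the edit, boundaries re-synchronize and the rest of the file dedupes against chunks already in the store.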
jrockway | 3 years ago
If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface, but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create objects proportional to O(n) directory nesting depth as Xet appears to. (Xet is very much like Git in that respect.)
The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.
Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)
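The incremental-reprocessing idea described above can be sketched in a few lines. This is a hypothetical illustration of the concept (content-keyed memoization of per-file work), not Pachyderm's actual API or internals:

```python
# Sketch: only reprocess inputs whose content changed since the last run,
# while still returning a full view of all outputs. Hypothetical example,
# not Pachyderm's real interface.
import hashlib

def run_pipeline(inputs: dict, previous_outputs: dict, process):
    """inputs: path -> content bytes for the current commit.
    previous_outputs: (path, content_hash) -> cached output from the last run.
    Returns (all outputs, list of paths actually reprocessed)."""
    outputs = {}
    reprocessed = []
    for path, content in inputs.items():
        key = (path, hashlib.sha256(content).hexdigest())
        if key in previous_outputs:
            outputs[key] = previous_outputs[key]   # unchanged: reuse
        else:
            outputs[key] = process(content)        # new or changed: recompute
            reprocessed.append(path)
    return outputs, reprocessed
```

Running the pipeline again on a commit where one file changed recomputes just that file, while the returned mapping still covers every input; that is the "full view of what would have happened" with only incremental compute spent.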
ilyt | 3 years ago
...isn't that just parsing `git diff --name-only A..B` though? "Process only the files that changed since the last commit" is an extremely simple problem to solve.
chubot | 3 years ago
How about cleaning up old versions?
JZL003 | 3 years ago
[1] When you run `git annex add`, it hashes the file and moves the original into a `.git/annex/data` folder under a content-addressable layout keyed by the hash, like git's object store. It then replaces the original file with a symlink to that hashed path. The file is marked read-only, so any command in any language that tries to write to it will error (you can always `git annex unlock` to make it writable again). Duplicated files simply point to the same hashed location. As long as you git push normally and back up `.git/annex/data`, you're fully version controlled, and you can share subsets of files as needed.
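The mechanism described above can be illustrated with a toy version. The store layout and plain SHA-256 key here are simplifications, not git-annex's real key format or locking behavior:

```python
# Toy git-annex-style "add": hash the file, move it into a content-addressed
# store, mark it read-only, and leave a symlink in its place. Duplicate
# content points at the same stored object. Simplified illustration only.
import hashlib
import os
import stat

def annex_add(path: str, store_dir: str = ".git/annex/data") -> str:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    target = os.path.join(store_dir, digest)
    if os.path.exists(target):
        os.remove(path)                    # duplicate: reuse the stored copy
    else:
        os.replace(path, target)           # move content into the store
        os.chmod(target, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only
    # replace the original file with a relative symlink to the stored object
    os.symlink(os.path.relpath(target, os.path.dirname(path) or "."), path)
    return digest
```

The read-only bit is what makes accidental writes fail loudly in any tool, and identical files collapse to a single stored object exactly as the comment describes.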
timsehn | 3 years ago
Dolt hasn't come up here yet, probably because we're focused on OLTP use cases, not MLOps, but we do have some customers using Dolt as the backing store for their training data.
https://github.com/dolthub/dolt
Dolt also scales to the 1 TB range and offers you full SQL query capabilities on your data and diffs.
V1ndaar | 3 years ago
As someone who'd love to put their data into a git-like system, this sounds pretty interesting. Aside from there being no tier for someone like me, who would maybe have a couple of repositories of size O(250 GB), it's unclear how e.g. bandwidth would work, and whether other people could simply mount and clone the full repo for free if they wanted.
rajatarya | 3 years ago
In general, we are thinking about usage-based pricing (which would include bandwidth and storage) - what are your thoughts on that?
Also, where would you be mounting your repos from? We have local caching options that can greatly reduce the overall bandwidth needed to support data center workloads.
TacticalCoder | 3 years ago
Is the Merkle tree used because it brings something other than deduplication, like chunk integrity verification or something like that?
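For what it's worth, a Merkle tree does give cheap integrity verification on top of dedup. A generic sketch (not XetHub's actual structure) of the property in question:

```python
# Generic Merkle tree: leaves are chunk hashes, internal nodes hash their
# children, and the single root commits to every chunk. Corrupting any
# chunk changes the root. Illustration only, not XetHub's real layout.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks) -> bytes:
    level = [h(c) for c in chunks]
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

Flipping one byte in any chunk changes its leaf hash and propagates to the root, so a client can verify an entire download against a single trusted root hash, and an individual chunk can be verified with a log-sized proof path rather than rehashing everything.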
unqueued | 3 years ago
And the DataLad project has neuroimaging repos that are tens of TB in size.
Consider whether you actually need to track differences in all of your files. Honestly, git-annex is one of the most powerful tools I have ever used. You can use git for tracking changes in text, but use a different system for tracking binaries.
I love how satisfying it is to be able to store the index for hundreds of gigs of files on a floppy disk if I wanted to.
polemic | 3 years ago
Kart (https://kartproject.org) is built on git to provide data version control for geospatial vector & tabular data. Per-row (feature & attribute) version control and the ability to collaborate with a team of people is sorely missing from those workflows. It's focused on geographic use-cases, but you can work with 'plain old tables' too, with MySQL, PostgreSQL and MSSQL working copies (you don't have to pick - you can push and pull between them).
culanuchachamim | 3 years ago
Why do you need 1 TB for repos? What do you store inside, besides code and some images?
lazide | 3 years ago
I personally would love to be able to store datasets next to code for regression testing, easier deployment, easier dev workstation spin up, etc.
frognumber | 3 years ago
The git userspace would need to be able to easily:
1. Not grab all files
2. Not grab the whole version history
... and that's more-or-less it. At that point, it'd do great with large files.
ziml77 | 3 years ago
I know that they're well within their rights to do this as they only ever offered subscription licensing for Semantic Merge, but that doesn't make it suck less to lose access.
COMMENT___ | 3 years ago
Besides other features, Subversion supports representation sharing. So adding new textual or binary files with identical data won’t increase the size of your repository.
I’m not familiar with ML data sets, but it seems that SVN may work great with them. It already works great for huge and small game dev projects.
chubot | 3 years ago
Can you sync to another machine without XetHub?
How about cleaning up old files?
amelius | 3 years ago
Also, why can't Git show me an accurate progress-bar while fetching?
reverius42 | 3 years ago
As for why git can't show you an accurate progress bar while fetching (specifically when using an extension like git-lfs or git-xet), this has to do with the way git extensions work -- each file gets "cleaned" by the extension through a Unix pipe, and the protocol for that is too simple to reflect progress information back to the user. In git-xet, we do write a percent-complete to stdout so you get some more info (but a real progress bar would be nice).