top | item 33969908

Show HN: We scaled Git to support 1 TB repos

279 points | reverius42 | 3 years ago | xethub.com | reply

I’ve been in the MLOps space for ~10 years, and data is still the hardest unsolved problem. Code is versioned using Git, data is stored somewhere else, and context often lives in a 3rd location like Slack or GDocs. This is why we built XetHub, a platform that enables teams to treat data like code, using Git.

Unlike Git LFS, we don’t just store the files. We use content-defined chunking and Merkle Trees to dedupe against everything in history. This allows small changes in large files to be stored compactly. Read more here: https://xethub.com/assets/docs/how-xet-deduplication-works
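
The dedupe scheme described here can be sketched in miniature. The snippet below is a toy illustration, not XetHub's implementation (which is FastCDC-based and written in Rust): it cuts chunks wherever a simple rolling hash hits a boundary condition, then stores each chunk once under its SHA-256.

```python
import hashlib

def cdc_chunks(data: bytes, mask: int = 0x3FF, window: int = 16):
    """Toy content-defined chunking: emit a chunk boundary wherever a
    simple rolling hash matches a mask. Real systems (e.g. FastCDC)
    use gear hashing plus minimum/maximum chunk sizes."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        if i - start >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedupe(store: dict, data: bytes):
    """Keep only chunks whose SHA-256 is unseen; return the recipe of
    hashes needed to reassemble the file."""
    recipe = []
    for chunk in cdc_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)
        recipe.append(key)
    return recipe
```

Because boundaries depend on content rather than fixed offsets, an insertion in the middle of a large file only disturbs the chunks around the edit; the rest of the recipe's keys already exist in the store.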

Today, XetHub works for 1 TB repositories, and we plan to scale to 100 TB in the next year. Our implementation is in Rust (client & cache + storage) and our web application is written in Go. XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. Even at 1 TB, we know downloading an entire repository is painful, so we built git-xet mount - which, in seconds, provides a user-mode filesystem view over the repo.

XetHub is available today (Linux & Mac today, Windows coming soon) and we would love your feedback!

Read more here:

- https://xetdata.com/blog/2022/10/15/why-xetdata

- https://xetdata.com/blog/2022/12/13/introducing-xethub

144 comments

[+] jrockway|3 years ago|reply
There are a couple of other contenders in this space. DVC (https://dvc.org/) seems most similar.

If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface, but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create objects proportional to O(n) directory nesting depth as Xet appears to. (Xet is very much like Git in that respect.)

The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.
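
That scheduling idea can be sketched with a toy example (hypothetical names, not Pachyderm's actual API, which tracks datums rather than bare paths): compare content hashes against the previous run's state and enqueue only inputs that are new or changed.

```python
import hashlib

def plan_jobs(prev_hashes: dict, inputs: dict):
    """Toy incremental scheduler: given last run's content hashes and
    the current inputs (path -> bytes), return the paths that need
    reprocessing plus the new hash state to save for next time."""
    jobs, hashes = [], {}
    for path, data in inputs.items():
        digest = hashlib.sha256(data).hexdigest()
        hashes[path] = digest
        if prev_hashes.get(path) != digest:  # new or changed input
            jobs.append(path)
    return jobs, hashes
```

Unchanged inputs keep their old outputs, so the full result set still looks as if everything had been reprocessed.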

Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)

[+] ylow|3 years ago|reply
We have found pointer files to be surprisingly efficient as long as you don't have to actually materialize them. (Git's internals are actually very well done.) Our mount mechanism avoids materializing pointer files, which makes it pretty fast even for repos with a very large number of files.
[+] ylow|3 years ago|reply
By the way, our mount mechanism has one very interesting novelty. It does not depend on a FUSE driver on Mac :-)
[+] ilyt|3 years ago|reply
> The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.

...isn't that just parsing git diff --name-only A..B though? "Process only files that changed since the last commit" is an extremely simple problem to solve.

[+] chubot|3 years ago|reply
Is DVC useful/efficient at storing container images (Docker)? As far as I remember they are just compressed tar files. Does the compression defeat its chunking / differential compression?

How about cleaning up old versions?

[+] JZL003|3 years ago|reply
I also have a lot of issues with versioning data. But look at git annex - it's free, self hosted and has a very easy underlying data structure [1]. So I don't even use the magic commands it has for remote data mounting/multi-device coordination, just backup using basic S3 commands and can use rclone mounting. Very robust, open source, and useful

[1] When you run `git annex add` it hashes the file and moves the original into a `.git/annex/data` folder, in a content-addressable layout keyed by the hash, like git. Then it replaces the original file with a symlink to this hashed file path. The file is marked read-only, so any command in any language that tries to write to it will error (you can always `git annex unlock` to make it writable). If you have duplicate files, they simply point to the same hashed location. As long as you git push normally and back up `.git/annex/data`, you're fully version controlled, and you can share subsets of files as needed
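
The layout described in [1] is easy to sketch (simplified, hypothetical paths; real git-annex uses key backends like SHA256E and stores objects under .git/annex/objects):

```python
import hashlib, os

def annex_add(repo: str, relpath: str) -> str:
    """Move a file into a content-addressed store and leave a symlink
    behind, roughly what `git annex add` does (paths simplified)."""
    src = os.path.join(repo, relpath)
    with open(src, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    store = os.path.join(repo, ".git", "annex", "data")
    os.makedirs(store, exist_ok=True)
    dst = os.path.join(store, digest)
    if os.path.exists(dst):
        os.remove(src)              # duplicate content: reuse the object
    else:
        os.replace(src, dst)
        os.chmod(dst, 0o444)        # read-only, so stray writes error out
    os.symlink(os.path.relpath(dst, os.path.dirname(src)), src)
    return digest
```

Two files with identical content end up as two symlinks to one stored object, which is the file-level dedupe the comment describes.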

[+] kspacewalk2|3 years ago|reply
Sounds like `git annex` is file-level deduplication, whereas this tool is block-level, but with some intelligent, context-specific way of defining how to split up the data (i.e. Content-Defined Chunking). For data management/versioning, that's usually a big difference.
[+] timsehn|3 years ago|reply
Founder of DoltHub here. One of my team pointed me at this thread. Congrats on the launch. Great to see more folks tackling the data versioning problem.

Dolt hasn't come up here yet, probably because we're focused on OLTP use cases, not MLOps, but we do have some customers using Dolt as the backing store for their training data.

https://github.com/dolthub/dolt

Dolt also scales to the 1TB range and offers you full SQL query capabilities on your data and diffs.

[+] ylow|3 years ago|reply
CEO/Cofounder here. Thanks! Agreed, we think data versioning is an important problem and we are at related, but opposite parts of the space. (BTW we really wanted gitfordata.com. Or perhaps we can split the domain? OLTP goes here, Unstructured data goes there :-) Shall we chat? )
[+] V1ndaar|3 years ago|reply
You say you support up to 1 TB repositories, but on your pricing page all I see is a free tier for up to 20 GB and one for teams. The latter doesn't have a price, only a contact option, and I assume it will likely be too expensive for an individual.

As someone who'd love to put their data into a git-like system, this sounds pretty interesting. Aside from not offering a tier for someone like me, who would maybe have a couple of repositories of size O(250GB), it's unclear how e.g. bandwidth would work, and whether other people could simply mount and clone the full repo for free if desired.

[+] rajatarya|3 years ago|reply
XetHub Co-founder here. We are still trying to figure out pricing and would love to understand what sort of pricing tier would work for you.

In general, we are thinking about usage-based pricing (which would include bandwidth and storage) - what are your thoughts for that?

Also, where would you be mounting your repos from? We have local caching options that can greatly reduce the overall bandwidth needed to support data center workloads.

[+] TacticalCoder|3 years ago|reply
What does a Merkle Tree bring here? (honest question) I mean: for content-based addressing of chunks (and hence deduplication of these chunks), a regular tree works too if I'm not mistaken (I may be wrong but I literally wrote a "deduper" splitting files into chunks and using content-based addressing to dedupe the chunks: but I just used a dumb tree).

Is the Merkle tree used because it brings something other than deduplication, like chunk integrity verification or something like that?
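
For illustration, a minimal Merkle root over chunk hashes (a sketch, not Xet's actual structure): beyond per-chunk dedupe, the tree gives each version a single root hash, lets two versions be diffed top-down (identical interior hashes mean an entire subtree can be skipped), and lets any chunk be verified against the root.

```python
import hashlib

def sha(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(chunks) -> bytes:
    """Root hash over a list of chunks: hash each chunk, then hash
    pairs of hashes upward until one root remains (odd levels carry
    a duplicated tail)."""
    level = [sha(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

A plain hash table of chunks is enough for dedupe on its own; the tree earns its keep when you want a compact per-version identity and cheap comparison between versions.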

[+] dandigangi|3 years ago|reply
One monorepo to rule them all, and in the darkness pull them. - Gandalf, probably
[+] irrational|3 years ago|reply
And in the darkness merge conflicts.
[+] Izmaki|3 years ago|reply
If I had to "version control" a 1 TB large repo - and assuming I wouldn't quit in anger - I would use a tool which is built for this kind of need and has been used in the industry for decades: Perforce.
[+] mentos|3 years ago|reply
I work in gamedev and think perforce is good but far from great. Would love to see someone bring some competition to the space maybe XetHub can.
[+] tinco|3 years ago|reply
So, you wouldn't consider using a new tool that someone developed to solve the same problem despite an older solution already existing? Your advice to that someone is to just use the old solution?
[+] ryneandal|3 years ago|reply
This was my thought as well. Perforce has its own issues, but is an industry standard in game dev for a reason: it can handle immense amounts of data.
[+] unqueued|3 years ago|reply
I have a 1.96 TB git repo: https://github.com/unqueued/repo.macintoshgarden.org-fileset (It is a mirror of a Macintosh abandoneware site)

  git annex info .
Of course, it uses pointer files for the binary blobs that are not going to change much anyway.

And the datalad project has neuro imaging repos that are tens of TB in size.

Consider whether you actually need to track differences in all of your files. Honestly git-annex is one of the most powerful tools I have ever used. You can use git for tracking changes in text, but use a different system for tracking binaries.

I love how satisfying it is to be able to store the index for hundreds of gigs of files on a floppy disk if I wanted.

[+] polemic|3 years ago|reply
There seem to be a lot of data version control systems built around ML pipelines or software development needs, but not so much on the sort of data editing that happens outside of software development & analysis.

Kart (https://kartproject.org) is built on git to provide data version control for geospatial vector & tabular data. Per-row (feature & attribute) version control and the ability to collaborate with a team of people is sorely missing from those workflows. It's focused on geographic use-cases, but you can work with 'plain old tables' too, with MySQL, PostgreSQL and MSSQL working copies (you don't have to pick - you can push and pull between them).

[+] culanuchachamim|3 years ago|reply
Maybe a silly question:

Why do you need 1 TB repos? What do you store inside, besides code and some images?

[+] dafelst|3 years ago|reply
Repositories for games are often larger than 1TB, and with things like UE5's Nanite becoming more viable, they're only going to get bigger.
[+] lazide|3 years ago|reply
A whole lot of images?

I personally would love to be able to store datasets next to code for regression testing, easier deployment, easier dev workstation spin up, etc.

[+] layer8|3 years ago|reply
Some docker images? ;)
[+] bastardoperator|3 years ago|reply
I actually encountered a 4TB git repo. After pulling all the binary shit out of it the repo was actually 200MB. Anything that promotes treating git like a filesystem is a bad idea in my opinion.
[+] frognumber|3 years ago|reply
Yes... and no. The git userspace is horrible for this. The git data model is wonderful.

The git userspace would need to be able to easily:

1. Not grab all files

2. Not grab the whole version history

... and that's more-or-less it. At that point, it'd do great with large files.

[+] wnzl|3 years ago|reply
Just in case you are wondering about alternatives: there is Unity’s Plastic https://unity.com/products/plastic-scm which happens to use bidirectional sync with git. I’m curious how this solution compares to it! I’ll definitely give it a try over the weekend!
[+] ziml77|3 years ago|reply
I was already upset about Codice Software pulling Semantic Merge and only making it available as an integrated part of Plastic SCM. Now that I see the reason such a useful tool was taken away was to stuff the pockets of a large company, I'm fuming.

I know that they're well within their rights to do this as they only ever offered subscription licensing for Semantic Merge, but that doesn't make it suck less to lose access.

[+] COMMENT___|3 years ago|reply
What about SVN?

Besides other features, Subversion supports representation sharing. So adding new textual or binary files with identical data won’t increase the size of your repository.

I’m not familiar with ML data sets, but it seems that SVN may work great with them. It already works great for huge and small game dev projects.

[+] Wojtkie|3 years ago|reply
Can I upload a full .pbix file to this and use it for versioning? If so, I'd use it in a heartbeat.
[+] ylow|3 years ago|reply
CEO/Cofounder here. We are file format agnostic and will happily take everything. Not too familiar with the needs around pbix, but please do try it out and let us know what you think!
[+] ledauphin|3 years ago|reply
The link takes me to a login page. It would be nice to see that fixed to somehow match the title.
[+] chubot|3 years ago|reply
Can it be used to store container images (Docker)? As far as I remember they are just compressed tar files. Does the compression defeat Xet's own chunking?

Can you sync to another machine without Xethub ?

How about cleaning up old files?

[+] ylow|3 years ago|reply
Yeah... the compression does defeat the chunking (your mileage may vary; we do see a small amount of dedupe in some experiments but never investigated it in detail). That said, we have experimental preprocessors/chunkers that are file-type specific, so we could potentially do something about tar.gz. Not something we have explored much yet.
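
The effect is easy to demonstrate with a toy experiment (zlib here as a stand-in for gzip): a one-byte edit at the start of the input changes the compressed stream almost from its first bytes, leaving content-defined chunking nothing to match.

```python
import zlib

def shared_prefix(a: bytes, b: bytes) -> int:
    """Length of the common byte prefix of two streams."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# One changed byte before compression: the compressed streams diverge
# early (the 'X' introduces a new literal symbol, shifting the Huffman
# tables and everything after them), so chunk-level dedupe between the
# two compressed payloads has little to find.
base = b"0123456789abcdef" * 4096
edit = b"X" + base[1:]
z_base, z_edit = zlib.compress(base), zlib.compress(edit)
```

This is also why format-aware preprocessing (decompressing before chunking) helps: the dedupe then runs over the stable uncompressed bytes.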
[+] amelius|3 years ago|reply
Does this fix the problem that Git becomes unreasonably slow when you have large binary files in the repo?

Also, why can't Git show me an accurate progress-bar while fetching?

[+] reverius42|3 years ago|reply
Mostly! (At the moment it doesn't fully fix the slowdown associated with storing large binary files, but it reduces it by 90-99%. We're working on getting that closer to 100% by moving even the Merkle Tree storage outside the git repo contents.)

As for why git can't show you an accurate progress bar while fetching (specifically when using an extension like git-lfs or git-xet), this has to do with the way git extensions work -- each file gets "cleaned" by the extension through a Unix pipe, and the protocol for that is too simple to reflect progress information back to the user. In git-xet, we do write a percent-complete to stdout so you get some more info (but a real progress bar would be nice).
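
The clean/smudge mechanism being described can be sketched as follows (a toy filter with a made-up pointer format, not git-xet's or git-lfs's actual formats):

```python
import hashlib

def clean(stream_in, stream_out, store: dict):
    """Toy 'clean' filter: swallow the real file contents, stash them
    by hash, and emit a small pointer file for git to commit. git runs
    this over a plain pipe, which is why there is no side channel for
    rich progress reporting."""
    data = stream_in.read()
    oid = hashlib.sha256(data).hexdigest()
    store[oid] = data
    pointer = f"version example-pointer-v1\noid sha256:{oid}\nsize {len(data)}\n"
    stream_out.write(pointer.encode())

def smudge(stream_in, stream_out, store: dict):
    """Toy 'smudge' filter: parse the pointer and restore the contents."""
    fields = dict(line.split(" ", 1)
                  for line in stream_in.read().decode().splitlines())
    oid = fields["oid"].split(":", 1)[1]
    stream_out.write(store[oid])
```

Wiring something like this up for real goes through .gitattributes (a filter attribute plus filter.<name>.clean/smudge config); git's long-running filter process protocol exists precisely because per-file pipes like this are so limited.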

[+] mattewong|3 years ago|reply
Thank you for posting this. Is there any way to access a Xet dataset via a URL (assuming the dataset owner has opted to share in that manner) so that, for example, one could visit a web page that contains some embedded code (JS, WASM etc) which pulls the Xet data into the page for processing?
[+] amadvance|3 years ago|reply
How is data split into chunks? Just curious.
[+] sesm|3 years ago|reply
They mention 'content-defined chunking', but as far as I understand it, that requires different chunking algorithms for different content types. Does it support plugins for chunking different file formats?
[+] ylow|3 years ago|reply
CEO/Cofounder here! Content defined chunking. Specifically a variation of FastCDC. We have a paper coming out soon with a lot more technical details.