
Git Large File Storage 1.0

276 points | kccqzy | 10 years ago | github.com

75 comments

[+] onionjake|10 years ago|reply
While I'm sure this will help some people use git for a use case that was previously impossible, I can't help but feel that it is a bad step overall for the git ecosystem.

It appears to centralize a distributed version control system with no option to continue using it in a distributed fashion. What would be wrong with fixing/enhancing the existing git protocols to enable shallow fetching of a commit (I want commit A, but without objects B and C, which are huge)? Git already fully supports working from a shallow clone (not the full history), so it wouldn't be too much of a stretch to make it work with shallow trees (I didn't fetch all of the objects).

I'm sure git LFS was the quickest way for github to support a use case, but I'm not sure it is the best thing for git.

[+] chkmate|10 years ago|reply
You could extend the git-lfs "pointer" file to support secure distributed storage using Convergent Encryption [1]. Right now, it's 3 lines:

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345
By adding an extra line containing the CHK-SHA256 (Content Hash Key), you could use a distributed p2p network like Freenet to store the large files, while keeping the data secure from other users (who don't have the OID).

    version https://git-lfs.github.com/spec/v2proposed
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    chk sha256:8f32b12a5943f9e0ff658daa9d22eae2ca24d17e23934d7a214614ab2935cdbb
    size 12345
That's how Freenet / Tahoe-LAFS / GnuNET work, basically.

[1] https://en.wikipedia.org/wiki/Convergent_encryption
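Sketched in Python, the scheme is: key = H(plaintext), ciphertext = E_key(plaintext), CHK = H(ciphertext). The hash-counter keystream below is just a stand-in for a real cipher like AES-CTR, and the function names are made up for illustration:

```python
import hashlib

def _keystream(key: bytes, length: int) -> bytes:
    # SHA-256 in counter mode as a toy stream cipher (use a real cipher in practice)
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def convergent_encrypt(plaintext: bytes):
    key = hashlib.sha256(plaintext).digest()      # the oid doubles as the decryption key
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, stream))
    chk = hashlib.sha256(ciphertext).hexdigest()  # public locator for the p2p network
    return key, ciphertext, chk

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    stream = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))
```

Because the key is derived from the content, everyone with the same file produces the same ciphertext and the same CHK (so the network can deduplicate storage), but only holders of the OID can decrypt.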

[+] kyrra|10 years ago|reply
Mercurial marks their Largefiles[0] support as a "feature of last resort", i.e. enabling it breaks the core concept of what a DVCS is, as you now have a central authority you need to talk to. But at the same time, many people who use Git and Hg use them with a central authoritative repo.

[0] https://www.mercurial-scm.org/wiki/LargefilesExtension

[+] ethomson|10 years ago|reply
git-lfs (and similar systems) split up the storage of objects into the regular git object store (for small files) and the large file storage. This allows you to configure how you get and push large files independently of how you get and push regular objects.

A shallow clone gives you some `n` commits of history (and the objects that they point to). Using LFS allows you to have some `m` commits worth of large files.

If you want a completely distributed workflow, and have infinite local storage and infinite bandwidth, you can fetch all the large files when you do a normal `git fetch`. However, most people don't, so you can tweak this to fetch only the parts of the large file history that you're interested in.

Indeed, this is a trade-off that requires some centralization, but so does your proposed solution of a shallow clone. This adds some subtlety and configurability around that.
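For example, git-lfs exposes that tuning through git config; a sketch (key names as I understand the git-lfs docs — check `git lfs fetch --help` for the authoritative list, and the values here are illustrative):

```
# .git/config — fetch only a recent window of large files
[lfs]
    fetchrecentrefsdays = 7          # branches updated in the last week
    fetchrecentcommitsdays = 3       # plus a few days of commits on each
    fetchexclude = "assets/video/*"  # paths to skip entirely
```

`git lfs fetch --recent` then pulls just that window instead of every large object reachable from your clone.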

[+] sytse|10 years ago|reply
I don't see why the centralization is different; can't you just download all the large files and upload them somewhere else?
[+] hokkos|10 years ago|reply
I would love to see a Google Piper for git that loads files through a virtual filesystem in FUSE as you access them.
[+] vvanders|10 years ago|reply
Without locking this is largely useless.

Usually large files are binary blobs (PSD, .ma, etc.) and it becomes incredibly easy to blow away someone's work by not pulling before every file you edit (or when two people edit at the same time).

As much as some people hate Perforce, that's exactly what it is set up to do. Plus their binary syncing algorithms are top-notch. We used to regularly pull a ~300GB art repo (for gamedev) in ~20 minutes.

Git is great for code but this seems like square peg, round hole to me.

[+] luckydude|10 years ago|reply
I've read the replies to vvanders and he's correct. With binaries you really want some sort of global locking (easy with a centralized system, hard with a distributed system).

I believe his (her?) point is that for a very large class of binaries there is just no upside in parallel development; one guy is going to squash the other guy's work. You want to serialize those efforts.

We don't have global locks yet but we know how to do them, just waiting for the right sales prospect to "force" us to do them. I'm 90% sure we could add them to BK in about a week.

[+] icebraining|10 years ago|reply
Git-annex solves this without locking or losing any of the versions: the actual files get different names (based on hashes of their contents), which are referenced by symlinks tracked by git. If two people edit the same file (pointing the symlink at different filenames), you get a regular git merge conflict.
[+] jayd16|10 years ago|reply
Except Perforce has been moving toward streams to compete with git's more desirable workflow, and you can't lock across streams anyway.
[+] sytse|10 years ago|reply
After we add support for Git LFS we plan to add web UI locking for files. This will allow you to lock files while browsing them and prevent others from uploading them.
[+] alkonaut|10 years ago|reply
Won't a central storage for the large files also make it straightforward to add locking functionality in a future version of git-lfs, or as an add-on? I agree it sure looks like an omission to have a VCS that is centralized and aimed at binary data without any locking functionality.
[+] bmurphy1976|10 years ago|reply
Translation: because it is not useful for you it's not useful for anybody else.

That's nonsense of course.

There are a lot of use cases where this would be very helpful without locking (e.g. jars/DLLs).

This is useful now, locking can come later. We don't have to solve every conceivable problem all at once. More progress is made in small incremental steps than big bang leaps.

[+] sytse|10 years ago|reply
This is great; we plan to ship alpha LFS support in GitLab CE & EE & .com in 8.1 or 8.2. That is in addition to the git-annex support that EE & .com have already had for a while: https://about.gitlab.com/2015/02/17/gitlab-annex-solves-the-...
[+] zertrin|10 years ago|reply
Is it safe to assume that GitLab's implementation of Git LFS will allow hosting the file storage server on premises, and potentially on another machine than the one running GitLab?
[+] nixgeek|10 years ago|reply
Will you be supporting both indefinitely or is there a plan to transition to a single well-supported solution for large files over the coming N releases?
[+] et1337|10 years ago|reply
I haven't been following the various Git large file solutions - can someone comment on how this implementation compares to git-annex or whatever else is out there?
[+] wscott|10 years ago|reply
The complaints about git-lfs make me think we need to tell people about BAM for BitKeeper.

http://www.bitkeeper.com/features_binary_asset_management_ba...

BAM works on a similar idea: instead of saving large files in the local repository, users can store them on a centralized server. This saves disk space and network transfer time.

However, unlike other solutions, BAM preserves the semantics of distributed development.

Instead of requiring a single or standardized set of servers, every user can have a different BAM server. Data is moved between servers automatically and on demand.

One group in an office might use a single BAM server for storing all their data close and locally. When another development group is started in India, they can use a server local to them. The binary assets will automatically transfer to the India server as commits are pulled between sites.

This allows centralized storage of your data and yet still supports having a team work while completely disconnected from the internet.
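A toy model of that on-demand movement between per-site servers (an illustration of the idea described above, not BitKeeper's actual code; all names are made up):

```python
class AssetServer:
    """Toy per-site binary asset server with fall-through to another site."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.store = {}           # oid -> file contents
        self.upstream = upstream  # another site's server to fall back to

    def put(self, oid, data):
        self.store[oid] = data

    def get(self, oid):
        if oid in self.store:
            return self.store[oid]
        if self.upstream is None:
            raise KeyError(oid)
        data = self.upstream.get(oid)  # fetch on demand from the other site...
        self.store[oid] = data         # ...and keep a local copy for next time
        return data
```

The first access from the remote site pays the transfer cost; after that the asset is served locally, which is the "automatically and on demand" behavior described above.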

[+] luckydude|10 years ago|reply
I've been using BAM for quite a while (I'm one of the developers of it). I use it to store my photos. I have 55GB of photos in there, and backing them up is just:

    cd photos
    bk push
Works pretty well. When my mom was still alive, we pushed them to her iMac and the screen saver was pointed at that directory. So she got to see the kids, and I got another backup.
[+] m0th87|10 years ago|reply
If for whatever reason lfs doesn't work for you, check out our solution to large file storage on git: https://github.com/dailymuse/git-fit

We wrote it because of various issues with the tools at the time, basically boiling down to wanting a dead simple solution.

I haven't tried lfs, but if it's anything like github's other software, then I'm sure it's substantially better than our tool.

[+] m12k|10 years ago|reply
I worked at Unity for a couple years and they are one of the biggest users of (and maintainers of) the Mercurial LargeFiles extension, so I was using that on a daily basis.

I agree that it should be a measure of last resort, but if you can't avoid working with big binary files, it makes the difference between a workflow that is a bit more cumbersome, and one that just grinds to a halt. Getting this functionality in git is great. And it'll mean a huge step forward in collaboration tools for game developers. You pretty much can't avoid big binary files when making games - and so far they've been stuck with SVN or Perforce (or the more adventurous ones are trying out PlasticSCM, which apparently is pretty nice too, but is proprietary and doesn't have a big ecosystem around it like git does). I hope this can lead to a boom of game developers using git.

[+] babuskov|10 years ago|reply
Yup. I'm using git for game source code, and I often hold off on committing graphics/music until the project is done. Any workaround outside git means you have two systems to manage, and it can get really painful.
[+] res0nat0r|10 years ago|reply
Where are the files actually stored? I hear "git lfs server" in the demo video, can this be changed? Can I init my repo and tell it to push all my objects to my own private s3 bucket, or can I only rely on some outside lfs server I don't control?
[+] thristian|10 years ago|reply
Not with Git LFS, since it's designed under the assumption that the hostname serving your repo over SSH is also the Git LFS server over HTTP.

However git-fat[1] is an alternative system that works in much the same way, but lets you configure where the files are stored.

[1]: https://github.com/cyaninc/git-fat

[+] kazinator|10 years ago|reply
This could be a corollary to P. Graham's "don't do anything that scales": don't do anything that involves plonking stupidly large files into version control.
[+] mindprince|10 years ago|reply
Good thing that Atlassian/BitBucket would also be supporting it: https://blog.bitbucket.org/2015/10/01/contributing-to-git-lf...

And very glad to read that they decided to contribute to this instead of working on their own solution for the same problem. Kudos!

[+] kannonboy|10 years ago|reply
The fact that both Atlassian and GitHub intended to unveil their own almost identical competing solutions, both built in Go, in consecutive sessions at the Git Merge conference (without either being aware of the other) is pretty hilarious.
[+] elcritch|10 years ago|reply
Git-lfs has been helpful for managing my repo of scientific research data. Hundreds of large-ish Excel files, PNGs, and HDF5 files add up quickly if you're doing lots of small edits.

There are still some warts (don't forget `git lfs init` after cloning!), but it's mostly fast and transparent. I also ponied up $5 a month for 50 gigs or so of LFS storage. Decent deal imho.

[+] lspears|10 years ago|reply
That video is hilarious! I wish we had more awesome videos like this for new technologies!
[+] k__|10 years ago|reply
Is there a solution that doesn't depend on external storage?

I have data that belongs with my source but is rather big, and I want it inside my repo.

[+] tyoverby|10 years ago|reply
What happens if someone that hasn't downloaded their command line tools tries to clone your repo? Will they get the big files too?
[+] deevus|10 years ago|reply
I believe they just get the references to the big files, not the files themselves.
[+] micmcg|10 years ago|reply
No, they will just get the small, metadata-containing pointer files.
[+] anotherevan|10 years ago|reply
Was I the only one who expected that bear to move on its own?