
Git Large File Storage 1.0

276 points | kccqzy | 10 years ago | github.com

75 comments

[+] onionjake|10 years ago|reply
While I'm sure this will help some people use git for a use case that was previously impossible, I can't help but feel that it is a bad step overall for the git ecosystem.

It appears to centralize a distributed version control system with no option to continue using it in a distributed fashion. What would be wrong with fixing/enhancing the existing git protocols to enable shallow fetching of a commit (I want commit A, but without objects B and C, which are huge)? Git already fully supports working from a shallow clone (not the full history), so it wouldn't be too much of a stretch to make it work with shallow trees (I didn't fetch all of the objects).

I'm sure git LFS was the quickest way for github to support a use case, but I'm not sure it is the best thing for git.

[+] chkmate|10 years ago|reply
You could extend the git-lfs "pointer" file to support secure distributed storage using Convergent Encryption [1]. Right now, it's 3 lines:

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345
By adding an extra line containing the CHK-SHA256 (Content Hash Key), you could use a distributed p2p network like Freenet to store the large files, while keeping the data secure from other users (who don't have the OID).

    version https://git-lfs.github.com/spec/v2proposed
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    chk sha256:8f32b12a5943f9e0ff658daa9d22eae2ca24d17e23934d7a214614ab2935cdbb
    size 12345
That's how Freenet / Tahoe-LAFS / GnuNET work, basically.

[1] https://en.wikipedia.org/wiki/Convergent_encryption
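Sketched in Python, the scheme is: key = H(plaintext), ciphertext = E_key(plaintext), CHK = H(ciphertext). The hash-counter keystream below is just a stand-in for a real cipher like AES-CTR, and the function names are made up for illustration:

```python
import hashlib

def _keystream(key: bytes, length: int) -> bytes:
    # SHA-256 in counter mode as a toy stream cipher (use a real cipher in practice)
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def convergent_encrypt(plaintext: bytes):
    key = hashlib.sha256(plaintext).digest()      # the oid doubles as the decryption key
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, stream))
    chk = hashlib.sha256(ciphertext).hexdigest()  # public locator for the p2p network
    return key, ciphertext, chk

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    stream = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))
```

Because the key is derived from the content, everyone with the same file produces the same ciphertext and the same CHK (so the network can deduplicate storage), but only holders of the OID can decrypt.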

[+] kyrra|10 years ago|reply
Mercurial marks their Largefiles[0] support as a "feature of last resort", i.e. enabling it breaks the core concept of what a DVCS is, as you now have a central authority you need to talk to. But at the same time, many people who use Git and Hg use them with a central authoritative repo.

[0] https://www.mercurial-scm.org/wiki/LargefilesExtension

[+] ethomson|10 years ago|reply
git-lfs (and similar systems) split up the storage of objects into the regular git object store (for small files) and the large file storage. This allows you to configure how you get and push large files independently of how you get and push regular objects.

A shallow clone gives you some `n` commits of history (and the objects that they point to). Using LFS allows you to have some `m` commits worth of large files.

If you want a completely distributed workflow, and have infinite local storage and infinite bandwidth, you can fetch all the large files when you do a normal `git fetch`. However, most people don't, so you can tweak this to fetch only the parts of the large file history that you're interested in.

Indeed, this is a trade-off that requires some centralization, but so does your proposed solution of a shallow clone. This adds some subtlety and configurability around that.
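For example, git-lfs exposes that tuning through git config; a sketch (key names as I understand the git-lfs docs — check `git lfs fetch --help` for the authoritative list, and the values here are illustrative):

```
# .git/config — fetch only a recent window of large files
[lfs]
    fetchrecentrefsdays = 7          # branches updated in the last week
    fetchrecentcommitsdays = 3       # plus a few days of commits on each
    fetchexclude = "assets/video/*"  # paths to skip entirely
```

`git lfs fetch --recent` then pulls just that window instead of every large object reachable from your clone.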

[+] sytse|10 years ago|reply
I don't see why the centralization is different; can't you just download all the large files and upload them somewhere else?
[+] hokkos|10 years ago|reply
I would love to see a Google Piper for git that loads files through a virtual filesystem in FUSE as you access them.
[+] vvanders|10 years ago|reply
Without locking this is largely useless.

Usually large files are binary blobs (PSD, .ma, etc.) and it becomes incredibly easy to blow away someone's work by not pulling before every file you edit (or when two people edit at the same time).

As much as some people hate Perforce, that's exactly what it is set up to do. Plus their binary syncing algorithms are top-notch. We used to regularly pull a ~300GB art repo (for gamedev) in ~20 minutes.

Git is great for code but this seems like square peg, round hole to me.

[+] luckydude|10 years ago|reply
I've read the replies to vvanders and he's correct. With binaries you really want some sort of global locking (easy with a centralized system, hard with a distributed system).

I believe his (her?) point is that for a very large class of binaries there is just no upside in parallel development; one guy is going to squash the other guy's work. You want to serialize those efforts.

We don't have global locks yet but we know how to do them, just waiting for the right sales prospect to "force" us to do them. I'm 90% sure we could add them to BK in about a week.

[+] icebraining|10 years ago|reply
Git-annex solves this without locking or losing any of the versions: the actual files get different names (based on hashes of their contents), which are referenced by symlinks tracked by git. If two people edit the same file (pointing the symlink at different filenames), you get a regular git merge conflict.
[+] jayd16|10 years ago|reply
Except Perforce has been moving toward streams to compete with git's more desirable workflow, and you can't lock across streams anyway.
[+] sytse|10 years ago|reply
After we add support for Git LFS we plan to add web UI locking for files. This will allow you to lock files while browsing them and prevent others from uploading them.
[+] alkonaut|10 years ago|reply
Won't a central storage for the large files also make it straightforward to add locking functionality in a future version of git-lfs, or as an add-on? I agree it sure looks like an omission to have a VCS that is centralized and aimed at binary data without any locking functionality.
[+] bmurphy1976|10 years ago|reply
Translation: because it is not useful for you it's not useful for anybody else.

That's nonsense of course.

There are a lot of use cases where this would be very helpful without locking (e.g. jars/DLLs).

This is useful now, locking can come later. We don't have to solve every conceivable problem all at once. More progress is made in small incremental steps than big bang leaps.

[+] sytse|10 years ago|reply
This is great; we plan to ship alpha LFS support in GitLab CE & EE & .com in 8.1 or 8.2. That is in addition to the git-annex support that EE & .com have already had for a while: https://about.gitlab.com/2015/02/17/gitlab-annex-solves-the-...
[+] zertrin|10 years ago|reply
Is it safe to assume that GitLab's implementation of Git LFS will allow hosting the file storage server on premises, and potentially on another machine than the one running GitLab?
[+] nixgeek|10 years ago|reply
Will you be supporting both indefinitely or is there a plan to transition to a single well-supported solution for large files over the coming N releases?
[+] et1337|10 years ago|reply
I haven't been following the various Git large file solutions - can someone comment on how this implementation compares to git-annex or whatever else is out there?
[+] wscott|10 years ago|reply
The complaints about git-lfs make me think we need to tell people about BAM for BitKeeper.

http://www.bitkeeper.com/features_binary_asset_management_ba...

BAM works on a similar idea: instead of saving large files in the local repository, users can store them on a centralized server. This saves disk space and network transfer time.

However, unlike other solutions, BAM preserves the semantics of distributed development.

Instead of requiring a single or standardized set of servers, every user can have a different BAM server. Data is moved between servers automatically and on demand.

One group in an office might use a single BAM server for storing all their data close and locally. When another development group is started in India, they can use a server local to them. The binary assets will automatically transfer to the India server as commits are pulled between sites.

This allows centralized storage of your data and yet still supports having a team work while completely disconnected from the internet.
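A toy model of that on-demand movement between per-site servers (an illustration of the idea described above, not BitKeeper's actual code; all names are made up):

```python
class AssetServer:
    """Toy per-site binary asset server with fall-through to another site."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.store = {}           # oid -> file contents
        self.upstream = upstream  # another site's server to fall back to

    def put(self, oid, data):
        self.store[oid] = data

    def get(self, oid):
        if oid in self.store:
            return self.store[oid]
        if self.upstream is None:
            raise KeyError(oid)
        data = self.upstream.get(oid)  # fetch on demand from the other site...
        self.store[oid] = data         # ...and keep a local copy for next time
        return data
```

The first access from the remote site pays the transfer cost; after that the asset is served locally, which is the "automatically and on demand" behavior described above.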

[+] luckydude|10 years ago|reply
I've been using BAM for quite a while (I'm one of the developers of it). I use it to store my photos. I have 55GB of photos in there, and backing them up is just:

    cd photos
    bk push
Works pretty well. When my mom was still alive, we pushed them to her iMac and the screen saver was pointed at that directory. So she got to see the kids, and I got another backup.
[+] m0th87|10 years ago|reply
If for whatever reason lfs doesn't work for you, check out our solution to large file storage on git: https://github.com/dailymuse/git-fit

We wrote it because of various issues with the tools at the time, basically boiling down to wanting a dead simple solution.

I haven't tried lfs, but if it's anything like github's other software, then I'm sure it's substantially better than our tool.

[+] m12k|10 years ago|reply
I worked at Unity for a couple years and they are one of the biggest users of (and maintainers of) the Mercurial LargeFiles extension, so I was using that on a daily basis.

I agree that it should be a measure of last resort, but if you can't avoid working with big binary files, it makes the difference between a workflow that is a bit more cumbersome, and one that just grinds to a halt. Getting this functionality in git is great. And it'll mean a huge step forward in collaboration tools for game developers. You pretty much can't avoid big binary files when making games - and so far they've been stuck with SVN or Perforce (or the more adventurous ones are trying out PlasticSCM, which apparently is pretty nice too, but is proprietary and doesn't have a big ecosystem around it like git does). I hope this can lead to a boom of game developers using git.

[+] babuskov|10 years ago|reply
Yup. I'm using git for game source code, and I often hold off on committing graphics/music until the project is done. Any workaround outside git means you have two systems to manage, and it can get really painful.
[+] res0nat0r|10 years ago|reply
Where are the files actually stored? I hear "git lfs server" in the demo video, can this be changed? Can I init my repo and tell it to push all my objects to my own private s3 bucket, or can I only rely on some outside lfs server I don't control?
[+] thristian|10 years ago|reply
Not with Git LFS, since it's designed under the assumption that the hostname serving your repo over SSH is also the Git LFS server over HTTP.

However git-fat[1] is an alternative system that works in much the same way, but lets you configure where the files are stored.

[1]: https://github.com/cyaninc/git-fat

[+] kazinator|10 years ago|reply
This could be a corollary to P. Graham's "don't do anything that scales": don't do anything that involves plonking stupidly large files into version control.
[+] mindprince|10 years ago|reply
Good thing that Atlassian/BitBucket would also be supporting it: https://blog.bitbucket.org/2015/10/01/contributing-to-git-lf...

And very glad to read that they decided to contribute to this instead of working on their own solution for the same problem. Kudos!

[+] kannonboy|10 years ago|reply
The fact that both Atlassian and GitHub intended to unveil their own almost identical competing solutions, both built in Go, in consecutive sessions at the Git Merge conference (without either being aware of the other) is pretty hilarious.
[+] elcritch|10 years ago|reply
Git-lfs has been helpful for managing my repo of scientific research data. Hundreds of large-ish Excel files, PNGs, and HDF5 files add up quickly if you're doing lots of small edits.

There are still some warts (don't forget `git lfs init` after cloning!), but it's mostly fast and transparent. I also ponied up $5 a month for 50 gigs or so of LFS storage. Decent deal imho.

[+] lspears|10 years ago|reply
That video is hilarious! I wish we had more awesome videos like this for new technologies!
[+] k__|10 years ago|reply
Is there a solution that doesn't depend on external storage?

I have data that belongs with my source but is rather big, and I want it inside my repo.

[+] tyoverby|10 years ago|reply
What happens if someone that hasn't downloaded their command line tools tries to clone your repo? Will they get the big files too?
[+] deevus|10 years ago|reply
I believe they just get the references to the big files, not the files themselves.
[+] micmcg|10 years ago|reply
No, they will just get the small, metadata-containing pointer files.
[+] anotherevan|10 years ago|reply
Was I the only one who expected that bear to move on its own?