While I'm sure this will help some people use git for a use case that was previously impossible, I can't help but feel that it's a bad step overall for the git ecosystem.
It appears to centralize a distributed version control system, with no option to continue using it in a distributed fashion. What would be wrong with fixing/enhancing the existing git protocols to enable shallow fetching of a commit (I want commit A, but without objects B and C, which are huge)? Git already fully supports working from a shallow clone (not the full history), so it wouldn't be too much of a stretch to make it work with shallow trees (I didn't fetch all of the objects).
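For comparison, the shallow-history half of this already exists in stock git; the hypothetical shallow-tree fetch would be the missing piece (the repository URL below is a placeholder):

```shell
# Shallow *history* already works: grab only the newest commit.
git clone --depth 1 https://example.com/big-repo.git

# And you can deepen on demand later, without re-cloning.
git -C big-repo fetch --deepen 10
```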
I'm sure git LFS was the quickest way for GitHub to support this use case, but I'm not sure it's the best thing for git.
You could extend the git-lfs "pointer" file to support secure distributed storage using Convergent Encryption [1]. Right now, it's 3 lines:
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
By adding an extra line containing the CHK-SHA256 (Content Hash Key), you could use a distributed p2p network like Freenet to store the large files, while keeping the data secure from other users (who don't have the OID).
version https://git-lfs.github.com/spec/v2proposed
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
chk sha256:8f32b12a5943f9e0ff658daa9d22eae2ca24d17e23934d7a214614ab2935cdbb
size 12345
That's how Freenet / Tahoe-LAFS / GNUnet work, basically.
[1] https://en.wikipedia.org/wiki/Convergent_encryption
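A toy sketch of the idea, assuming `sha256sum` and `openssl` are available (illustration only; a real implementation would use a vetted CHK construction rather than a fixed IV):

```shell
f=bigfile.psd

# 1. Derive the encryption key from the content itself.
key=$(sha256sum "$f" | cut -d' ' -f1)

# 2. Encrypt with that content-derived key. Identical plaintexts always
#    produce identical ciphertexts, so a p2p store can deduplicate them.
openssl enc -aes-256-ctr -K "$key" -iv 00000000000000000000000000000000 \
  -in "$f" -out "$f.enc"

# 3. Hash the ciphertext: this becomes the "chk" line in the pointer file,
#    i.e. the address peers can fetch without being able to decrypt it.
echo "chk sha256:$(sha256sum "$f.enc" | cut -d' ' -f1)"
```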
Mercurial marks their Largefiles[0] support as a "feature of last resort", i.e. enabling it breaks the core concept of what a DVCS is, as you now have a central authority you need to talk to. But at the same time, many people who use Git and Hg use them with a central authoritative repo.
[0] https://www.mercurial-scm.org/wiki/LargefilesExtension
git-lfs (and similar systems) split up the storage of objects into the regular git object store (for small files) and the large file storage. This allows you to configure how you get and push large files independently of how you get and push regular objects.
A shallow clone gives you some `n` commits of history (and the objects that they point to). Using LFS allows you to have some `m` commits worth of large files.
If you want a completely distributed workflow, and have infinite local storage and infinite bandwidth, you can fetch all the large files when you do a normal `git fetch`. However, most people don't, so you can tweak this to get only parts of the local file history that you're interested in.
Indeed this is a trade-off that requires some centralization, but so does your proposed solution of a shallow clone. This adds some subtlety and configurability around that.
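That configurability is exposed through git-lfs config keys; for example (the values and paths are illustrative):

```shell
# Only fetch large files referenced by commits from the last week...
git config lfs.fetchrecentcommitsdays 7

# ...and only for paths this machine actually needs.
git config lfs.fetchinclude "assets/textures/**"
git config lfs.fetchexclude "*.psd"

# Fetch just that window of LFS objects.
git lfs fetch --recent
```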
Usually large files are binary blobs (PSDs, .ma files, etc.), and it becomes incredibly easy to blow away someone's work by not pulling before every file you edit (or when two people edit at the same time).
As much as some people hate Perforce, that's exactly what it is set up to do. Plus its binary syncing algorithms are top-notch. We used to regularly pull a ~300GB art repo (for gamedev) in ~20 minutes.
Git is great for code but this seems like square peg, round hole to me.
I've read the replies to vvanders and he's correct. With binaries you really want some sort of global locking (easy with a centralized system, hard with a distributed system).
I believe his (her?) point is that for a very large class of binaries there is just no upside in parallel development; one guy is going to squash the other guy's work. You want to serialize those efforts.
We don't have global locks yet but we know how to do them, just waiting for the right sales prospect to "force" us to do them. I'm 90% sure we could add them to BK in about a week.
Git-annex solves this without locking or losing any of the versions - the actual files get different names (based on hashes of their contents), which are referenced by a symlink tracked by git. If two people edit the same file - pointing the symlink at different filenames - you get a regular git merge conflict.
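Mechanically, the trick looks something like this hand-rolled sketch (not actual git-annex commands; the directory name is made up):

```shell
# Move the real bytes to a content-addressed name...
hash=$(sha256sum bigfile.psd | cut -d' ' -f1)
mkdir -p .annex-objects
mv bigfile.psd ".annex-objects/SHA256-$hash"

# ...and track only a small symlink in git. Two divergent edits become two
# different link targets, i.e. an ordinary one-line merge conflict.
ln -s ".annex-objects/SHA256-$hash" bigfile.psd
```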
After we add support for Git LFS we plan to add web UI locking for files. This will allow you to lock files when browsing them and prevent others from uploading them.
Won't central storage for the large files also make it straightforward to add locking functionality in a future version of git-lfs, or as an add-on? I agree it sure looks like an omission to have a VCS that is centralized and aimed at binary data without any locking functionality.
Translation: because it is not useful for you it's not useful for anybody else.
That's nonsense of course.
There are a lot of use cases where this would be very helpful without locking (e.g. JARs/DLLs).
This is useful now, locking can come later. We don't have to solve every conceivable problem all at once. More progress is made in small incremental steps than big bang leaps.
Is it safe to assume that GitLab's implementation of Git LFS will allow hosting the file storage server on premises, and potentially on another machine than the one running GitLab?
Will you be supporting both indefinitely or is there a plan to transition to a single well-supported solution for large files over the coming N releases?
I haven't been following the various Git large file solutions - can someone comment on how this implementation compares to git-annex or whatever else is out there?
http://www.bitkeeper.com/features_binary_asset_management_ba...
BAM works on a similar idea: instead of saving large files in the local repository, users can save them on a centralized server. This saves disk space and network transfer time.
However, unlike other solutions, BAM preserves the semantics of distributed development.
Instead of requiring a single or standardized set of servers, every user can have a different BAM server. Data is moved between servers automatically and on demand.
One group in an office might use a single BAM server for storing all their data close and locally. When another development group is started in India, they can use a server local to them. The binary assets will automatically transfer to the India server as commits are pulled between sites.
This allows centralized storage of your data and yet still supports having a team work while completely disconnected from the internet.
I've been using BAM for quite a while (I'm one of the developers of it). I use it to store my photos. I have 55GB of photos in there, and backing them up is:
cd photos
bk push
Works pretty well. When my mom was still alive we pushed them to her iMac, and the screen saver was pointed at that directory. So she got to see the kids and I got another backup.
I worked at Unity for a couple years and they are one of the biggest users of (and maintainers of) the Mercurial LargeFiles extension, so I was using that on a daily basis.
I agree that it should be a measure of last resort, but if you can't avoid working with big binary files, it makes the difference between a workflow that is a bit more cumbersome, and one that just grinds to a halt. Getting this functionality in git is great. And it'll mean a huge step forward in collaboration tools for game developers. You pretty much can't avoid big binary files when making games - and so far they've been stuck with SVN or Perforce (or the more adventurous ones are trying out PlasticSCM, which apparently is pretty nice too, but is proprietary and doesn't have a big ecosystem around it like git does). I hope this can lead to a boom of game developers using git.
Yup. I'm using git for game source code, and I often hold off on committing graphics/music until the project is done. Any workaround outside git means you have two systems to manage, and it can get really painful.
Not sure if they were working together with GitHub on this, but Microsoft also announced today that Visual Studio Online Git repos now support Git-LFS with unlimited free storage:
http://blogs.msdn.com/b/visualstudioalm/archive/2015/10/01/a...
Where are the files actually stored? I hear "git lfs server" in the demo video, can this be changed? Can I init my repo and tell it to push all my objects to my own private s3 bucket, or can I only rely on some outside lfs server I don't control?
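It can be changed: newer git-lfs clients honor a committed `.lfsconfig` file that overrides the endpoint per repo. (The URL below is a placeholder; note the client speaks the LFS HTTP API, so a raw S3 bucket would still need a small LFS server in front of it.)

```shell
# Point everyone who clones this repo at your own LFS server.
git config -f .lfsconfig lfs.url "https://lfs.example.com/my-org/my-repo"
git add .lfsconfig
```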
This could be a corollary to P. Graham's "do things that don't scale": don't do anything that involves plonking stupidly large files into version control.
The fact that both Atlassian and GitHub intended to unveil their own almost identical competing solutions, both built in Go, in consecutive sessions at the Git Merge conference (without either being aware of the other) is pretty hilarious.
Git-lfs has been helpful for managing my repo of scientific research data. Hundreds of large-ish excel files, pngs, and hdf5 add up quickly if you're doing lots of small edits.
There are still some warts (don't forget `git lfs init` after cloning!), but it's mostly fast and transparent. I also ponied up $5 a month to get 50 gigs or so of LFS storage. Decent deal imho.
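For a repo like that, the tracking rules that `git lfs track` writes into `.gitattributes` look like:

```
*.hdf5 filter=lfs diff=lfs merge=lfs -text
*.png  filter=lfs diff=lfs merge=lfs -text
*.xlsx filter=lfs diff=lfs merge=lfs -text
```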
They do have a reference implementation of the serverside here: https://github.com/github/lfs-test-server - though they themselves don't consider it production ready. But I'm sure it'll either get there in time, or another open source implementation will rise to the challenge (cf. syste's comment about GitLab planning support for this: https://news.ycombinator.com/item?id=10313495 )
m0th87 | 10 years ago:
We wrote it because of various issues with tools at the time, basically boiling down to an inclination to have a dead simple solution.
I haven't tried lfs, but if it's anything like github's other software, then I'm sure it's substantially better than our tool.
rch | 10 years ago:
https://www.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Diff
And perhaps GridFTP:
http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp...
thristian | 10 years ago:
However git-fat[1] is an alternative system that works in much the same way, but lets you configure where the files are stored.
[1]: https://github.com/cyaninc/git-fat
mindprince | 10 years ago:
And very glad to read that they decided to contribute to this instead of working on their own solution for the same problem. Kudos!
k__ | 10 years ago:
I have data that belongs with my source, but is rather big, and I want it inside my repo.