
Git partial clone lets you fetch only the large file you need

229 points | moyer | 6 years ago | about.gitlab.com

86 comments


beagle3|6 years ago

There is one more piece of the puzzle needed to make git perfect for every use case I can think of: store large files as a list of blobs, broken down by some rolling hash a la rsync/borg/bup.

That would e.g. make it reasonable to check in virtual machine images or iso images into a repository. Extra storage (and by extension, network bandwidth) would be proportional to change size.

git has delta compression for text as an optimization, but it's not used on big binary files and isn't even done online (only when making a pack). This would provide it online for large files.

Junio posted a patch that did that ages ago, but it was pushed back until after the sha1->sha256 extension.

pas|6 years ago

Do ISOs and other large blob types support only partial (block) modification? Wouldn't all subsequent blocks change too?

derefr|6 years ago

Has anyone used Git submodules to isolate large binary assets into their own repos? Seems like the obvious solution to me. You already get fine-grained control over which submodules you initialize. And, unlike Git LFS, it might be something you’re already using for other reasons.
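A minimal sketch of that pattern using throwaway local repos (all names and paths here are made up for the demo): the binaries live in their own repo, and a fresh clone of the main repo leaves them un-fetched until someone explicitly initializes the submodule.

```shell
#!/bin/sh
# Sketch: large binaries isolated in an "assets" submodule (demo names only).
set -e
cd "$(mktemp -d)"

# Repo holding the large binary assets
git init -q assets
(cd assets && head -c 1024 /dev/zero > model.bin && git add . \
  && git -c user.email=a@b.c -c user.name=demo commit -qm 'big assets')

# Main code repo, with the assets repo attached as a submodule
git init -q main
(cd main && echo 'print("hello")' > app.py && git add . \
  && git -c user.email=a@b.c -c user.name=demo commit -qm 'code' \
  && git -c protocol.file.allow=always submodule add "$PWD/../assets" assets \
  && git -c user.email=a@b.c -c user.name=demo commit -qm 'add assets submodule')

# A fresh clone gets only the code; the assets directory stays empty...
git clone -q main work
cd work
ls assets    # empty: submodule not initialized

# ...until someone who actually needs the binaries asks for them:
git -c protocol.file.allow=always submodule update --init assets
ls assets    # now contains model.bin
```

(The `protocol.file.allow=always` bits are only needed because the demo uses local file paths; with real hosted repos you'd drop them.)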

jniedrauer|6 years ago

Using submodules requires that everyone on your team has at least a vague idea of what's going on and how not to foot-gun themselves. That's hard enough with git itself. I don't think I've ever seen submodules used without becoming a major pain point.

matheusmoreira|6 years ago

The problem with git submodules is they can't be used like a hyperlink to another repository. Updating the submodule requires updating the superproject as well. The new commits are invisible to the superproject until that is done.

It'd be great if they worked like Python's editable package installations.

jfkebwjsbx|6 years ago

Submodules are almost always the wrong answer. If you need to version huge files, use Git LFS.

Ididntdothis|6 years ago

I have tried submodules, but it’s way too easy to shoot yourself in the foot. Not very sustainable in a team with different levels of git knowledge.

lettergram|6 years ago

I’ve done that. Especially if you want specific versions of data to build ML models, this makes a nice audit log for reproducibility.

vvanders|6 years ago

Also known as workspace views in P4.

It's interesting to see the wheel reinvented. We used to run a 500 GB art sync / 200 GB code sync with a ~2 TB back-end repo back when I was in gamedev. P4 also has proper locking; it is really the right tool if you've got large assets that need to be coordinated and versioned.

Only downside of course is that it isn't free.

dijit|6 years ago

> Only downside of course is that it isn't free.

Another downside is that it consumes insane resources (our servers have dozens of TiB of RAM, with huge NVMe-based storage arrays directly attached).

Another downside is that you have to maintain connection to p4 to do any VCS operations (stashing included).

Another downside is that branches are very "expensive" (often taking days) and are impossible to reconcile. We never re-merge to MAIN.

jasondclinton|6 years ago

This kind of comment isn't helpful. Of course, there have been ways to copy large files around since there were networks. What's new in this protocol enhancement is that this works within the context of a Merkle tree-based technology (upon which all DVCS's are based). To use your analogy, yes this is a wheel but it's built with rubber instead of wood and iron.

01100011|6 years ago

I love P4 for just working, but I absolutely can't stand the limited shelving ability. I ended up writing a helper program that lets me shuffle local changes off to a git repo just so I could manage working on several overlapping changelists. Perforce would be so much more usable if they would include this sort of basic functionality right out of the box. The thing git gets right is that you often need to juggle several threads of change at the same time, and those threads may have complex branching as you try out different approaches and combine the best pieces at the end.

jfkebwjsbx|6 years ago

P4 was great for its time if you could pay for it, but it is definitely not a competitor anymore.

Git LFS has been just fine for multiterabyte repositories for years.

jayd16|6 years ago

P4 is fine on paper. I just wish the client didn't crash and the server didn't lock up as often as they do. P4V is a mess.

scarecrow112|6 years ago

This is interesting and could be a savior for Machine Learning (ML) engineering teams. In a typical ML workflow, there are three main entities to be managed:

1. Code
2. Data
3. Models

Systems like Data Version Control (DVC) [1] are useful for versioning 2 & 3. DVC improves on usability by residing inside the project's main git repo while maintaining versions of the data/models on a remote. With Git partial clone, it seems like the gap between 1 and 2/3 could be reduced even further.

[1] - https://dvc.org/
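For context, the DVC workflow looks roughly like this (paths made up; requires the `dvc` tool from the link above and a configured remote):

```shell
# Inside an existing git repo
dvc init                  # sets up .dvc/ metadata, which git tracks
dvc add data/train.csv    # moves the file into DVC's cache, writes a small .dvc pointer
git add data/train.csv.dvc .gitignore
git commit -m "track training data with DVC"
dvc push                  # uploads the actual data to the configured remote
```

So git versions the tiny pointer files, while the heavy data rides alongside on ordinary object storage.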

itroot|6 years ago

Also --reference (or --shared) is a good parameter to speed up cloning (for builds, for example) if you have the repository cached somewhere else. I used it a long time ago when I was working on a system that required cloning 20-40 repos to build. This approach decreased clone times by an order of magnitude.
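For example (throwaway local repos here just to show the mechanics; with a real remote you'd pass its URL instead):

```shell
#!/bin/sh
# Sketch: reusing a local cache repository to speed up clones (demo paths only).
set -e
cd "$(mktemp -d)"

# Stand-in for the upstream repository
git init -q upstream
(cd upstream && seq 1 10000 > data.txt && git add . \
  && git -c user.email=a@b.c -c user.name=demo commit -qm 'initial')

# One-time cache clone, e.g. kept on the build machine
git clone -q --mirror upstream cache.git

# Subsequent clones borrow objects from the cache instead of fetching them again
git clone -q --reference "$PWD/cache.git" upstream build1
test -f build1/.git/objects/info/alternates && echo "objects borrowed from cache"
```

Note that a `--reference` clone depends on the cache repo staying around; add `--dissociate` if you need the clone to stand alone afterwards.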

mikepurvis|6 years ago

Do you actually need clones in that scenario? I worked on a build system that grabbed source from several hundred repos at the starting point, and it turned out to be way faster to just grab it all as tarballs with aria2c.

andrewshadura|6 years ago

Careful: with extra-large repositories it actually slows down the cloning while, obviously, significantly reducing the space usage.

microtherion|6 years ago

That seems quite useful, though Git LFS mostly does the job.

One of my biggest remaining pain points is resumable clone/fetch. I find it near impossible to clone large repos (or fetch, if there were lots of new commits) over a slow, unstable link, so I almost always end up cloning a copy to a machine closer to the repo and rsyncing it over to my machine.
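One workaround along those lines that I'm aware of: pack the repo into a single `git bundle` file, transfer that with any resumable tool (rsync --partial, wget -c, etc.), and clone from the bundle. A throwaway local repo to show the mechanics:

```shell
#!/bin/sh
# Sketch: bundle-based transfer so the copy step is resumable (demo repo only).
set -e
cd "$(mktemp -d)"

# Stand-in for the big remote repository
git init -q big
(cd big && seq 1 100000 > history.txt && git add . \
  && git -c user.email=a@b.c -c user.name=demo commit -qm 'lots of data')

# On the machine near the repo: pack all refs into one plain file
git -C big bundle create ../repo.bundle --all

# Move repo.bundle over the flaky link with a resumable tool, then clone it:
git clone -q repo.bundle restored
git -C restored log --all --oneline | head -1
```

Incremental updates work too: a bundle can be created from a rev range and fetched from like any other remote.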

hinkley|6 years ago

What’s your take on this line?

> Partial Clone is a new feature of Git that replaces Git LFS and makes working with very large repositories better by teaching Git how to work without downloading every file.

shaklee3|6 years ago

This is great. We use git lfs extensively, and one of the biggest complaints we have is that users have to clone 7GB of data just to get the source files. There's a workaround where you don't enter your username and password for the lfs repo and just let it time out, but that's a kludge.

elephantum|6 years ago

There’s an option for that: GIT_LFS_SKIP_SMUDGE=1 git clone SERVER-REPOSITORY
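And if a specific LFS-tracked file is needed later, it can be fetched on its own rather than pulling all of the data (the URL and path here are placeholders):

```shell
# Skip downloading LFS objects at clone time (pointer files only)
GIT_LFS_SKIP_SMUDGE=1 git clone https://example.com/big-repo.git
cd big-repo

# Later, materialize only the LFS files you actually need
git lfs pull --include="assets/textures/*.png"
```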

danbolt|6 years ago

In the AAA games industry git has been a bit slower on the uptake (although that’s changing quickly) as large warehouses of data are often required (eg: version history of video files, 3D audio, music, etc.). It’s nice to see git have more options for this sort of thing.

Keverw|6 years ago

Surprised this new idea doesn’t support object storage. Sounds like Git LFS would still be the right way to go for repos with assets for games like meshes, sounds, etc.

However, I’ve heard many studios use Perforce instead. Not being open source is a downside to some, but I don’t really know too much about it personally.

Then if working with a lot of non-code files, it sounds like some solutions have locking. I guess two people couldn’t edit the same Blender or PSD file at the same time and then merge them later on.

Kinda wouldn’t surprise me if some companies actually run multiple versioning control systems. Code on one system, game assets on another.

jfkebwjsbx|6 years ago

Git LFS has been a thing for years, though.

jniedrauer|6 years ago

This could actually be a really good solution to the maximum supported size of a Go module. If you place a go.mod in the root of your repo, then every file in the repo becomes part of the module. There's also a hardcoded maximum size for a module: 500M. Problem is, I've got 1G+ of vendored assets in one of my repos. I had to trick Go into thinking that the vendored assets were a different Go module[0]. Go would have to add support for this, but it would be a pretty elegant solution to the problem.

[0]: https://github.com/golang/go/issues/37724

lima|6 years ago

That does sound like a "you're holding it wrong" issue. As one of the Go team members pointed out, defining a separate module is not a hack, but the intended way of doing it.

How would a partial checkout help?

krupan|6 years ago

I started a project recently, and for the first time ever I've wanted to keep large files in my repo. I looked into git LFS and was disappointed to learn that it requires either third-party hosting or setting up a git LFS server myself. I looked into git annex and it seems decent. This, once it is ready for prime time, will hopefully be even better.

nikivi|6 years ago

Is it possible given a git repo (hosted on say GitHub) to only 'clone' (download) certain files from it? Without `.git`

bspammer|6 years ago

Short answer is, not easily: https://stackoverflow.com/a/14610427

You can get the most recent tree for a repository (no history, just the current state of the repo) with `git clone --depth=1`. That's often good enough for slow connections.
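With the partial clone feature from the article, you can get closer to "only certain files" on servers that support it (GitHub does): skip all blob downloads up front, then narrow the working tree so only the blobs you actually check out get fetched. Repo URL and path here are placeholders, and you still get a `.git` directory (just a much smaller one):

```shell
# Partial clone: fetch commits and trees, but no file contents up front;
# --sparse starts with only the top-level files checked out
git clone --filter=blob:none --sparse https://github.com/user/repo.git
cd repo

# Widen the checkout to the directories you care about;
# only those paths' blobs are downloaded on demand
git sparse-checkout set docs
```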

vicosity|6 years ago

I'm still unconvinced. Will this provide a user-friendly approach to managing design assets?

madsbuch|6 years ago

My impression is that it will provide the normal git experience for managing design assets, i.e. with this there should be no need for additional tooling. If it works, that would be so great!

piliberto|6 years ago

> One reason projects with large binary files don't use Git is because, when a Git repository is cloned, Git will download every version of every file in the repository.

Wrong? There's a --depth option for the git fetch command which allows the user to specify how many commits they want to fetch from the repository.

ComputerGuru|6 years ago

depth is broken; it cannot be used dependably for submodules/recursive submodules because most hosts will refuse to serve unadvertised refs. We learned this the hard way. Or maybe it is submodules that are broken. Or git itself.

shaklee3|6 years ago

depth just lets you control the amount of history. It will not let you exclude unwanted files from the commits you do fetch. So while that statement was not accurate, excluding files is not what --depth is intended for anyway.

sewer_bird|6 years ago

Yes, but 95% of devs, even fairly talented ones, don't really know how to use Git.

smitty1e|6 years ago

In AWS, it's worth considering putting those large files in an S3 bucket.