Announcing GVFS: Git Virtual File System

[+] greg7mdp|9 years ago|reply

This is similar to what Google uses internally. See http://cacm.acm.org/magazines/2016/7/204032-why-google-store...:

"Most developers access Piper through a system called Clients in the Cloud, or CitC, which consists of a cloud-based storage backend and a Linux-only FUSE13 file system. Developers see their workspaces as directories in the file system, including their changes overlaid on top of the full Piper repository. CitC supports code browsing and normal Unix tools with no need to clone or sync state locally. Developers can browse and edit files anywhere across the Piper repository, and only modified files are stored in their workspace. This structure means CitC workspaces typically consume only a small amount of storage (an average workspace has fewer than 10 files) while presenting a seamless view of the entire Piper codebase to the developer."

This is a very powerful model when dealing with large code bases, as it solves the issue of downloading all the code to each client. Kudos to Microsoft for open sourcing it, and under the MIT license no less.

[+] krupan|9 years ago|reply

Holy cow, it sounds like they reinvented Clearcase!

[+] kentonv|9 years ago|reply

Google's Piper is impressive (I used it), but it emulates Perforce. Having something Git-based is a lot more exciting. Hope someone ports it to platforms other than Windows...

[+] general_ai|9 years ago|reply

Google is far more advanced than this. They have one giant monorepo (Piper) that's backed by Bigtable (or at least it was, when I was there). Piper was mostly created in response to Perforce's inability to scale and be fault tolerant. Until Piper came along, they would have to periodically restart The Giant Perforce Server in Mountain View. Piper is 24x7x365 and doesn't need any restarts at all. But the key bit here is not Piper per se. Unlike Microsoft, Google also has a distributed, caching, incremental build system (Blaze), and a distributed test system (Forge), and they are integrated with Piper. The vast majority of the code you depend on never actually ends up on your machine. Thanks to this, what takes hours at Microsoft takes seconds at Google. This enables pretty staggering productivity gains. You don't think twice about kicking off a build, and in most cases no more than a minute or two later you have your binaries, irrespective of the size of your transitive closure. Some projects take longer than that to build, most take less time. Tests are heavily parallelized. Dependencies are tracked (so tests can be re-run when dependencies change), there are large scale refactoring tools that let you make changes that affect the entire monorepo with confidence and without breaking anyone.

Google's dev infra is pretty amazing and it's at least a decade ahead of anything else I've seen. Every single ex-Googler misses it quite a bit.

[+] chokolad|9 years ago|reply

There is a discussion thread on r/programming, where MS folks, who implemented this answer questions. A lot of questions like why not use multiple repos, why not git-lfs, why not git subtree, etc. are answered there

https://www.reddit.com/r/programming/comments/5rtlk0/git_vir...

[+] stinos|9 years ago|reply

Thanks for bringing this up, it was actually a more interesting read than this thread. Less trolling, more facts and also interesting to read stuff I didn't happen to know. Like

One of the core differences between Windows and Linux is process creation. It's slower - relatively - on Windows. Since Git is largely implemented as many Bash scripts that run as separate processes, the performance is slower on Windows. We’re working with the git community to move more of these scripts to native cross-platform components written in C, like we did with interactive rebase. This will make Git faster for all systems, including a big boost to performance on Windows.

[+] tambourine_man|9 years ago|reply

It's interesting how all the cool things seem to come from Microsoft these days.

I still think we need something better than Git, though. It brought some very cool ideas and the inner workings are reasonably understandable, but the UI is atrociously complicated. And yes, dealing with large files is a very sore point.

I'd love to see a second attempt at a distributed version control system.

But I applaud MS's initiative. Git's got a lot of traction and mind share already and they'd probably be heavily criticized if they tried to invent its own thing, even if it was open sourced. Will take a long time to overcome its embrace, extend and extinguish history.

[+] Analemma_|9 years ago|reply

> I still think we need something better than Git, though. It brought some very cool ideas and the inner workings are reasonably understandable, but the UI is atrociously complicated. And yes, dealing with large files is a very sore point.

Note that Google and Facebook ran into the same problems Microsoft did, and their solution was to use Mercurial and build similar systems on top of it. Microsoft could've done that too, but instead decided to improve Git, which deserves some commendation. I'd rather Git and hg both got better rather than one "taking over".

[+] verytrivial|9 years ago|reply

> It's interesting how all the cool things seem to come from Microsoft these days.

I've assumed Microsoft have been making all this stuff all along, but keeping it internal then throwing it away on the probably false assumption that every bit of it is some sort of competitive advantage. I think they're coming around to the idea that at least appearing constructive and helpful to the developer community will help with trying to hire good developers.

[+] sytse|9 years ago|reply

Maybe something that has the data models of git but has a more consistent interface? Today on Git Merge there was a presentation about http://gitless.com/

For example one of the goals is to always allow you to switch branches. Stash and stash pop would happen automatically and it would even work if you're in the middle of a merge.

[+] coherentpony|9 years ago|reply

> I'd love to see a second attempt at a distributed version control system.

Out of curiosity, why a whole new attempt? Personally, I'd prefer the approach of "making our current tools better."

[+] ska|9 years ago|reply

   I'd love to see a second attempt at a distributed version control system.

Git wasn't the first, and even then had several contemporaries at 2nd gen.

[+] blunte|9 years ago|reply

Indeed, but I also had a pause when I considered how heavily Microsoft depended on a system originally built by Linus Torvolds :)

[+] ma2rten|9 years ago|reply

Mercurial is very similar to git but more user friendly.

[+] Florin_Andrei|9 years ago|reply

> It's interesting how all the cool things seem to come from Microsoft these days.

It's like a whole'nother company after they got rid of Steve Ballmer.

[+] greyman|9 years ago|reply

> but the UI is atrociously complicated

Linus himself admitted that he isnt good at UI. Anyway, I think git just wasnt designed to be used directly, but via another UI. For example, I use it within Visual Studio Code, and that covers about 90 percent of usecases, and then Git Extensions can take care of almost everything else. Sometimes cli is needed, though.

[+] unknown|9 years ago|reply

[deleted]

[+] unknown|9 years ago|reply

[deleted]

[+] aairey|9 years ago|reply

What a change of cash-cow placement can do ...

[+] gvb|9 years ago|reply

Using git with large repos and large (binary blob) files has been a pain point for quite a while. There have been several attempts to solve the problem, none of which have really taken off. I think all the attempts have been (too) proprietary – without wide support, it doesn’t get adopted.

I'll be watching this to see if Microsoft can break the logjam. By open sourcing the client and protocol, there is potential...

Other attempts:

* https://github.com/blog/1986-announcing-git-large-file-stora...

* https://confluence.atlassian.com/bitbucketserver/git-large-f...

Article on GitHub’s implementation and issues (2015): https://medium.com/@megastep/github-s-large-file-storage-is-...

[+] cies|9 years ago|reply

I think Joey Hess' attempt at "solving the problem" deserves a mention.

It is open source (GPLV3) licensed. [not proprietary]

Written in Haskell. [cool aid]

Currently has 1200+ stars on Github and is part of at least Ubuntu (http://packages.ubuntu.com/search?keywords=git-annex) since 12.04. [shows something for support and adoption]

edit: Link to Github https://github.com/joeyh/git-annex -- thanks dgellow

[+] EuAndreh|9 years ago|reply

git-annex is, IMHO, by far the best solution.

Pros of git-annex:

- it is conceptually very simple: use symlinks instead of ad-hoc pointer files, virtual files system, etc. to represent symbolic pointer that point to the actual blob file;

- you can add support for any backend storage you want. As long as it support basic CRUD operations, git-annex can have it as a remote;

- you can quickly clone a huge repo by just cloning the metadata of the repo (--no-content in git-annex) and just download the necessary files on-demand;

And many other things that no other attempt even consider having, like client-side encryption, location tracking, etc.

[+] vvanders|9 years ago|reply

That still only solves half the problem with large binary blobs.

The other half is that almost all of the binary formats can't be merged and so you need a mechanism to lock them to prevent people from wiping out other people's changes. Unfortunately that runs pretty much counter the idea of DCVS.

[+] kentt|9 years ago|reply

It's disappointing that all the comments are so negative. This is a great idea and solves a real problem for a lot of use cases.

I remembering years ago Facebook says it had this problem. A lot of the comments were centered around that you could change your codebase to for what git can do. I'm glad there's another option now.

[+] mox1|9 years ago|reply

Yes they did. They choose to scale out Mecurial to solve their problem. Wonder if they still use Mercurial?

https://code.facebook.com/posts/218678814984400/scaling-merc...

[+] outcoldman|9 years ago|reply

Also don't think that this is a good idea. Git is a Distributed Version Control https://en.wikipedia.org/wiki/Distributed_version_control, the main benefit of which is "allows many software developers to work on a given project without requiring them to share a common network". Seems like with GVFS they are making DVC to be a CVS (https://en.wikipedia.org/wiki/Concurrent_Versions_System) again. What is the point? There are a lot of good CVS systems around. They just to give cool kids access to cool tools? I believe there are plenty bridges between CVS and git already implemented, which also allows you to checkout only part of the CVS tree.

At Splunk we had the same problem, our source code was stored in CVS (perforce), but we wanted to switch to git. And not only because we really wanted to use git, but to simplify our development process, mainly because of the much easier branching model (lightweight branching also is available in perforce, but to get it we still needed to do some upgrades on our servers). We also had a problem that at the beginning we had very large working tree, don't think it was 200-300Gb, I believe it was 10x less, and actually required 4-5 seconds for git status. This was not appropriate for us, so we worked on our source code and release builds to split it in several git repos to make sure that git status will take not more than 0.x seconds.

My point is use right tools for right jobs. 4-5 seconds for git status is still a huge problem, I would prefer to use CVS instead if that will not require me to wait 5 seconds for each git status invocation.

[+] youdontknowtho|9 years ago|reply

I was actually surprised that there was only as much negative sentiment as there is. Microsoft could cure cancer and the post to HN would be mostly negative. It's tribal. It doesn't even matter what they do at this point.

That being said, you can see more and more people getting off the "Microsoft is evil" train. It's super slow and every bone headed thing that Microsoft does resets the needle for lots of people.

I've always been surprised how much sympathy a company like IBM or Intel gets on HN. They both sue people over patents. That both contribute to non-free software. They were early backers of Linux, though, and that is what people care about superficially.

[+] ajross|9 years ago|reply

> This is a great idea and solves a real problem for a lot of use cases.

I don't know if "a lot" is the right qualifier. Solitary repos of millions of files have scalability problems even outside the source control system (I mean: how long does it take your workstation to build that 3.5 million-file windows tree?)

A full Android system tree is roughly the same size and works fine with git via a small layer of indirection (the repo tool) to pull from multiple repositories. A complete linux distro is much larger still, and likewise didn't need to rework its tooling beyond putting a small layer of indirection between the upstream repository and the build system.

Honestly I'd say this GVFS gadget (which I'll fully admit is pretty cool) exists because Microsoft misapplied their source control regime.

[+] anon987|9 years ago|reply

It's because the 'problem' it solves is a corner case that's rarely encountered. I love their absurd examples of repos that take 12 hours to download. How many people have that problem, really?

All they did is create a caching layer.

[+] wyldfire|9 years ago|reply

I'm immediately reminded of MVFS and clearcase. Lots of companies still use clearcase, but IMO it's not the best tool for the job. git is superior in most dimensions. From what this article says, it's not quite the same as clearcase but there's certainly some hints of similarities.

The biggest PITA with clearcase was keeping their lousy MVFS kernel module in sync with ever-advancing linux distros.

I really liked Clearcase in 1999, it was an incredible advancement over other offerings then. MVFS was like "yeah! this is how I'd design a sweet revision control system. Transparent revision access according to a ranked set of rules, read-only files until checked out." But with global collaborators, multi-site was too complex IMO. And overall, clearcase was so different from other revision control systems that training people on it was a headache. Performance for dynamic views would suffer for elements whose vtrees took a lot of branches. Derived objects no longer made sense -- just too slow. Local disk was cheap now, it got bigger much faster than object files.

> However, we also have a handful of teams with repos of unusual size! ... You can see that in action when you run “git checkout” and it takes up to 3 hours, or even a simple “git status” takes almost 10 minutes to run. That’s assuming you can get past the “git clone”, which takes 12+ hours.

This seems like a way-out-there use case, but it's good to know that there's other solutions. I'd be tempted to partition the codebase by decades or something.

[+] tcbawo|9 years ago|reply

I used Clearcase (on Solaris) in 1999 and was not a fan. It slowed our build times by at least 10x. I'm sure it was probably set up wrong, but this was a Fortune 100 company with lots of dedicated resources.

[+] dewyatt|9 years ago|reply

I think they could have picked a name that doesn't conflict with GNOME Virtual File System (GVfs).

[+] daigoba66|9 years ago|reply

The article doesn't directly say it, but are they migrating the Windows source code repository to git? That seems like a big deal.

I seem to recall that Microsoft has previously used a custom Perforce "fork" for their larger code bases (Windows, Server, Office, etc.).

[+] Ericson2314|9 years ago|reply

If I understand this correctly, unlike git-annex and git lfs, this not about extending the git format with special large files, but changing the algorithm for the current data format.

A custom filesystem is indeed the correct approach, and one that git itself should have probably supported long ago. In fact, there should really only be one "repo" per machine, name-spaced branches, and multiple mountpoints a la `git worktree`. In other words there should be a system daemon managing a single global object store.

I wonder/hope IPFS can benefit from this implementation on Windows, where FUSE isn't an option.

[+] manojlds|9 years ago|reply

The blog post does mention that some changes have been made to git (in their fork)

[+] hoov|9 years ago|reply

This is pretty big news. I know that when I was at Adobe, the only reason that Perforce was used for things like Acrobat, is because it was simply the only source control solution that could handle the size of the repo. Smaller projects were starting to use Git, but the big projects all stuck with Perforce.

[+] kevincox|9 years ago|reply

I love this approach. From working at Google I appreciate the virtual filesystem, it makes a lot of things a lot easier. However all my repos are large enough to fit on a single machine so I wish there was a mode where it was backed by a local repository, however the filesystem allows git to avoid tree scans.

Basically most operations in git are O(modified files) however there are a few that are O(working tree size). For example checkout and status were mentioned by the article. However these operations can be made to O(modified) files if git doesn't have to scan the working tree for changes.

So pretty much I would be all over this if:

- It worked locally.

- It worked on Linux.

Maybe I'll see how it's implemented and see if I could add the features required. I'm really excited for the future of this project.

[+] rethab|9 years ago|reply

Assuming that the repo was this big in the beginning, I wonder why the ever migrated to git (I'm assuming they did, because they can tell how long it takes to checkout). At least when somebody "tries" do the migration, wouldn't they realize that maybe git is not the right tool for them? Or did they actually migrate and then work with "git status" that take 10 minutes for some time until they realize they may need to change something?

Also, it would have been interesting if the article mentioned whether they tried other approaches taken by facebook (mercurial afaik) or google.

[+] xearl|9 years ago|reply

To me it sounds like these numbers are from a migration-in-progress. So they are trying, but instead of giving up and saying "not the right tool for us" they are trying to improve the tool.

[+] becarefulyo|9 years ago|reply

Because of the productivity benefits of using public tools instead of internal ones. Devs are more familiar with them, more documentation and examples, morale benefit because skills are transferable to other jobs, etc.

[+] imron|9 years ago|reply

> repos of unusual size

Sounds like they've almost solved the secrets of the fire swamp!

[+] krallja|9 years ago|reply

Repos of Unusual Size? I don't think they exist.

[+] olkid|9 years ago|reply

They can live there quite happily for some time.

[+] rbanffy|9 years ago|reply

Did they really need to make a name collision?

https://en.wikipedia.org/wiki/GVfs

[+] Navarr|9 years ago|reply

This sounds like a solid use case and a solid extension for that use case - but definitely not the end-all-be-all.

For one, it's not really distributed if you're only downloading when you need that specific file.

But that doesn't change the merrits of this at all, I think.

[+] cafebabbe|9 years ago|reply

My sysadmin: "we won't switch to git because it can't handle binary files and our code base is too big"

Our whole codebase is 800MB.

[+] yakk0|9 years ago|reply

I appreciated the Princess Bride reference with "repos of unusual size"

[+] 0X1A|9 years ago|reply

Just to make sure I have this right, this has to do with the _amount_ of files in their repo and not the _size_ of the files? So projects like git annex and LFS would not help the speed of the git repos?

[+] OJFord|9 years ago|reply

> when you run “git checkout” and it takes up to 3 hours, or even a simple “git status” takes almost 10 minutes to run. That’s assuming you can get past the “git clone”, which takes 12+ hours.

How on Earth can anybody work like that?

I'd have thought you may as well ditch git at that point, since nobody's going to be using it as a tool, surely?

    git commit -m 'Add today\'s work - night all!' && git push; shutdown

[+] mortdeus|9 years ago|reply

Or how about we start some compartmentalizing your codebase so that you can like. You know, organize your code and restore sanity to the known universe.

I think when the powers that be said that whole thing about geniuses and clutter, they were specifically talking about their living spaces and not their work...

274 comments