Facebook hit git performance issue on large repository

[+] bos|14 years ago|reply

Facebook engineer here, working on this problem with Joshua.

What this comes down to is that git uses a lot of essentially O(n) data structures, and when n gets big, that can be painful.

A few examples:

* There's no secondary index from file or path name to commit hash. This is what slows down operations like "git blame": they have to search every commit to see if it touched a file.

* Since git uses lstat to see if files have been changed, the sheer number of system calls on a large filesystem becomes an issue. If the dentry and inode caches aren't warm, you spend a ton of time waiting on disk I/O.

An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash. Also, inotify is an incredibly tricky interface to use efficiently and reliably. (I wrote the inotify support in Mercurial, FWIW.)

* The index is also a performance problem. On a big repo, it's 100MB+ in size (hence expensive to read), and the whole thing is rewritten from scratch any time it needs to be touched (e.g. a single file's stat entry goes stale).

None of these problems is insurmountable, but neither is any of them amenable to an easy solution. (And no, "split up the tree" is not an easy solution.)

[+] groby_b|14 years ago|reply

An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash

So does, presumably, the cache when you use lstat. (Let's scratch presumably. It does. Bonus points if you can't use Linux and use an OS that seems to chill its caches down as soon as possible. )

I hope I'm wrong, but the proper solution to this seems to be a custom file system - not only will it allow you to more easily obtain a "modified since" list of files, it also allows you to only get local files "on demand". (E.g. http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...)

That still doesn't solve the data structure issues in git, but at least it takes some of the insane amount of I/O off the table.

I'm looking forward to see what you guys cook up :)

[+] dochtman|14 years ago|reply

Heh, a familiar name working on this!

So for your first item, it seems like it should be possible to add a (mostly immutable) cache file doing the job of Mercurial's files field in changesets, right? I.e. for each commit, list the files changed. Should be more efficient than searching through trees/manifests for changed files, at least.

For large (in files) trees, it seems like there's no easy solution, except for developing some kind of subtree support. However, that's similar to just splitting up the repository (along the lines of hg subrepo support), in the sense that now you have no real verification that non-checked-out parts of the tree will work with the changes in the part you do have checked out.

Still, the inotify daemon seems like it could alleviate things a bunch; particularly if the repository is on a server anyway, i.e. it's not rebooted that often.

[+] Terretta|14 years ago|reply

> waiting on disk I/O

Out of curiosity, why are these benchmarks using regular disk and flash disk? At only 15 GB, what happens using ram disk? Sure SSD is fast, but for these things it's still really slow.

[+] joelthelion|14 years ago|reply

I see you've given the issue a lot of thought.

Sorry for asking the obvious, but do you really need huge amount of data to keep development productive? How often do you use history that is several years old? Could you not archive it?

Or is the sheer number of files the problem, even ignoring history?

This is not a "git is perfect, fix your workflow" post, but I'm genuinely interested in what you have to say. Also, it seems like making git faster is a increasingly difficult task, given the amount of effort that has already been put into it.

[+] tonfa|14 years ago|reply

Nice to see you're still in the DVCS business ;)

[+] willvarfar|14 years ago|reply

Do you think adapting git to use, say, LevelDB and letting that do its job with incremental updates and maintaining a secondary index by path could help?

And that might dovetail nicely with an inotify daemon?

[+] unknown|14 years ago|reply

[deleted]

[+] DrCatbox|14 years ago|reply

You seem to know enough about this problem to solve it given enough time.

Why doesnt facebook solve this git on huge repos problem and put out a patch for others to see? Oh, right, you want somebody else to solve the problem for you, for free!

[+] lbrandy|14 years ago|reply

Wow. I was expecting an interesting discussion. I was disappointed. Apparently the consensus on hacker news is that there exists a repository size N above which the benefits of splitting the repo _always_ outweigh the negatives. And, if that wasn't absurd enough, we've decided that git can already handle N and the repository in question is clearly above N. And I guess all along we'll ignore the many massive organizations who cannot and will not use git for precisely the same issue.

So instead of (potentially very enlightening conversation) identifying and talking about limitations and possible solutions in git, we've decided that anyone who can't use git because of its perf issues is "doing it wrong".

[+] kinofcain|14 years ago|reply

Your comment was at the top so I continued to read expecting to find a bunch of ignorant group think about how git is awesome and Facebook is dumb, but that's not really what's going on down below.

I don't know what facebook's use case is, so I have no idea if their repositories are optimally structured. However, I've used git on a very large repository and ran into some of the same performance issues that they did (30+ seconds to run git status), so I don't think it's terribly hard to imagine they're in a similar situation.

What we did to solve it is exactly what you're excoriating the people below for suggesting: we split the repos and used other tools to manage multiple git repos, 'Repo' in some situations, git submodules in others.

However, we moved to that workflow mainly because it had a number of other advantages, not just because it made day-to-day git operations faster.

I hope git gets faster, some of the performance problems described are things we saw too, but things are always more complicated and I see nothing below that looks like the knee-jerk ignorant consensus you're describing.

Sometimes the answer to "it hurts when I do this" is "don't do that... because there's other ways to solve the same issue that work better for a number of other reasons and we haven't bothered fixing that particular one because most of the time the other way works better anyway."

[+] lincolnq|14 years ago|reply

I had the same reaction as you. </meta>

Stat'ing a million files is going to take a long time. Perforce doesn't have this problem because you explicitly check out files (p4 edit). (Perforce marks the whole tree read-only, as a reminder to edit the file before you save.)

It seems like large-repo git could implement the same feature. You would just disable (or warn) for operations which require stat'ing the whole tree.

Then the question is how to make the rest of the operations perform well -- git add taking 5-10 seconds seems indicative of an interesting problem, doesn't it?

[+] akkartik|14 years ago|reply

You're right.

I found the original email equally disappointing, though. It boils down to "We pushed the envelope on size, it's too slow, we'd like to speed it up." Well, duh.

He uses the word 'scalability' early in the email, but shows no indication that he knows what it means. I'd love to hear if different operations slow down at different rates as the repo accumulates commits. Do they scale linearly, sublinearly, or superlinearly as the repo grows? Are there step functions at which there's a sudden dramatic slowdown (ran out of RAM, etc.)?

[+] sek|14 years ago|reply

You have a point.

It is just surprising when git was designed for the Linux kernel and we all here have a Github mindset.

[+] robfig|14 years ago|reply

In violent agreement here.

Git and HG: 1. Require you to be sync'ed to tip before pushing. 2. Cannot selectively check out files.

The former means that in any reasonably sized team, you will be forced to sync 30 times a day, even if you are the only one editing your section of the source tree. The latter means that Joe who is checking in the libraries for (huge open source project) for some testing increases everyones repo by that much, forever, even if it's deleted later.

Needless to say, the universal response is that I'm doing it wrong. Perforce 4 life!

But seriously, it says that Google adopted Git for their repo --- does anyone know how they use it? I would expect them to want a linear history, but their teams are way too big to be able to have everyone sync'ed to tip to push...

[+] jamespo|14 years ago|reply

I would have liked this comment better if it came up with some solutions itself, although maybe it's not easy to solve?

[+] kenrik|14 years ago|reply

With that in mind it seems like there is a market for a git replacement for these huge repos.

[+] jrockway|14 years ago|reply

Yes, it's well known that big companies with big continuously integrated codebases don't manage the entire codebase with Git. It's slow, and splitting repositories means you can't have company-wide atomic commits. It's convenient to have a bunch of separate projects that share no state or code, but also wasteful.

So often, the tool used to manage the central repository, which needs to cleanly handle a large codebase, is different from the tool developers use for day-to-day work, which only needs to handle a small subset. At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis. This model seems to scale fairly well; Google has a big codebase with a lot of reuse, but all my git operations execute instantaneously.

Many projects can "shard" their code across repositories, but this is usually an unhappy compromise.

People always use the Linux kernel as an example of a big project, but even as open source projects go, it's pretty tiny. Compare the entire CPAN to Linux, for example. It's nice that I can update CPAN modules one at a time, but it would be nicer if I could fix a bug in my module and all modules that depend on it in one commit. But I can't, because CPAN is sharded across many different developers and repositories. This makes working on one module fast but working on a subset of modules impossible.

So really, Facebook is not being ridiculous here. Many companies have the same problem and decide not to handle it at all. Facebook realizes they want great developer tools and continuous integration across all their projects. And Git just doesn't work for that.

[+] Splines|14 years ago|reply

At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis.

At MS we also use Perforce (aka Source Depot), and I've toyed with the idea of doing something similar. Have you found any guides for "gotchas" or care to share what you've learned going this route?

[+] thurn|14 years ago|reply

Facebook uses Subversion for its trunk, actually, and just gets developers to use git-svn. This issue is primarily a problem because git-svn is a lot more serious about replicating the true git experience (keep everything local) than Google's p4-git wrapper is. They really just need to be a little less religious about keeping everything local.

[+] unknown|14 years ago|reply

[deleted]

[+] pnathan|14 years ago|reply

> Yes, it's well known that big companies with big continuously integrated codebases don't manage the entire codebase with Git. It's slow, and splitting repositories means you can't have company-wide atomic commits. It's convenient to have a bunch of separate projects that share no state or code,

Can you expand on this? I would love to talk more about the "well known" part, I've never run across it before. I am a maintainer (tools guy actually) of a hg repo with about 120 subrepos, and the whole approach with subrepos is something that we're not thrilled about. Oh, and if you want to communicate via email, I'd be up for that too.

[+] ramanujan|14 years ago|reply

This looks like it could be of assistance:

http://source.android.com/source/version-control.html

  Repo is a repository management tool that we built on top 
  of Git. Repo unifies the many Git repositories when 
  necessary, does the uploads to our revision control 
  system, and automates parts of the Android development 
  workflow. Repo is not meant to replace Git, only to make 
  it easier to work with Git in the context of Android. The 
  repo command is an executable Python script that you can 
  put anywhere in your path. In working with the Android 
  source files, you will use Repo for across-network 
  operations. For example, with a single Repo command you 
  can download files from multiple repositories into your 
  local working directory.

http://google-opensource.blogspot.com/2008/11/gerrit-and-rep...

  With approximately 8.5 million lines of code (not 
  including things like the Linux Kernel!), keeping this all 
  in one git tree would've been problematic for a few reasons:

  * We want to delineate access control based on location in the tree.
  * We want to be able to make some components replaceable at a later date.
  * We needed trivial overlays for OEMs and other projects who either aren't ready or aren't able to embrace open source.
  * We don't want our most technical people to spend their time as patch monkeys.

  The repo tool uses an XML-based manifest file describing 
  where the upstream repositories are, and how to merge them 
  into a single working checkout. repo will recurse across 
  all the git subtrees and handle uploads, pulls, and other 
  needed items. repo has built-in knowledge of topic 
  branches and makes working with them an essential part of 
  the workflow.

Looks like it's worth taking a serious look at this repo script, as it's been used in production for Android. Might allow splitting into multiple git repositories for performance while still retaining some of the benefits of a single repository.

[+] exDM69|14 years ago|reply

> Looks like it's worth taking a serious look at this repo script, as it's been used in production for Android. Might allow splitting into multiple git repositories for performance while still retaining some of the benefits of a single repository.

Stay away from Repo and Gerrit. I use them at work, and they make my life miserable.

Repo was written years ago when Git did not have submodules, a feature where you can put repositories inside repositories. Git submodules is far superior to Repo, and allows you to e.g. bisect the history of many repositories.

I'm hoping that Google comes to it's senses and starts phasing out Repo in favor of Git submodules in Android development.

[+] cdibona|14 years ago|reply

Basically, if you want to manage a large collection of git source repositories, you'll probably end up using Repo and Gerrit and piggybacking on the work of the android ecosystem (and beyond, gerrit is used all over the place now)

There really isn't another solution out there right now (at least not anything open source) for very large single repositories.

[+] jmccaffrey|14 years ago|reply

Having worked with repo professionally, I'm not a fan. You lose simple ability to track dependencies across repositories or even revert to a previous consistent point in time without diligent tagging. Even with good tags, restructuring your project setup and changing your repo manifest can still break your ability to go back in time.

[+] losvedir|14 years ago|reply

Huh, fascinating. git was initially created for the Linux kernel development, and I haven't heard of any issues there. Offhand I would have said, as a codebase, the Linux kernel would be larger and more complex than facebook, but I don't have a great sense of everything involved in both cases.

So what's the story here: kernel developers put up with longer git times, the kernel is better organized, the scope of facebook is more massive even than the linux kernel, or there's some inherent design in git that works better for kernel work than web work?

[+] zargon|14 years ago|reply

The linux kernel has an order of magnitude fewer files than Facebook's test repository (25,000 in version 2.6.27, according to http://www.schoenitzer.de/lks/lks_en.html) and only 9 million lines of text.

This is on the largish side for a single project, but if Facebook likes to keep all their work in single repo then it isn't too difficult to go way beyond those stats. Think of keeping all GNU projects in a single repo.

[+] marginalboy|14 years ago|reply

It isn't surprising if Facebook has a large, highly coupled code base. Given their reputation for tight timelines and maverick-advocacy, I'm continually surprised the thing works at all.

[+] cbs|14 years ago|reply

From the sounds of it facebook has a really, really big ball of highly coupled code.

[+] aidenn0|14 years ago|reply

The linux kernel is several orders of magnitude smaller. They are talking about 1.3 million files totalling nearly 10GB for the working tree. My kernel checkout has 39 thousand files totaling 489MB.

[+] yuvadam|14 years ago|reply

While I'd be interested in seeing this issue further unfold, just the prospect of a 1.3M-file repo gives me the creeps.

I'm not sure what the exact situation at Facebook is with this repository, but I'm positive that if they had to start with a clean slate, this repo would easily find itself broken up into at least a dozen different repos.

Not to mention the fact that if _git_ has issues dealing with 1.3M files, I wonder what other (D)VCS they're thinking of as an alternative that would be more performant.

[+] sek|14 years ago|reply

http://thread.gmane.org/gmane.comp.version-control.git/18977...

They keep every project in a single repo, mystery solved.

Edit:

> We already have some of the easily separable projects in separate repositories, like HPHP.

Yeah, because it makes no sense, it's C++. They probably use for everything PHP i assume then. Is there no good build management tool for it?

[+] julian37|14 years ago|reply

Somewhat off-topic, could somebody explain why

  echo 3 | tee /proc/sys/vm/drop_caches

rather than just

  echo 3 > /proc/sys/vm/drop_caches

Is it because the output to stdout lets you be extra sure that the right data was sent to the kernel?

I'm just wondering if this is an idiom with a deeper meaning that I'm not aware of.

EDIT: I'm guessing that when you run it in a script (without set -x), rather than on the command line, you can see in the log what it is you sent?

[+] pdw|14 years ago|reply

Because you can

    echo 3 | sudo tee /proc/sys/vm/drop_caches

but

    sudo echo 3 > /proc/sys/vm/drop_caches

won't work.

[+] jochu|14 years ago|reply

Aside from reasons you mentioned, I can imagine it being because it easily allows one to add a sudo or being habit because of it. For example:

  echo 3 | sudo tee /proc/sys/vm/drop_caches

Will allow you to write as root and

  sudo echo 3 > /proc/sys/vm/drop_caches

Will be a permission error. It executes the echo as root and the write as the user

[+] dblock|14 years ago|reply

Others have tried and keep throwing more and more smart people at the problem they just shouldn't have.

MSFT with Windows codebase that runs out of several labs. Crazy branching and merging infrastructure. They use source-depot, originally a clone of perforce.

Google with all their source code in one Perforce repo.

Facebook will be on perforce before we know it.

The solution is an internal Github, not one giant project.

[+] gokhan|14 years ago|reply

Large repos bring their own problems, and results in some design decisions accordingly. For example, Visual Studio itself is 5M+ files and this affected some of the the initial design decisions (Server side workspaces, for this example) when developing TFS 2005 (the first version) [1]. That decision suits MS but not the small to medium clients well. So they're now alternating that design with client side workspaces.

It's not wise to offer Facebook to split the repository. Looks like it's time to improve the tool.

[1] http://blogs.msdn.com/b/bharry/archive/2011/08/02/version-co...

[+] iamleppert|14 years ago|reply

I can believe this working with a former facebook employee. They do not believe in separating or distilling anything into separate repos. Why the fuck would you want to have a 15GB repo?

Ideally they should have many small, manageable repositories that are well tested and owned by a specific group/person/whatever. At least something small enough a single dev or team can get their head around.

Sheesh.

[+] dustingetz|14 years ago|reply

the obvious answer, repeatedly mentioned in comments:

> factor into modules, one project per repo

where i work we have a project with clear module boundaries, but all in the same repo. we have an "app" and some dependencies including our platform/web framework. none of these are stable, they're all growing together. Commits on the app require changes in the platform, and in code review it is helpful to see things all together. Porting commits across different branches requires porting both the application change and the dependent platform changes. Often a client-specific branch will require severe one-off changes so the platform may diverge -- it is not practical for us (right now) to continually rebase client branches onto the latest platform.

this is just our experience, not facebook's, but lets face it: real life software isn't black and white, and discussion that doesn't acknowledge this isn't particularly helpful.

[+] snprbob86|14 years ago|reply

We've experienced this.

We've got a superproject with our server configs, and sub projects for our background processing, API, and web-frontend respectively.

Often, each project can evolve and be versioned 100% independently. However, often you need to modify multiple projects and (especially with server config changes) coordinate changes via the super project.

It's a little hairy sometimes and often feels like unnecessary overhead, but the mental boundary is extremely valuable on it's own. Being able to add a field to the API and check that commit into the superproject for deployment before the front end features are done is nice. The social impact on implementation modularity is valuable. We write better factored code by letting Git establish a more concrete boundary for us.

[+] sek|14 years ago|reply

This is what git submodules are for, but when they can't use them they don't have clear module boundaries.

[+] djtriptych|14 years ago|reply

I hope these guys do take the route of developing a large-scale performant patch.

Git as so many interesting uses at scale as just a tool that navigates and tracks DAGs over time.

[+] courtewing|14 years ago|reply

This was actually pretty fascinating to me. On one hand, I am astonished at how long it takes to perform seemingly trivial git operations on repositories at this scale. On the other hand, I'm utterly mystified that a company like Facebook has such monolithic repositories. Even back when I was using SVN a lot, I relied on externals and such to break up large projects into their smaller service-level components.

I'd be very interested to see some benchmarks on their current VCS solution for repositories of this scale.

[+] jpdoctor|14 years ago|reply

$100B company, maybe they can afford to put some people onto solving this for the open software community (and put the solution into the open), especially since nobody else in the community seems to have this problem.

[+] redstone|14 years ago|reply

This is Joshua (who posted the original email). I'm glad to see so much interest in source control scalability. If there are others who have ever contemplated investing a bit of time to improving git, it'd be great to coordinate and see what makes sense to do - even if it turns out that the right answer is just to make the tools that manage multiple repos so good that it feels as easy as a single repo.

[+] lnguyen|14 years ago|reply

There's two issues: the width of the repository (number of files) and the depth (the number of commits).

Since "status" and "commit" perform fairly well after the OS file cache has been warmed up, that probably can be resolved by having background processes that keep it warm. (Also, how long would it take to just simply stat that number of files? )

The issue of "blame" still taking over 10 minutes: We need to know far back in the repository they're searching. What happen if there's one line that hasn't been changed since the initial commit? Are you being forced to go back to through the whole commit history?

How old is the repository? Years? Months? I'm probably guessing in the at least years range based on the number of commits (unless the developers are extremely commit-happy).

At a certain point, you're going to be better off taking the tip revisions off a branch and starting a fresh repository. It doesn't matter what SCM/VCS tool you're using (I've been the architect and admin on the implementation of a number of commercial tools). Keep the old repository live for a while and then archive it.

You'll find that while everyone wants to say that they absolutely need the full revision history of every project, you rarely go back very far (aka the last major release or two). And if you do need that history, you can pull it from the archives.

[+] pwpwp|14 years ago|reply

Git was designed for the Linux kernel, and it's simply not big: a couple thousand files, broken up into directories of dozens or hundreds of files.

http://www.schoenitzer.de/lks/lks_en.html#new_files

[+] teyc|14 years ago|reply

This is an interesting social AND a technical problem. The problem for FB is that it is all too easy for them to just fork git, create the necessary interfaces and then hope the git maintainers would accept it (they mightn't) or release it into the wild (and incur bad karma and wrath of OS developers who'd see this has schism or even heresy).

They've reached out to the developers on git, and I guess that's a first step.

[+] dpcx|14 years ago|reply

I don't want to imagine the actual kind of code that requires 1.3M files to run.

205 comments