top | item 13559807

testUser69 | 9 years ago

>the Windows codebase has over 3.5 million files and is over 270 GB in size

So instead of cleaning up this mess they decided to "fix" git? With this type of thing going on at MS it's no wonder Windows 10 is more buggy than my Linux box.

Also why would they keep the entirety of "Windows" in one git repo? The only reason I can think to do this is if very large parts of the ecosystem are so tightly coupled together, that they depend on each other for compilation. I know it's not UNIX but any basic programming course teaches you to decouple your programs (and the parts of your programs) to make them not dependent on each other. Is the developer of explorer.exe expected to clone the whole Windows repo? Do they have no idea what they're doing? If they seriously have one monolithic code base then that really explains a lot.

Sounds like it's amateur hour turned up to 11 over at Microsoft.

bndr | 9 years ago

It's a normal practice; Google does it too: "The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence. The repository contains 86 TB of data, including approximately two billion lines of code in nine million unique source files." [1]

[1] http://cacm.acm.org/magazines/2016/7/204032-why-google-store...

testUser69 | 9 years ago

>The Git community strongly suggests and prefers developers have more and smaller repositories. A Git-clone operation requires copying all content to one's local machine, a procedure incompatible with a large repository.

They aren't copying all the files the way a normal Git clone does. They have a custom setup that sounds like it lets you check out just the parts you need. I haven't had time to read the whole thing, but it sounds like it works by breaking a "super repo" down into small "sub repos". That actually makes sense.

There is no way working with a 300 GB Git repo is fun or efficient, and they've probably been doing that for years at Microsoft.
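(A rough sketch of the idea, in Python. The real GVFS is a Windows filesystem driver, not a Python class, and every name below is made up; this just illustrates "enumerate everything cheaply, download content only when a file is actually read".)

```python
# Illustrative sketch of GVFS-style lazy hydration (all names hypothetical).

class LazyRepo:
    def __init__(self, index, fetch):
        # index: {path: object_id}, small enough to download up front
        # fetch: callable that downloads one object's content on demand
        self.index = index
        self.fetch = fetch
        self.cache = {}  # locally hydrated file contents

    def listdir(self):
        # Enumerating paths needs only the index, not any file contents.
        return sorted(self.index)

    def read(self, path):
        # Content is downloaded the first time a file is actually opened,
        # then served from the local cache on later reads.
        if path not in self.cache:
            self.cache[path] = self.fetch(self.index[path])
        return self.cache[path]

# Usage: only the one file that is read gets "downloaded".
downloads = []
def fake_fetch(oid):
    downloads.append(oid)
    return f"contents of {oid}"

repo = LazyRepo({"kernel/boot.c": "a1", "shell/explorer.c": "b2"}, fake_fetch)
print(repo.listdir())                 # both paths visible immediately
print(repo.read("shell/explorer.c"))  # triggers exactly one fetch
print(downloads)
```

The point is that clone and status-style operations touch only the index, so their cost no longer scales with the 270 GB of content.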

maybe_someday | 9 years ago

Hmmm... both Facebook and Google have giant monolithic repositories and use trunk-based development, and I'm guessing you wouldn't say they're doing it wrong. So I'm not surprised Windows does the same.

koolba | 9 years ago

This is the Internet. There's always someone who'll say "they're doing it wrong!"

cwyers | 9 years ago

To be fair, every time we have a thread on HN about how Google or Facebook do version control, there are people who leap at the chance to say they're doing it wrong.

klodolph | 9 years ago

I guess Microsoft, Google, Facebook, and Twitter are all run by amateurs, to name a few.

becarefulyo | 9 years ago

It works fine. We have really good code search tools.

The numbers include test code, utilities, two entire web browsers, UI frameworks, etc.

ska | 9 years ago

There is a reason you can only think of one reason for this: it's known as inexperience.

therealmarv | 9 years ago

It seems you have no clue how very big companies manage their source code. Google, for example, keeps most of its source code in ONE single repository. So would you say the same about them?

stephenr | 9 years ago

Yes?

Big companies, even successful ones, aren't incapable of making (and continuing to make) stupid decisions.

I'm sure each company has a reason for using a single massive repo. I doubt I would agree with their reason, but I'm sure they have one.

dstaheli | 9 years ago

It ain't easy to transition a significant code base from centralized source control to distributed overnight, yet the benefits of distributed are desirable.

olkid | 9 years ago

> tightly coupled together

systemd

oneplane | 9 years ago

This is exactly what I was thinking when I read the article. There is no reason for any Git repo to be that big. It's not a bug in Git, but more like a reasonable limit... if your project exceeds it, you're doing it wrong.

I'm also curious as to how they used to do it without git, maybe using TFS? I wonder what the timings on that were.

Anyway, I don't think GVFS is the way to go, and I hope it either doesn't get accepted or doesn't play a role outside of Windows. It's good to see more Git usage, but hacking away at the tools instead of fixing the problematic project seems somewhat idiotic. I can imagine other tools having problems with a single project of that size; are they going to hack those as well?

BugsJustFindMe | 9 years ago

Ah, yes. The mating call of the wild elite computer expert.

Microsoft: "We, one of the most technologically advanced companies, with the 2nd or 3rd highest market cap of any public company on the planet depending on the day, had a problem with infrastructure while trying to manage possibly the largest software project anyone has ever made. And then we solved it."

You: "Stop doing what you're doing and pay attention to meeeeeeeeeeee."

klodolph | 9 years ago

The problem with this argument is that it doesn't really explain why large repos are wrong. It turns out the choice between monorepo and multi-repo doesn't have a one-size-fits-all answer. Microsoft can build its own tooling, and internal tools don't have to run outside of Microsoft, so they're easier to develop.

Google reportedly used Perforce for their monolithic codebase, and Facebook is said to use Mercurial with a bunch of modifications. They all have huge codebases kept mostly in one repo (I've heard Facebook had a >50 GB Git repository, and Google's codebase is supposed to be in the TB range).

ctdonath | 9 years ago

"640k is enough for anyone."

CJefferson | 9 years ago

I don't see why there should be limits.

I recently ran some experiments, with one file per experiment and one result file per experiment. At around 100,000 files, git started getting very upset. Why shouldn't I be allowed to have 100,000 files, or a million files, in a directory? Why should it be my job, as a user, to manually rearrange my data into a format my computer is happier with?
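(For what it's worth, the manual rearrangement being complained about is usually hash-prefix sharding: fan the flat files out into small subdirectories keyed by a hash prefix, which is the same trick Git's own object store uses in .git/objects/ab/cd.... A minimal sketch, with an illustrative function name, not any real tool's layout:)

```python
import hashlib

def shard_path(name, depth=1, width=2):
    # Map a flat filename into a fan-out directory tree by hash prefix.
    # With depth=1 and width=2 there are 256 buckets, so 100,000 files
    # end up at roughly 390 per directory instead of one huge listing.
    digest = hashlib.sha1(name.encode()).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(parts + [name])

print(shard_path("experiment_042.result"))          # e.g. "xx/experiment_042.result"
print(shard_path("experiment_042.result", depth=2)) # two nested bucket levels
```

The commenter's complaint stands, though: this is bookkeeping the tooling could do for you.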

garysieling | 9 years ago

Being able to use git for things it wasn't designed for would be a strength; e.g., it works great as a database for certain kinds of projects.

I put all the records for https://www.findlectures.com in it, because then I can use diff tools for testing changes. Obviously this is nowhere near the size of the Windows codebase, but I could see a world where GVFS would be helpful for collaboration on this project.
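(The "diff tools for testing changes" workflow boils down to: serialize records one per line with a stable key order, commit, and review changes as a diff. A self-contained sketch using Python's stdlib difflib in place of `git diff`, with made-up record data:)

```python
import difflib
import json

def snapshot(records):
    # One record per line with stable key order, so diffs stay line-oriented
    # and a one-field change shows up as exactly one -/+ pair.
    return [json.dumps(r, sort_keys=True) for r in records]

before = snapshot([{"id": 1, "title": "Intro to Git"},
                   {"id": 2, "title": "Monorepos at Scale"}])
after = snapshot([{"id": 1, "title": "Intro to Git"},
                  {"id": 2, "title": "Monorepos at Scale, Revisited"}])

# The same kind of review you'd get from `git diff` on committed record files.
for line in difflib.unified_diff(before, after, "before", "after", lineterm=""):
    print(line)
```

Committing the serialized records to git gives you this diff view plus history and blame for free, which is the appeal of git-as-database.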