item 11430009

Introducing DGit

262 points | samlambert | 10 years ago | githubengineering.com

47 comments

[+] ianlevesque|10 years ago|reply
> Until recently, we kept copies of repository data using off-the-shelf, disk-layer replication technologies—namely, RAID and DRBD. We organized our file servers in pairs. Each active file server had a dedicated, online spare connected by a cross-over cable.

With all the ink spilled about creative distributed architectures, it's really humbling to see how far they grew with an architecture that simple.

[+] viraptor|10 years ago|reply
I'm not really surprised. I never worked at GH scale, but I learned early that simple solutions just work. Want a DB cluster? Why not an active-passive M-M setup instead. Want an M-M setup? Why not dual servers with block replication.

Complicated things fail in complicated ways (looking at you, MySQL NDB Cluster), while simple solutions just work. They may be less efficient, but you'd better have a great use case before spending time on a new, fancy clustering solution - and an even better idea of how to handle its state / monitoring / consistency.

[+] justinsb|10 years ago|reply
The interesting bit here will be how they reconcile potentially conflicting changes between the replicas. It is pretty easy to replicate the data, because git is content-addressable (garbage collection aside) - I think even rsync would work. The challenge is that when e.g. the master branch is updated, you essentially have a simple key-value database that you must update deterministically. I look forward to learning how GitHub chose to solve that challenge!
[+] LukeShu|10 years ago|reply
rsync would only work well if there aren't any pack files (which reduce the disk space used, and access times). Pack files break the simplicity of the basic content-addressed store by compressing multiple objects together.
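For readers unfamiliar with the loose format the parent contrasts with pack files: a loose object's name is just the SHA-1 of a short header plus the content, which is what makes naive file-level replication plausible. A minimal sketch:

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    """Compute a loose object's name: SHA-1 over '<type> <size>\\0<content>'."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# Identical content yields the identical object id on every replica,
# so copying loose objects never conflicts - only the refs can.
print(git_object_id("blob", b"hello\n"))
# -> ce013625030ba8dba906f756967f9e9ca394464a
```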
[+] jvoorhis|10 years ago|reply
I'm not sure how GH solved this, but consistently updating refs in that simple k-v store seems to be the main challenge.

I wonder whether the receive-pack operation offers a natural boundary for transactions?
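One way to picture that transaction boundary (my own sketch, not GitHub's published design): a push is a batch of idempotent object writes followed by a single compare-and-swap on the ref, mirroring the old-value check of `git update-ref <ref> <new> <old>`.

```python
class RefStore:
    """Toy model of receive-pack as a two-phase transaction."""
    def __init__(self):
        self.objects = {}   # oid -> bytes (content-addressed, so idempotent)
        self.refs = {}      # refname -> oid

    def receive_pack(self, new_objects, ref, old_oid, new_oid):
        # Phase 1: store objects. Safe to retry; ids are content hashes.
        self.objects.update(new_objects)
        # Phase 2: advance the ref atomically, only from the expected value.
        if self.refs.get(ref) != old_oid:
            raise RuntimeError("ref moved; reject the push")
        self.refs[ref] = new_oid
```

Unreferenced objects left behind by a rejected phase 2 are harmless; garbage collection cleans them up later.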

[+] lotyrin|10 years ago|reply
The replica count seems to be three - allowing quorum with a single lost host - and the repository goes read-only when quorum is lost.
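A majority-vote ref update over three replicas, as described above, could look roughly like this (purely illustrative - GitHub had not published details at this point):

```python
class Replica:
    """Toy replica whose ref table becomes unreachable when it is down."""
    def __init__(self, up=True):
        self.up = up
        self._refs = {}

    @property
    def refs(self):
        if not self.up:
            raise ConnectionError("replica unreachable")
        return self._refs

def quorum_update(replicas, ref, old_oid, new_oid):
    """Apply a ref update on each replica; succeed only with a majority of acks."""
    acks = 0
    for replica in replicas:
        try:
            if replica.refs.get(ref) == old_oid:
                replica.refs[ref] = new_oid
                acks += 1
        except ConnectionError:
            pass  # replica is down; keep counting the rest
    if 2 * acks <= len(replicas):
        raise RuntimeError("quorum lost: repository is read-only")
    return acks
```

With three replicas, one lost host still leaves 2 of 3 acks and writes proceed; lose two and the update is refused.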
[+] drewm1980|10 years ago|reply
I always imagined they had some kind of huge database holding all of the git objects to avoid excessive duplication. Now it sounds like they duplicate objects for every trivial fork, times three! I fork everything I am interested in just to make sure it stays available, and thought that was only costing github an extra file link each time...
[+] wereHamster|10 years ago|reply
You are wrong. Before, they used a single git repository for all user repositories in a network, so the git objects in the original repository and all its forks were properly deduplicated. Deduplication across different networks does not make much sense.

As for overprovisioning: before, they had 4x the necessary disk space provisioned - 2 disks in RAID per machine, times 2 machines (active plus hot spare). Now they only need 3x, for the three copies.

[+] idorosen|10 years ago|reply
Objects don't need to be duplicated for forks. See the objects/info/alternates file. They're talking about redundancy for the object storage backend, which is independent of how many forks/clones/repositories. In fact, assuming they solve collision detection somehow, they could store all objects for all repositories in one object store and have all repos be thin wrappers with objects/info/alternates files forwarding to that object store repository...
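The alternates mechanism mentioned above is simple: `objects/info/alternates` lists extra object directories that lookups fall back to when an object isn't found locally. A rough model of the lookup (illustrative, not git's actual code):

```python
from pathlib import Path

def find_loose_object(repo_objects: Path, oid: str):
    """Look up a loose object, falling back to stores listed in info/alternates."""
    stores = [repo_objects]
    alternates = repo_objects / "info" / "alternates"
    if alternates.exists():
        # Each non-empty line names another object directory to search.
        stores += [Path(line) for line in alternates.read_text().splitlines()
                   if line.strip()]
    for store in stores:
        candidate = store / oid[:2] / oid[2:]   # e.g. objects/ce/0136...
        if candidate.exists():
            return candidate
    return None
```

A fork's object directory can therefore stay nearly empty, with all shared history resolved through the network repository it points at.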
[+] rctay89|10 years ago|reply
Compare this approach with Google's - GitHub sticks to the git-compatible 'loose' repository format on vanilla file storage, while Google uses packs on top of their BigTable infrastructure, which requires changes to repo access at the git level [1].

[1] https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/fi...

Interesting how GitHub is sounding like Google and Amazon. They're probably hitting the scale where it makes sense to build internal APIs and infrastructure abstractions to support their operations, e.g. Bigtable and S3. In fact, DGit sounds like another storage abstraction like Bigtable and S3, albeit more limited - e.g. a git repo must be stored fully on a single server (based on my cursory reading of GitHub's description of DGit), whereas in Bigtable the data is split into tablets, and the tablets that make up a table may be stored in different places, allowing higher utilization of resources.

[+] venantius|10 years ago|reply
Is there any plan to open source this?
[+] Camillo|10 years ago|reply
"dgit" is not the best of names, since git itself is already distributed, as they note. It would have been more accurate to call it "replicated git", or "rgit". I guess they just wanted to be able to pronounce it "digit".
[+] newjersey|10 years ago|reply
I agree.

I'll add that a person who pronounces git as JIT is probably a git. DGit sounds like 'd git' more than 'd JIT'.

[+] Artemis2|10 years ago|reply
Is this what's been causing the very poor availability of GitHub today?

https://status.github.com/messages

[+] eridius|10 years ago|reply
Unlikely. According to the article they've been rolling this out for months. It's not like they flipped a switch today to turn it on.
[+] simoncion|10 years ago|reply
off-the-wall (and highly unlikely) suggestion: GH unleashed an aggressive Chaos Monkey today for the purpose of testing DGit's reliability claims in production.
[+] ryao|10 years ago|reply
Not to try to make this sound less awesome than it is, but what happens if a proxy fails? Are the proxies now the weak point in GitHub's architecture?
[+] mmckeen|10 years ago|reply
The design of this seems very similar to GlusterFS, which has a very elegant design. It just acts as a translation layer for normal POSIX syscalls and forwards those calls to daemons running on each storage host, which then reproduces the syscalls on disk. This seems like very much the same thing except using git operations.
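The forwarding idea described above - replaying operations rather than shipping changed bytes - can be sketched as a thin translation layer (illustrative only; not GlusterFS or DGit internals):

```python
class OperationForwarder:
    """Translation layer: forward each high-level operation to every backend."""
    def __init__(self, daemons):
        self.daemons = daemons

    def forward(self, op_name, *args, **kwargs):
        # Replay the same logical operation on every backend, instead of
        # replicating the bytes it happened to change on one of them.
        return [getattr(d, op_name)(*args, **kwargs) for d in self.daemons]

class ToyDaemon:
    """Stand-in storage daemon that applies operations to local state."""
    def __init__(self):
        self.refs = {}

    def update_ref(self, ref, oid):
        self.refs[ref] = oid
        return oid
```

The same pattern works whether the forwarded operations are POSIX syscalls (GlusterFS) or git commands (DGit); only the vocabulary of operations changes.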
[+] notacoward|10 years ago|reply
Thanks for the compliment. While I as a GlusterFS developer would like to see us get more credit for the ways in which we've innovated, this is not such a case. The basic model of forwarding operations instead of changed data was the basis for AT&T UNIX's RFS in 1986 and NFS in 1989. Even more relevantly, PVFS2 already did this in a fully distributed way before we came along. I'd like to make sure those projects' authors get due credit as well, for blazing the path that we followed.
[+] nwmcsween|10 years ago|reply
I don't understand why they didn't just use Ceph. Ceph has all the features DGit was invented to provide.
[+] krakensden|10 years ago|reply
Just is the worst word in the English language.
[+] notacoward|10 years ago|reply
No, not really. I love my cousins over in Ceph-land, that's for sure, but asynchronous but fully ordered replication across data centers is not in their feature set. At the RADOS level replication is synchronous, so it's not going to work well with that kind of latency. At the RGW level it's async, but loses ordering guarantees that you'd need for something like this. Replicating the highest-level operations you can tap into - in this case git operations rather than filesystem or block level - is the Right Thing To Do.
[+] samuel1604|10 years ago|reply
Is there a whitepaper actually showing how this works?
[+] brazzledazzle|10 years ago|reply
I think your question may have been covered at the end:

>Over the next month we will be following up with in-depth posts on the technology behind DGit.

[+] glasz|10 years ago|reply
good lord. what a feat. tbh i'd have fucked this up.
[+] systems|10 years ago|reply
ok i don't get this ... shouldn't server availability problems be solved using traditional server-availability technologies, like cloud technologies?

why mess with git

[+] shadowmint|10 years ago|reply
Like what?

I'm going to tentatively suggest this is one of those 'hard' problems that throwing buzz words like 'cloud technologies' at doesn't solve.

What replication tech would you imagine solves this issue of distributing hundreds of thousands of constantly updated repositories?