> Until recently, we kept copies of repository data using off-the-shelf, disk-layer replication technologies—namely, RAID and DRBD. We organized our file servers in pairs. Each active file server had a dedicated, online spare connected by a cross-over cable.
With all the ink spilled about creative distributed architectures, it's really humbling to see how far they grew with an architecture that simple.
I'm not really surprised. I never worked at GH scale, but I learned early on that simple solutions just work. Want a DB cluster? Why not an active-passive M-M pair instead. Want an M-M setup? Why not a dual-server setup with block replication.
Complicated things fail in complicated ways (looking at you, MySQL NDB Cluster), while simple solutions just work. They may be less efficient, but you'd better have a great use case before spending time on a new, fancy clustering solution - and an even better idea of how to handle its state / monitoring / consistency.
The interesting bit here will be how they reconcile potentially conflicting changes between the replicas. Replicating the data itself is pretty easy, because git is content-addressable and can do garbage collection - I think even rsync would work. The challenge is that when e.g. the master branch is updated, you essentially have a simple key-value database that you must update deterministically. I look forward to learning how GitHub chose to solve that challenge!
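The "key-value database" framing above can be made concrete. A minimal sketch (my own illustration, not GitHub's actual design): the refs of a repository are a map from ref names to commit IDs, and a safe replicated update needs compare-and-swap semantics, in the same spirit as `git update-ref` with an expected old value.

```python
class RefStore:
    """Toy model of a repo's refs: a map from ref name to commit SHA."""

    def __init__(self):
        self.refs = {}

    def update_ref(self, name, new_sha, expected_old_sha):
        """Atomically move a ref, failing if someone else moved it first.

        This is the compare-and-swap every replica must agree on;
        without it, concurrent pushes to two replicas could diverge.
        """
        current = self.refs.get(name)
        if current != expected_old_sha:
            return False  # lost the race; caller must re-fetch and retry
        self.refs[name] = new_sha
        return True


store = RefStore()
assert store.update_ref("refs/heads/master", "abc123", None)          # create
assert store.update_ref("refs/heads/master", "def456", "abc123")      # fast-forward
assert not store.update_ref("refs/heads/master", "999999", "abc123")  # stale push rejected
```

Object replication needs no such coordination, because an object's name is a hash of its content; only this small mutable map does.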
rsync would only work well if there aren't any pack files (which reduce the disk space used, and access times). Pack files break the simplicity of the basic content-addressed store by compressing multiple objects together.
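For loose objects, the content-addressing that makes naive file sync safe is easy to demonstrate: git names a blob by the SHA-1 of a small header plus the raw content, so identical content always lands at the identical path on every replica. A short sketch:

```python
import hashlib


def loose_blob_id(data: bytes) -> str:
    """Compute the object ID git assigns to a blob: SHA-1 over a
    'blob <size>\\0' header followed by the raw content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


oid = loose_blob_id(b"hello\n")
# Matches `echo hello | git hash-object --stdin`; the loose object
# would live at .git/objects/ce/0136...
print(oid)  # ce013625030ba8dba906f756967f9e9ca394464a
```

Pack files lose this property per-file, as the comment above notes: one pack holds many delta-compressed objects, so two replicas repacking independently produce different files even when they hold the same objects.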
I always imagined they had some kind of huge database holding all of the git objects to avoid excessive duplication. Now it sounds like they duplicate objects for every trivial fork, times three! I fork everything I am interested in just to make sure it stays available, and thought that was only costing github an extra file link each time...
You are wrong. Previously, they used a single git repository for all the repositories in a network, so the git objects in the original repository and all its forks were properly deduplicated. Deduplication across different networks does not make much sense.
Now, as far as overprovisioning goes, they had 4 times as much disk space provisioned as necessary: 2 mirrored disks in RAID per machine, times 2 machines (active plus hot spare). Now they only need 3x, for the three copies.
Objects don't need to be duplicated for forks. See the objects/info/alternates file. They're talking about redundancy for the object storage backend, which is independent of how many forks/clones/repositories. In fact, assuming they solve collision detection somehow, they could store all objects for all repositories in one object store and have all repos be thin wrappers with objects/info/alternates files forwarding to that object store repository...
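The `objects/info/alternates` mechanism mentioned above is just a plain-text file inside a repository listing other object directories to search. A sketch with a hypothetical shared-store path (the layout is my illustration, not GitHub's actual one):

```
# .git/objects/info/alternates — one absolute object-directory path per line
/data/repositories/network-123/shared.git/objects
```

When git can't find an object in the fork's own `objects/` directory, it falls back to each listed directory before reporting the object missing, which is what lets thousands of forks share one physical object store.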
Compare this approach with Google's - github sticks to the git-compatible 'loose' repository format on vanilla filestorage, while google uses packs on top of their BigTable infrastructure, and requires changes to repo-access at the git-level [1].
Interesting how GitHub is sounding like Google and Amazon. They're probably hitting the scale where it makes sense to build internal APIs and infrastructure abstractions to support their operations, e.g. Bigtable and S3. In fact, DGit sounds like another storage abstraction like Bigtable and S3, albeit limited - e.g. a git repo must be stored fully on a single server (based on my cursory reading of GitHub's description of DGit), whereas in Bigtable, data is split into tablets, and the tablets that comprise a table might be stored in different places, which allows higher utilization of resources.
"dgit" is not the best of names, since git itself is already distributed, as they note. It would have been more accurate to call it "replicated git", or "rgit". I guess they just wanted to be able to pronounce it "digit".
off-the-wall (and highly unlikely) suggestion: GH unleashed an aggressive Chaos Monkey today for the purpose of testing DGit's reliability claims in production.
The design of this seems very similar to GlusterFS, which has a very elegant design. It just acts as a translation layer for normal POSIX syscalls and forwards those calls to daemons running on each storage host, which then reproduces the syscalls on disk. This seems like very much the same thing except using git operations.
Thanks for the compliment. While I as a GlusterFS developer would like to see us get more credit for the ways in which we've innovated, this is not such a case. The basic model of forwarding operations instead of changed data was the basis for AT&T UNIX's RFS in 1986 and NFS in 1989. Even more relevantly, PVFS2 already did this in a fully distributed way before we came along. I'd like to make sure those projects' authors get due credit as well, for blazing the path that we followed.
No, not really. I love my cousins over in Ceph-land, that's for sure, but asynchronous but fully ordered replication across data centers is not in their feature set. At the RADOS level replication is synchronous, so it's not going to work well with that kind of latency. At the RGW level it's async, but loses ordering guarantees that you'd need for something like this. Replicating the highest-level operations you can tap into - in this case git operations rather than filesystem or block level - is the Right Thing To Do.
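The "replicate the highest-level operations" idea described above can be sketched in a few lines. This is a toy illustration of the general technique (not any real system's protocol): a primary fans the named operation and its arguments out to each replica, instead of shipping changed blocks, and deterministic operations applied in the same order make all copies converge.

```python
class Replica:
    """One storage host; applies forwarded operations to local state."""

    def __init__(self):
        self.refs = {}

    def apply(self, op, *args):
        if op == "set-ref":
            name, sha = args
            self.refs[name] = sha
        elif op == "delete-ref":
            self.refs.pop(args[0], None)


class Primary:
    """Fans each operation out to every replica, in order."""

    def __init__(self, replicas):
        self.replicas = replicas

    def do(self, op, *args):
        for r in self.replicas:  # forward the operation itself, not the bytes
            r.apply(op, *args)


replicas = [Replica(), Replica(), Replica()]
primary = Primary(replicas)
primary.do("set-ref", "refs/heads/master", "abc123")
assert all(r.refs == {"refs/heads/master": "abc123"} for r in replicas)
```

Forwarding operations rather than data is what lets the scheme tolerate cross-datacenter latency: a ref update is a few bytes regardless of how large the underlying change is.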
ianlevesque|10 years ago
viraptor|10 years ago
justinsb|10 years ago
LukeShu|10 years ago
jvoorhis|10 years ago
I wonder whether the receive-pack operation offers a natural boundary for transactions?
lotyrin|10 years ago
drewm1980|10 years ago
siong1987|10 years ago
Under "Your own fork of Rails", you will see how it actually works. The answer to your question is "no, they don't store 3 copies of the same repo".
wereHamster|10 years ago
idorosen|10 years ago
petemill|10 years ago
justinsb|10 years ago
rctay89|10 years ago
[1] https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/fi...
venantius|10 years ago
Camillo|10 years ago
newjersey|10 years ago
I'll add that a person who pronounces git as JIT is probably a git. Dgit sounds like the git more than d JIT.
Artemis2|10 years ago
https://status.github.com/messages
eridius|10 years ago
simoncion|10 years ago
ryao|10 years ago
mmckeen|10 years ago
notacoward|10 years ago
nwmcsween|10 years ago
krakensden|10 years ago
notacoward|10 years ago
samuel1604|10 years ago
brazzledazzle|10 years ago
>Over the next month we will be following up with in-depth posts on the technology behind DGit.
glasz|10 years ago
systems|10 years ago
why mess with git
shadowmint|10 years ago
I'm going to tentatively suggest this is one of those 'hard' problems that throwing buzzwords like 'cloud technologies' at doesn't solve.
What replication tech would you imagine solves this issue of distributing hundreds of thousands of constantly updated repositories?
wmf|10 years ago
rlpb|10 years ago
Please address this before creating future hell for distributions.
douglasfshearer|10 years ago