
Rearchitecting GitHub Pages

477 points | samlambert | 10 years ago | githubengineering.com

149 comments

[+] kyledrake|10 years ago|reply
Pretty similar to how Neocities serves static sites (https://neocities.org).

There are a few differences. We don't use SQL in the routing chain; we use a regex to pick out the site name and then serve from a directory of the same name (this is NOT as bad as it sounds, most filesystems can handle this quite well now and take MUCH more than half a million sites to bottleneck).
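A minimal sketch of that regex-based routing idea in Python (the pattern and the `/var/www/sites` base directory are hypothetical, not Neocities' actual code):

```python
import re

# Hypothetical pattern for *.neocities.org hostnames.
SITE_RE = re.compile(r"^([a-z0-9][a-z0-9-]*)\.neocities\.org$")

def docroot(host, base="/var/www/sites"):
    # Pick the site name out of the hostname and map it to a
    # same-named directory; no database lookup in the hot path.
    m = SITE_RE.match(host.lower())
    if not m:
        return None
    return f"{base}/{m.group(1)}"
```

The filesystem's directory lookup then does the work a SQL query would otherwise do.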

DRBD is also a little hardcore for my tastes. Nothing wrong with it, I just don't know it well, and I don't like being dependent on things I don't know how to debug.

An alternative I wanted to show uses inotify, rsync and ssh combined into a simple replication daemon. It's obviously not as fast, but if you enable persistent SSH connections, it's not too bad. If it screws up, you can just run rsync. Rumor has it the Internet Archive uses an approach not too far away from this for Petabox. Check it out if you're looking for something a little more lightweight for real-time replication: https://code.google.com/p/lsyncd/
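For flavor, here's a rough Python stand-in for that style of replication (lsyncd itself watches inotify events; this sketch just re-runs rsync on an interval, and the host/path names are made up):

```python
import subprocess
import time

def rsync_cmd(src, dest_host, dest_dir,
              ssh_opts="-o ControlMaster=auto -o ControlPersist=60"):
    # Build the rsync-over-ssh command. Persistent SSH connections
    # (ControlPersist) avoid handshake overhead on frequent syncs.
    return ["rsync", "-az", "--delete", "-e", f"ssh {ssh_opts}",
            src.rstrip("/") + "/", f"{dest_host}:{dest_dir}"]

def replicate_forever(src, dest_host, dest_dir, interval=2.0):
    # Naive stand-in for inotify: re-run rsync on a short interval.
    # rsync is idempotent, so a missed or failed pass self-heals,
    # which is the "if it screws up, just run rsync" property above.
    while True:
        subprocess.run(rsync_cmd(src, dest_host, dest_dir), check=False)
        time.sleep(interval)
```

It's slower to converge than a real inotify watcher, but the failure mode is just "run it again".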

We're still working on open sourcing (!) our serving infrastructure, so eventually you will be able to see all of the code we use for this (sans the secrets and keys, of course). I've just been having trouble coming up with a good solution for doing this. For now, enjoy the source of our web app: https://github.com/neocities/neocities

[+] haileys|10 years ago|reply
> Nothing wrong with it, I just don't know it well, and I don't like being dependant on things I don't know how to debug.

Yep, this is basically our approach as well.

We've been using DRBD for quite a long time now on our Git fileservers (which also run in active/standby pairs - in fact, they look a lot like our Pages fileservers) so we have quite a lot of in-house experience with it and it's a technology we're pretty comfortable with. Given this, using it for the new Pages infrastructure was a pretty straight-forward decision.

[+] kyledrake|10 years ago|reply
Here's our current nginx config on the proxy server. I've got the DDoS pseudo-protection (there's another layer upstream) and caching turned off right now because we're working on something, but this is basically it:

https://kyledrake.neocities.org/misc/nginx.conf.txt

Critique away. As you can see, we've just barely avoided pulling out the lua scripting.

The next step for me would probably be to write something in Node.js or Go. There are probably a lot of people cringing at that thought right now, but it's actually pretty good for this sort of work, and I'd really like to be able to do things like on-demand SSL registration and sending logs via a distributed message queue. Hacking nginx into doing this sort of thing has diminishing returns; we're kind of at the wall as-is.

[+] e12e|10 years ago|reply
> We're still working on open sourcing (!) our serving infrastructure, so eventually you will be able to see all of the code we use for this (sans the secrets and keys, of course).

I just want to applaud what you've been doing with Neocities -- when the project started, I thought "Oh, nice." -- but not much more -- but I love the fact that you've kept at it, and your approach to openness is great (pending infrastructure code notwithstanding). I especially like your stats page:

https://neocities.org/stats

(Which I found from your excellent update blog-post[1] -- but I think it could be even more discoverable. It's not linked from the donate/about pages?)

I hope your financial situation improves -- and still: I wonder how (almost) half a cent of revenue/month compares to most ad-funded startup sites? Though you'll need... a "few" more to reach your goal. Actually, you "just" need 43x as many users with the same revenue/head to get there :-)

ed: clarity (hopefully)

[1] https://neocities.org/blog/the-new-neocities

[+] nichochar|10 years ago|reply
I didn't know about this, but really wanted this to exist, and it makes me happy! Keep it up
[+] Pfiffer|10 years ago|reply
We used a master/master DRBD setup at a previous company, and it was kind of a pain to work with. We had a fairly extensive document for solving split-brain problems.

I imagine the problems with DRBD mostly disappear if you're using it properly though, master/slave setups probably work really well.

[+] shanemhansen|10 years ago|reply
lsyncd seems like a really cool project. However, in practice I used it to replicate a docroot across 3 servers and it actually got out of sync pretty often.
[+] jrochkind1|10 years ago|reply
All of github pages was run off of _one_ server? (with one failover standby).

That's pretty amazing. If all you're serving is static assets, apparently you have to grow to pretty huge scale before one server will not be sufficient.

I'm curious if there was at least a caching layer, so every request didn't hit the SSD. They didn't mention it.

[+] haileys|10 years ago|reply
We do have Fastly in front of *.github.io, but there's still a significant amount of traffic (on the order of thousands of requests per second throughout the day) that makes it through to our own infrastructure.

We don't do any other caching on our own, although the other replies are correct in that the Linux kernel has its own filesystem cache which means not all requests end up hitting the SSD.

[+] vog|10 years ago|reply
I believe this is automatically done by the operating system. Nginx usually performs some kind of mmap of the requested file, and the virtual memory management will automatically take care of caching.

BTW, I'm often surprised how people are afraid of opening files from their scripts, as if they think this will always lead to disk access. Then they start implementing a hand-written caching layer on top of that, which usually performs worse than what the OS already offers.

[+] kyledrake|10 years ago|reply
Correct. Static file serving is amazingly efficient. Even the filesystem caches for you automagically with whatever RAM you're not using. There's even a kernel function for speeding up file transfers, that's how optimized this stuff is: http://man7.org/linux/man-pages/man2/sendfile.2.html
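The sendfile(2) call linked above is exposed in Python as `os.sendfile` (Linux/macOS), so the zero-copy path is easy to demo. A small sketch (the payload and socketpair are just for illustration; a real server would hand the file to a client socket):

```python
import os
import socket
import tempfile

def sendfile_demo(payload: bytes) -> bytes:
    # Write the payload to a temp file, then let the kernel push it
    # straight to the socket with sendfile(2) -- the bytes never pass
    # through a userspace buffer on the sending side.
    a, b = socket.socketpair()
    try:
        with tempfile.TemporaryFile() as src:
            src.write(payload)
            src.flush()
            os.sendfile(a.fileno(), src.fileno(), 0, len(payload))
        a.close()
        return b.recv(len(payload))
    finally:
        b.close()
```

This is the kind of optimization nginx uses under the hood when `sendfile on;` is set.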

My rough estimates are that Neocities can handle 20-50 million sites with only two fileservers and a sharding strategy. So double it for each shard, and you've got a solution that actually scales pretty well.

[+] bkeroack|10 years ago|reply
It's not that surprising. Serving static files is pretty easy and lightweight (part of why I like the movement towards SPAs), and there's been 20+ years of work done to make it fast. We've been conditioned for years by dynamic application servers that perform multiple DB queries, external service calls, etc per request. It doesn't have to be slow or inefficient.
[+] nomel|10 years ago|reply
The SSD cache and OS filesystem cache are always there. I'm sure the boxes were filled with RAM, and that's all it would be used for.
[+] adriancooney|10 years ago|reply
I'm amazed that Github Pages ran on just two servers (well, aside from the MySQL clusters). That is absolutely incredible given the sheer amount of projects and people who rely on it for their sites (me included!). I love the philosophy behind Pages of abstaining from over-engineering and sticking to the simple, proven solutions. It's a great service and I'm a massive fan.
[+] VeejayRampay|10 years ago|reply
Well done GitHub. Also a special mention to the invisible workers making nginx such a cornerstone of the modern infrastructure, it's a project that I don't hear about often, probably due to the fact that it's not the sexiest piece of technology, but it really seems solid and battle-tested. Kudos.
[+] nicolewhite|10 years ago|reply
I've been using GitHub pages for a while now and I always wondered why they had the "your site may not be available for another 30 minutes" message on creating a new GitHub pages site while pushes to an already-existing gh-pages branch were displayed instantly. Neat to see that explained here.
[+] maxmcd|10 years ago|reply
This is all sitting behind a CDN correct?[1] Might explain why it was able to survive on two servers for so long.

Or is that automatically assumed when reading about a static hosting setup?

1. https://www.fastly.com/customers/github/

[+] manigandham|10 years ago|reply
It's mentioned in the article.

> We also have Fastly sitting in front of GitHub Pages caching all 200 responses. This helps minimise the availability impact of a total Pages router outage. Even in this worst case scenario, cached Pages sites are still online and unaffected.

[+] mwcampbell|10 years ago|reply
Only tangentially related, but I sometimes wonder if GitHub's management regret making GitHub Pages available for free, now that it's being used so heavily for personal and even business blogs, rather than just companion sites for open-source projects. They could be charging for static websites, as Amazon S3 does.
[+] holman|10 years ago|reply
I never heard anyone gripe about it... not even once. The cost is pretty negligible, and there's a lot of halo benefit (i.e., you just get more people involved on GitHub the platform itself).

The fact that a lot of non-technical employees in marketing and other fields are using it for corporate blogs is actually a nice bit of pressure on the organization to make Pages and web editing even simpler for those users. It becomes harder to lean on "oh it's a developer site so they'll figure it out".

Mostly, though, I think it's just a matter that we wanted it for ourselves. It's pretty awesome from an industry bystander's perspective to have something free, simple, and static, so we can all benefit from more stable docs, blogs, and so on. Maybe that'll change in the future and something Totally Different will change the industry, but for right now I think it's pretty rad, and totally worth the investment.

[+] ceejayoz|10 years ago|reply
If it was able to run off a single server for that long, I suspect the goodwill it engendered (as well as the familiarity with Git/Github.com it built in a lot of people) was well worth the minimal resources and cost it entailed.
[+] dyladan|10 years ago|reply
I doubt it. After seeing how simple the architecture is to run it, I'm sure it's a drop in the proverbial bucket. Pages drives traffic to the site and in order to serve a site from a private repo you have to be a paying customer anyways.
[+] jsingleton|10 years ago|reply
Nice! Does HN still run off of a single server and CDN too?

The CDN is key here, which you get if you use a CNAME (or ALIAS) instead of an A record for your custom domain on GH pages. I've found pairing pages with CloudFlare works great if you want to use a naked domain and you get HTTPS too. You can set up a page rule on CF to redirect all HTTP to HTTPS as well.

[+] nvk|10 years ago|reply
It's time for GitHub to start offering some basic hosting infrastructure for small projects, a light Heroku, at least for JavaScript (which kind of already works).

I'd pay extra for that. I (we all) have a bunch of personal sites, landing pages, marketing sites and tiny side projects that I'd love to not have to deal with hosting for. I think they'd make a killing, but I also think it must be in the works.

[+] spdionis|10 years ago|reply
I think the leap they'd have to make in infrastructure and architecture to support that is not worth it in their mind. But who knows.
[+] lstoll|10 years ago|reply
Heroku already does this pretty well, not sure what the benefit would be?
[+] ngrilly|10 years ago|reply
Great summary of your architecture. Thanks for sharing!

A few questions:

- Is everything in the same datacenter or in different datacenters? What happens if the datacenter is unavailable for some reason? Are data replicated somewhere?

- You moved from 2 machines to at least 10 (at least 2 load balancers, 2 front ends, 1 MySQL master, 1 MySQL slave and 2 pairs of fileservers). That's a lot more. Do you need more machines because you need more capacity (to serve the growing traffic) or just because the new architecture is more distributed and requires more machines by "definition"?

- I understand the standby fileservers are idle most of the time: reads go to the active fileserver, only writes are replicated to the standby. Am I understanding correctly? If yes, it looks like "wasted" capacity?

[+] jsingleton|10 years ago|reply
Something I would really like is to be able to set the custom MIME type for an app cache manifest file. That way you could easily host offline web apps from GH pages. Anyone know a way to do this without using S3 or similar?

https://en.wikipedia.org/wiki/Cache_manifest_in_HTML5#Basics

[+] ryanseys|10 years ago|reply
You shouldn't need to specify a custom one. GitHub Pages will automatically serve the file with the appropriate mime type given its file extension. Here [1] is the list.

[1]: https://github.com/jekyll/jekyll/blob/master/lib/jekyll/mime...

Edit: As you can see in that link, both .manifest and .appcache file extensions map to text/cache-manifest mime type.
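That extension-to-MIME mapping can be mimicked locally with Python's `mimetypes` module (a sketch of the same idea, not how Pages itself serves files):

```python
import mimetypes

# Build a private mapping and register the cache-manifest type for
# both extensions, mirroring the Jekyll list linked above.
mt = mimetypes.MimeTypes()
mt.add_type("text/cache-manifest", ".manifest")
mt.add_type("text/cache-manifest", ".appcache")
```

With that in place, `mt.guess_type("offline.appcache")` resolves to `text/cache-manifest`, which is what a static server keyed on extensions would emit.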

[+] datums|10 years ago|reply
Have you thought about using BIND as the DB for the routing? An internal DNS lookup for the storage node: storage.url -> 10.0.12.1
[+] BillinghamJ|10 years ago|reply
Seems odd to me that the router hits a MySQL database on every single request rather than just hashing the hostname as the key for the filesystem node.
[+] tdicola|10 years ago|reply
Hash-based partitioning has a big problem that when you change the hash size all of the data moves around. Eventually you'll need to do a lookup-based partitioning scheme. You also probably want control over where some users live since you don't want two super hot users on the same server.
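That resharding pain is easy to demonstrate with naive hash-mod routing (hostnames and node counts here are made up for illustration):

```python
import hashlib

def node_for(key: str, n_nodes: int) -> int:
    # Naive hash partitioning: hash the hostname, mod the node count.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_nodes

# When the cluster grows from 4 to 5 nodes, roughly 4/5 of all keys
# change owner, forcing a mass data migration.
sites = [f"site{i}.example.io" for i in range(1000)]
moved = sum(1 for s in sites if node_for(s, 4) != node_for(s, 5))
```

Consistent hashing shrinks that churn to roughly 1/n of the keys, but a lookup table (like the MySQL routing table in the article) sidesteps it entirely and also lets you place hot users by hand.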
[+] linc01n|10 years ago|reply
I thought GitHub Pages was running on Riak and Webmachine from 2012[0].

[0] https://speakerdeck.com/jnewland/github-pages-on-riak-and-we...

[+] jnewland|10 years ago|reply
I threw out that prototype soon after the talk. At the time, there weren't a lot of other engineers at the company doing Erlang, so maintenance was considered to be a long-term problem. I'm glad we made that call.
[+] tracker1|10 years ago|reply
It seems to me they could have gone a step further with something like Cassandra. With a Cassandra cluster, they could have used the domain name plus route in question as the partition key, then looked up the resource path (excluding querystring params) against that entry to find a single resource in Cassandra and return it directly.

A preliminary hit against a domain forwarder would be a good idea as well, but for those CNAME domains, dual-publishing might be a better idea... where the github name would be a pointer for said redirect.

While Cassandra itself might not be quite as comfortable as, say, MySQL, in my mind this would have been a much better fit... replacing the file servers and the database servers with a Cassandra cluster. Any server would be able to talk to the cluster and resolve a response with fewer round trips and requests... though the gossip in Cassandra would probably balance/reduce some of that benefit.

[+] samlambert|10 years ago|reply
Adding a database that is new to GitHub would not be a pragmatic move.
[+] Sir_Cmpwn|10 years ago|reply
I remember the time I mistakenly drove huge amounts of traffic to Github Pages, believing they had the infrastructure to handle it. I apologise for last year's downtime :)

Glad to hear it's being improved. I'm impressed that it was able to run on such simple infrastructure for so long.

[+] cddotdotslash|10 years ago|reply
I doubt you caused any downtime. All the page content is behind a CDN.
[+] methyl|10 years ago|reply
I'm wondering why GitHub prefers MySQL over PostgreSQL.
[+] samlambert|10 years ago|reply
Stable, proven, popular, great roadmap, great replication story, great tooling and an awesome community.
[+] nakovet|10 years ago|reply
From OP: > we made sure to stick with the same ideas that made our previous architecture work so well: using simple components that we understand and avoiding prematurely solving problems that aren't yet problems

So, if their team has lots of experience with MySQL but not so much with PostgreSQL, that could be a good reason to prefer one over the other.

[+] misterbee|10 years ago|reply
Full ACID reliability is not mission critical for them.
[+] baghali|10 years ago|reply
Question for Githubbers:

Have you considered using cluster file systems such as GlusterFS or Ceph?

[+] kyledrake|10 years ago|reply
I looked into GlusterFS at one point. GlusterFS is a no-go for static file serving in hostile environments. It asks every node to look for a file, even if it's not there. You can imagine the DDoS attacks you could build here using a bunch of 404 requests for files that don't exist.

One story I heard from a PHP dev is that it would take 30 seconds to load a page while it looked for all the files needed to run it.

[+] samlambert|10 years ago|reply
I don't know if it has been considered; however, we do have a strong pattern for using DRBD.
[+] el33th4xx0r|10 years ago|reply
Surprisingly, they use MySQL (instead of a currently hyped K/V store) to map hostnames to fileservers.
[+] charliesome|10 years ago|reply
MySQL is actually a really good key value store!

Here's the schema we use for the routing information:

    CREATE TABLE `pages_routes` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `user_id` int(11) NOT NULL,
      `host` varchar(255) NOT NULL,
      PRIMARY KEY (`id`),
      UNIQUE KEY `index_pages_routes_on_user_id` (`user_id`)
    );
Since we use MySQL for everything else, we decided it made the most sense to keep this routing data here rather than introducing a new database.
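For a self-contained illustration of that routing lookup, here's the same idea against SQLite (types adapted: MySQL's `int(11)`/`varchar(255)` become `INTEGER`/`TEXT`; the query and data are hypothetical, not GitHub's actual router code):

```python
import sqlite3

# SQLite stand-in for the pages_routes table above.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE pages_routes (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL UNIQUE,
        host    TEXT NOT NULL
    )
""")
db.execute("INSERT INTO pages_routes (user_id, host) VALUES (?, ?)",
           (42, "example.github.io"))

def route(host):
    # One indexed lookup per request: hostname in, owning user out.
    row = db.execute("SELECT user_id FROM pages_routes WHERE host = ?",
                     (host,)).fetchone()
    return row[0] if row else None
```

Since the table is tiny and hot, the whole thing lives in the database's buffer pool, so the per-request cost is essentially a memory lookup.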