
Details on today's Facebook outage

108 points | far33d | 15 years ago | facebook.com

51 comments

[+] davidu|15 years ago|reply
This is known generally as the "Thundering Herd" problem:

The thundering herd problem occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time. After the processes wake up, they all demand the resource and a decision must be made as to which process can continue. After the decision is made the remaining processes are put back to sleep, only to wake up again to request access to the resource.

This occurs repeatedly, until there are no more processes to be woken up. Because all the processes use system resources upon waking, it is more efficient if only one process is woken up at a time.

This may render the computer unusable, but it can also be used as a technique if there is no other way to decide which process should continue (for example when programming with semaphores).

Though the phrase is mostly used in computer science, it could be an abstraction of the observation seen when cattle are released from a shed or when wildebeest are crossing the Mara River. In both instances, the movement is suboptimal.

From: http://en.wikipedia.org/wiki/Thundering_herd_problem
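A minimal Python sketch of the wake-one fix described above (all names invented for illustration): since only one waiter can claim the resource at a time, waking every waiter with `notify_all` just recreates the herd, while `notify` hands the resource to exactly one process at a time.

```python
import threading

cond = threading.Condition()
resource_free = False
served = []

def worker(name):
    global resource_free
    with cond:
        cond.wait_for(lambda: resource_free)
        resource_free = False      # claim the resource
        served.append(name)
        cond.notify_all()          # tell the releaser it was taken

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()

for _ in range(5):
    with cond:
        resource_free = True
        cond.notify()              # wake exactly ONE waiter, not the herd
        cond.wait_for(lambda: not resource_free)

for t in threads:
    t.join()
print(sorted(served))  # [0, 1, 2, 3, 4]
```

With `cond.notify_all()` in the releaser instead, all five workers would wake on every release, four of them only to re-check the predicate and go back to sleep: wasted wakeups, which is exactly the herd.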

[+] schrep|15 years ago|reply
We actually encounter Thundering Herd problems on a very regular basis. The Starbucks page has nearly 14M fans and posts may get tens of thousands of comments/likes with a high update rate. You have a lot of readers on a frequently changed value which means it is not often current in cache and you can have a pileup on the database.

Since we encounter this on a regular basis we have built a few different systems to gracefully handle them.

Unfortunately, the event today was not just a thundering herd because the value never converged. All clients who fetched the value from a db thought it was invalid and forced a re-fetch.
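The non-converging loop schrep describes can be sketched like this (a hypothetical simplification, not Facebook's actual code): every client that re-fetches the value from the DB judges it invalid, purges the cache entry, and triggers yet another fetch, so the load never dies down.

```python
# Hedged sketch of the failure mode: the validity check is buggy, so the
# "corrected" value in the DB still fails it, and every retry purges the
# cache and hits the database again. All names here are illustrative.
cache = {}

def db_read(key):
    return "new-config"        # the DB already holds the corrected value

def is_valid(value):
    return False               # buggy check: even the fixed value fails

def get(key, max_attempts=5):
    attempts = 0
    while attempts < max_attempts:      # real clients had no such cap
        attempts += 1
        if key in cache and is_valid(cache[key]):
            return cache[key], attempts
        value = db_read(key)            # cache miss -> hit the database
        if is_valid(value):
            cache[key] = value
            return value, attempts
        cache.pop(key, None)            # "invalid": purge and retry
    return None, attempts

value, attempts = get("config")
print(value, attempts)   # None 5: never converges, every retry is a DB read
```

Unlike a plain thundering herd, which dissipates once one request refills the cache, this loop generates DB load forever until the bad check or the bad value is fixed by hand.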

[+] Fluxx|15 years ago|reply
This is a common pitfall when people start using Memcache. Memcache puts less load on the DB, which means you can keep the same DB hardware and scale up your pageviews...until Memcache craps out and all that load is now back on the DB again :(
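One common mitigation for this pitfall is to guard cache refills with a lock, so that on a miss only one caller hits the DB while the rest wait for the refilled entry. A minimal sketch, with invented names and a dict standing in for Memcache:

```python
import threading

cache = {}
cache_lock = threading.Lock()
db_calls = 0

def query_db(key):
    global db_calls
    db_calls += 1          # stands in for an expensive database read
    return f"value-for-{key}"

def get(key):
    val = cache.get(key)
    if val is not None:
        return val         # fast path: cache hit, no lock taken
    with cache_lock:       # only one thread refills; others block briefly
        val = cache.get(key)   # re-check: another thread may have filled it
        if val is None:
            val = query_db(key)
            cache[key] = val
        return val

threads = [threading.Thread(target=get, args=("starbucks",)) for _ in range(20)]
for t in threads: t.start()
for t in threads: t.join()
print(db_calls)   # 1: twenty concurrent misses collapse to one DB read
```

In a real distributed cache the lock would itself live in the cache (e.g. an add-if-absent sentinel key) rather than in process memory, but the shape of the fix is the same.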
[+] sbov|15 years ago|reply
When stuff breaks, I find that this phenomenon can often make it hard to differentiate the source of your problem from its symptoms. We usually would sort it out by going through the logs and seeing which problems showed up first.

Are there any architectures/patterns/methods that can help make it easier to find the source of performance issues?

[+] ora600|15 years ago|reply
I'm often encountering systems that are designed for very short queries processing small numbers of rows where the number of connections from the app servers is configured to be far greater than the number of CPUs.

Since those systems are doing very little IO, configuring connection pools to start with many more connections than the DB has CPUs, and to add more when the existing connections are busy (i.e. the DB is getting slower, probably because it is highly loaded), is guaranteed to cause resource contention on the DB and to escalate the issue when something goes wrong.

It's a total waste, and yet it's the most common configuration in the world.
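The sizing rule ora600 describes can be sketched with a semaphore sized near the core count: excess requests queue in the application instead of piling more busy connections onto an already loaded DB. Names and numbers here are illustrative.

```python
import os
import threading

max_connections = os.cpu_count() or 4     # pool ceiling near the DB's core count
pool = threading.BoundedSemaphore(max_connections)
in_flight = 0
peak = 0
stats_lock = threading.Lock()

def run_query():
    global in_flight, peak
    with pool:                 # excess callers wait here, not on the DB
        with stats_lock:
            in_flight += 1
            peak = max(peak, in_flight)
        # ... the short, CPU-bound query would run here ...
        with stats_lock:
            in_flight -= 1

threads = [threading.Thread(target=run_query) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
assert peak <= max_connections   # the DB never sees more work than it has cores
```

The key property is that the ceiling is fixed: the pool does not grow in response to slowness, which is precisely the feedback loop that turns a loaded DB into a dead one.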

[+] mikey_p|15 years ago|reply
I've also heard this called a Cache Stampede, since it's commonly caused by the sudden invalidation of a cache, which was the case for this outage.
[+] cliffchang|15 years ago|reply
I know this isn't reddit, but when you say "thundering herd", do you mean the processes, or the comments on the story?
[+] cageface|15 years ago|reply
This certainly isn't the first time that a system designed to increase uptime has actually reduced it. I've seen a lot of "redundant" systems that are actually less reliable than simple standalone ones, thanks to all the extra complexity of clustering.

I guess at Facebook's scale you have to build in fallbacks but this is a reminder that you can easily do more harm than good.

[+] smakz|15 years ago|reply
Very true - the complexity of the system also comes into the equation when things go wrong and it takes real people to figure out why an outage is happening. Highly complex systems imply longer debugging time, and at a certain point a theoretically lower uptime can give you higher practical uptime, simply because engineers can actually understand and debug the system.
[+] ora600|15 years ago|reply
Sometimes cluster systems are mistaken for high-availability solutions while they are actually load-balancing solutions and can decrease availability due to added complexity and dependencies.
[+] Someone|15 years ago|reply
I am not sure that increasing uptime was a goal for this system. The way I read the story, this is about a system designed to make it easier to manage a server farm, which was misconfigured and not robust against such configuration errors. So it is more about a system designed to keep a server farm consistent. Staging those configuration changes could be a way to decrease the effects of future errors of this kind, as it gives operators time to notice that something is amiss.
[+] jasonwatkinspdx|15 years ago|reply
The other thing is that systems that cut across your architecture need to be handled with utter paranoia. They'll be the coupling points that cause global outages.
[+] sethwartak|15 years ago|reply
My favorite comments on fb page:

  Stick with Mysql

  * * * * If it aint broke, dont fix it!

  Me too Melissa, and it's out there in the media that a group of hackers caused the problem, is this true, Mr Robert Johnson?

  PLEASE !!!! WHAT CAN YOU DO TO HELP THIS FROM EVER HAPPENING AGAIN???????????????? PLEASE!!!!!!!!!!!!!!! CANDY

  Kip da updates comin'

  Did anyone get a message like I did about someone trying to access your account from another state?
[+] swaarm|15 years ago|reply
I actually came across a pretty good comment which explains it (to the layman) pretty well:

"Marvin, the server was like a dog chasing its tail...it kept going in circles, but never caught it. They basically had to hold the tail for the dog so he could bite a flea on it. :) LOL"

[+] catshirt|15 years ago|reply
"This means that Facebook is a database-driven app :D LOL , just refactor FB, you know it is a mess."
[+] riffer|15 years ago|reply
Yeah, I noticed that too, it's kind of sad when you think about it
[+] jrockway|15 years ago|reply
The text of the article is:

"You are using an incompatible web browser.

Sorry, we're not cool enough to support your browser. Please keep it real with one of the following browsers:"

This is why I don't use Facebook. It's not 1990 anymore. You don't need User-Agent sniffing.

[+] rimantas|15 years ago|reply
Whoa, who did User-Agent sniffing in 1990?
[+] itistoday|15 years ago|reply
That was an excellent description of the problem. Too excellent, it seems, for many of the commenters. :-p
[+] duck|15 years ago|reply
Yeah, I don't know how regular users find posts like this... but that was way over everyone's head.
[+] AgentConundrum|15 years ago|reply
For a second there, while reading the comments, I wasn't entirely sure I hadn't accidentally been redirected to a redesigned YouTube.

Interesting article, terrible comments.

[+] ergo98|15 years ago|reply
They built their own custom DNS server? Most of the failures that people encountered were a failure to contact the nameserver itself. Perhaps in the rush to try to fix it someone screwed up that as well.
[+] mbreese|15 years ago|reply
It seems like that was their attempt to "shut down" the site. It's a pretty effective way, too... The easiest way to stop the stampede was to drop off the net completely. By tweaking their DNS, they were able to give themselves enough room to breathe. Then they could slowly start to bring people back.

That's just a guess. But from their post, it seems reasonable.

[+] schrep|15 years ago|reply
This was a result of us throttling traffic to the site - not the original outage.
[+] dangrossman|15 years ago|reply
Or maybe disabling DNS was how they purposely took down the site, then slowly let people back in, as they said.
[+] brown9-2|15 years ago|reply
Perhaps the DNS issue was their way of shutting off the site.
[+] praeclarum|15 years ago|reply
So umm no mention of the 4chan DOS attack? I mean, not that I hang out there or anything, but a friend told me that they organized an attack. You'know. Jus sayin. /b/ye
[+] ramidarigaz|15 years ago|reply
DOSing Facebook? It would require a huge effort to create a measurable increase in Facebook's traffic.
[+] xentronium|15 years ago|reply
Don't be too naive. The 4chan guys are great but not at that scale.

After facebook's outage started, there were like 9128 threads in /b/ with trolls claiming it was their success.