davidu | 15 years ago:
This is known generally as the "Thundering Herd" problem:
The thundering herd problem occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time. After the processes wake up, they all demand the resource and a decision must be made as to which process can continue. After the decision is made the remaining processes are put back to sleep, only to wake up again to request access to the resource.
This occurs repeatedly, until there are no more processes to be woken up. Because all the processes use system resources upon waking, it is more efficient if only one process is woken up at a time.
This may render the computer unusable, but it can also be used as a technique if there is no other way to decide which process should continue (for example when programming with semaphores).
Though the phrase is mostly used in computer science, it could be an abstraction of the observation seen when cattle are released from a shed or when wildebeest are crossing the Mara River. In both instances, the movement is suboptimal.
From: http://en.wikipedia.org/wiki/Thundering_herd_problem
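The wake-all behaviour described in the quote is easy to reproduce with a handful of threads. A minimal, illustrative Python sketch: `notify_all()` wakes every waiter at once, but they can only reacquire the condition's lock one at a time, which is the serialized scramble the herd metaphor refers to.

```python
import threading

cond = threading.Condition()
ready = False
proceeded = []

def worker(i):
    with cond:
        # All waiters wake when notify_all() fires, but they reacquire
        # the lock one at a time: that serialized scramble is the herd.
        while not ready:
            cond.wait()
        proceeded.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()

with cond:
    ready = True
    # Wakes all 8 workers even though only one can proceed at a time.
    # cond.notify() would wake a single waiter instead; wake-one schemes
    # (each waiter notifying the next) avoid the herd.
    cond.notify_all()

for t in threads:
    t.join()

print(len(proceeded))  # prints 8
```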
schrep | 15 years ago:
We actually encounter Thundering Herd problems on a very regular basis. The Starbucks page has nearly 14M fans, and posts may get tens of thousands of comments/likes with a high update rate. You have a lot of readers on a frequently changing value, which means it is often not current in cache, and you can get a pileup on the database.
Since we encounter this on a regular basis we have built a few different systems to gracefully handle them.
Unfortunately, the event today was not just a thundering herd because the value never converged. All clients who fetched the value from a db thought it was invalid and forced a re-fetch.
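A toy sketch of that non-converging loop (all names and values here are hypothetical, not Facebook's code): every client judges the stored value invalid, drops the cached copy, and goes back to the database, which hands back the same bad value, so the re-fetching never stops.

```python
CACHE = {}
DB = {"config_key": "invalid-value"}  # hypothetical misconfigured value
db_queries = 0

def is_valid(value):
    # Stand-in for whatever validity check the clients apply.
    return value != "invalid-value"

def get_config(key):
    global db_queries
    value = CACHE.get(key)
    if value is None or not is_valid(value):
        CACHE.pop(key, None)   # every client invalidates the cached copy...
        db_queries += 1
        value = DB[key]        # ...re-queries the database...
        CACHE[key] = value     # ...and caches the same bad value again
    return value

for _ in range(100):           # 100 client requests -> 100 database hits
    get_config("config_key")

print(db_queries)  # prints 100
```

Because the database itself holds the "invalid" value, the cache never shields it: load scales with client traffic instead of converging to one query.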
This is a common pitfall when people start using Memcache. Memcache puts less load on the DB, which means you can keep the same DB hardware and scale up your pageviews...until Memcache craps out and all that load is now back on the DB again :(
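One common way to soften that pileup is to let a single caller rebuild a missing cache entry while everyone else waits for the result. A minimal sketch with a per-key lock (illustrative only, not Memcache's or Facebook's implementation):

```python
import threading

CACHE = {}
_locks = {}
_locks_guard = threading.Lock()
db_hits = 0

def _lock_for(key):
    # One lock per cache key, created on demand.
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def get(key, recompute):
    """Return key from cache; on a miss, only one caller recomputes."""
    global db_hits
    value = CACHE.get(key)
    if value is not None:
        return value
    with _lock_for(key):
        value = CACHE.get(key)  # re-check: another thread may have filled it
        if value is None:
            db_hits += 1
            value = recompute()  # only one thread hits the database
            CACHE[key] = value
    return value

def expensive_db_read():
    return "row-data"

threads = [threading.Thread(target=get, args=("k", expensive_db_read))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(db_hits)  # prints 1: a single DB read served all 50 readers
```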
sbov | 15 years ago:
When stuff breaks, I find that this phenomenon can often make it hard to tell the source of your problem from the symptoms of your problem. We usually would do it by going through the logs and seeing what problems showed up first.
Are there any architectures/patterns/methods that can help make it easier to find the source of performance issues?
ora600 | 15 years ago:
I often encounter systems that are designed for very short queries processing small numbers of rows, where the number of connections from the app servers is configured to be far greater than the number of CPUs.
Since those systems do very little IO, configuring connection pools to start with many more connections than the DB has CPUs, and to add more when connections are busy (i.e. the DB is getting slower, probably because it is highly loaded), is guaranteed to cause resource contention on the DB and to escalate the issue when something goes wrong.
It's a total waste, and yet it's the most common configuration in the world.
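The sizing rule being argued for, capping the pool near the database's CPU count instead of growing it whenever the DB slows down, can be sketched like this (all names here are illustrative):

```python
import os
import threading

DB_CPUS = os.cpu_count() or 4  # stand-in for the database server's core count

class BoundedPool:
    """Fixed-size pool: callers block for a slot instead of piling still
    more concurrent queries onto a database that is already saturated."""

    def __init__(self, size):
        self.size = size
        self._slots = threading.BoundedSemaphore(size)

    def run(self, query_fn):
        with self._slots:  # at most `size` queries in flight at once
            return query_fn()

# Sized to the DB's CPUs, not to (app servers x threads per server).
pool = BoundedPool(size=DB_CPUS)
print(pool.size)
```

For a CPU-bound workload, extra waiters queue cheaply in the pool; letting them through just adds contention on the DB without finishing any query sooner.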
This certainly won't be the first time that a system designed to increase uptime actually reduces it. I've seen a lot of "redundant" systems that are actually less reliable than simple standalones thanks to all the extra complexity of clustering.
cageface | 15 years ago:
I guess at Facebook's scale you have to build in fallbacks, but this is a reminder that you can easily do more harm than good.
Very true - the complexity of the system also comes into the equation when things go wrong and it takes real people to figure out why an outage is happening. Highly complex systems imply longer debugging time, and at a certain point a theoretically lower uptime can give you higher practical uptime just because engineers can actually understand and debug the system.
Sometimes cluster systems are mistaken for high-availability solutions while they are actually load-balancing solutions and can decrease availability due to added complexity and dependencies.
I am not sure that increasing uptime was a goal for this system. The way I read the story, this is about a system designed to make it easier to manage a server farm, which was misconfigured and not robust against such configuration errors. So it is more about a system designed to keep a server farm consistent. Staging those configuration changes could be a way to decrease the effects of future errors of this kind, as it gives operators time to notice that something is amiss.
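The staging idea can be sketched as a batch-by-batch rollout that stops at the first unhealthy batch (all function names here are hypothetical, not any real deployment tool):

```python
def staged_rollout(servers, new_config, apply, healthy, batch_size=2):
    """Apply new_config batch by batch; abort on the first unhealthy batch."""
    updated = []
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for s in batch:
            apply(s, new_config)
            updated.append(s)
        if not all(healthy(s) for s in batch):
            # Stop the rollout: most of the farm never saw the bad value,
            # and an operator has time to notice something is amiss.
            return updated, False
    return updated, True

# Tiny demonstration with an in-memory "farm".
servers = [f"web{i}" for i in range(10)]
state = {}

def apply(server, config):
    state[server] = config

def healthy(server):
    return state[server] != "bad-config"

updated, ok = staged_rollout(servers, "bad-config", apply, healthy)
print(len(updated), ok)  # prints: 2 False (only the first batch was hit)
```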
The other thing is that systems that cut across your architecture need to be handled with utter paranoia. They'll be the coupling points that cause global outages.
Stick with Mysql
* * * * If it aint broke, dont fix it!
Me too Melissa, and it's out there in the media that a group of hackers caused the problem, is this true, Mr Robert Johnson?
PLEASE !!!! WHAT CAN YOU DO TO HELP THIS FROM EVER HAPPENING AGAIN???????????????? PLEASE!!!!!!!!!!!!!!! CANDY
Kip da updates comin'
Did anyone get a message like I did about someone trying to access your account from another state?
swaarm | 15 years ago:
I actually came across a pretty good comment which explains it (to the layman) pretty well:
"Marvin, the server was like a dog chasing its tail...it kept going in circles, but never caught it. They basically had to hold the tail for the dog so he could bite a flea on it. :) LOL"
They built their own custom DNS server? Most of the failures people encountered were failures to contact the nameserver itself. Perhaps in the rush to fix it, someone screwed that up as well.
mbreese | 15 years ago:
It seems like that was their attempt to "shutdown" the site. It's a pretty effective way, too... The easiest way to stop the stampede was to drop off the net completely. By tweaking their DNS, they were able to give themselves enough room to breathe. Then they could slowly start to bring people back.
That's just a guess. But from their post, it seems reasonable.
So umm no mention of the 4chan DOS attack? I mean, not that I hang out there or anything, but a friend told me that they organized an attack. You'know. Jus sayin. /b/ye
jrockway | 15 years ago:
"You are using an incompatible web browser.
Sorry, we're not cool enough to support your browser. Please keep it real with one of the following browsers:"
This is why I don't use Facebook. It's not 1990 anymore. You don't need User-Agent sniffing.
AgentConundrum | 15 years ago:
Interesting article, terrible comments.
xentronium | 15 years ago:
After facebook's outage started, there were like 9128 threads in /b/ with trolls claiming it was their success.