I think this is another good example of how we as an industry are still unable to assess risk properly.
I'm fairly certain that the higher-ups in Twitter weren't told "We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down." Whoever was in charge of disaster recovery obviously didn't really understand the risk.
Just like the recent outages at Heroku and EC2, and just like the financial crisis of 2008, which was laughably called a "16-sigma event", it seems pretty clear that the actual assessment of risk is poor. The way Heroku failed, where invalid data in a stream caused the outage, and the way EC2 failed, where a single misconfigured device caused widespread failure, shows that the entire area of risk management is still in its infancy. My employer went down globally for an entire day because of an electrical grid problem: the diesel generators didn't fail over properly because of a misconfiguration.
You would think that after decades there would be better analysis and higher-quality "best practices", but the field still appears rather immature. Is this because the assessment of risk at a company is left to people who don't understand risk, and is there an opportunity here for "consultants" who do, much like security consultants?
> Whoever was in charge of disaster recovery obviously didn't really understand the risk.
That's not necessarily true. People don't die when twitter is down, and whatever twitter's business model actually is, I am not even sure there is a monetary penalty to them being down (unlike, say, Amazon being down which results in lost orders). They may have made the calculation that it was not cost effective engineering-wise to chase that extra 0.001% of reliability.
[Edit: Pedantry shield: Ok, ok, should have said people don't die because twitter is down. Obviously people are dying all the time, and some will indeed expire while twitter is down].
"I think this is another good example of how we as an industry are still unable to adequately assess risk properly."
It is likely that what you mean by "properly" is impossible. At large enough scales, what you end up with is a Gaussian distribution of errors in accordance with the Central Limit Theorem... except that there's a Black Swan spike in the low-probability, high-consequence events, and you basically can't spend enough money to ever get rid of them. Ever. Even if you try, you just end up piling equipment and people and procedures which will, themselves, create the black swan when they fail.
I think you're trying to imply that if only they'd understood better, this could absolutely have been prevented. No. Some specific action would probably have been able to avert this, but you simply don't have a 100% chance of calling those actions in advance, no matter how good you are.
The state space of these systems is incomprehensibly enormous, and there is no feasible way, in theory or in practice, to get all the failures out of it.
Living in terror of the absolute certainty of eventual failure is left as an exercise for the reader.
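To make the shape of that concrete, here's a minimal Monte Carlo sketch (not from this thread; every number is invented) of the point above: many small, independent failures sum to a roughly Gaussian bulk, but a rare black-swan outage still dominates the tail.

    # Purely illustrative numbers.
    import random

    def yearly_downtime_minutes():
        # Gaussian-ish bulk: 200 independent small glitches of ~2 minutes each
        bulk = sum(random.gauss(2.0, 0.5) for _ in range(200))
        # Black swan: a 1-in-1000 chance per year of a day-long outage
        swan = 24 * 60 if random.random() < 0.001 else 0
        return bulk + swan

    samples = [yearly_downtime_minutes() for _ in range(100_000)]
    catastrophic = sum(1 for s in samples if s > 12 * 60)  # years with the big outage
    print("typical year:", round(sorted(samples)[len(samples) // 2]), "minutes down")
    print("worst year:  ", round(max(samples)), "minutes down")
    print("years with a catastrophic outage:", catastrophic, "out of 100,000")

The bulk is easy to budget for; the one bad year in a thousand is the part no amount of averaging prepares you for.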
> I'm fairly certain that the higher-ups in Twitter weren't told "We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down." Whoever was in charge of disaster recovery obviously didn't really understand the risk.
Is this really a valid conclusion to come to at this point? I expect downtime in any service I operate. It's just how the world works. Does that mean I don't understand the risks and am misleading the board?
Any assessment of risk entails certain larger assumptions about the world, many of which often turn out to be mere guesses.
Consider all the prices that are set to their current levels because nobody expects the collapse of the US political system. Yet there is a nonzero probability that it will occur.
On one hand this seems like an absurd example, yet it exemplifies the kind of blind spot we are prone to when assessing risk. We generally address all the risks we can directly control, then classify the rest as "systemic", which essentially means we are unable to compute them, so we ignore them.
Yet many systems which we assume to be stable or predictable (governments, companies, markets, weather patterns, social trends, etc.) have unexpected aberrations now and then which can have very significant consequences. Since these tend to impact most companies equally, the market will converge on an equilibrium where no firms do anything to hedge against these things.
Do you want to pay extra bank fees so that your bank can hedge against the collapse of the US currency for your checking account? Probably not. Do you want to triple your hosting costs to hedge against a massive US power grid failure? Probably not. The same applies to asteroid risk and sudden ice age risk.
On the other hand, if you have lots of money saved, you may wish to hedge against the collapse of one currency or another, and if your business would end if you suffered a few hours of downtime, you might want to invest in massive amounts of redundancy.
Every morning when we all commute to work we risk death. Some exposure to systemic risk is considered acceptable, and part of the character of any person or business is the kind of risk exposure we tolerate day to day. A doctor working in an AIDS clinic risks needle sticks and HIV; a startup doubling its users each month risks downtime, but also risks a cash-flow crisis.
Your examples describe two entirely different systems. The failover of a software product is drastically different from the failover of a power system. Trying to map everything back to a common best practice under the category of "risk" seems like it would miss out on important intricacies.
That's not really fair; we as humans are bad at risk analysis. Bruce Schneier has written extensively on the role of cognitive biases and the like in measuring perceived risk.
You've never done DR, have you? It's a business process with a cost, and there are RTOs (recovery time objectives) and RPOs (recovery point objectives). Systems can and will go down. So long as recovery meets the defined objectives, DR has been performed correctly. There is a limited amount of money and resources that businesses can spend on DR, COOPs, and so on. You should understand that.
A one-hour RTO and RPO will cost far more than 24-hour recovery. Edit: and the business managers decide how much they wish to spend on DR. It's a trade-off, and anyone who has ever done it understands that.
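To put very rough numbers on that trade-off (a sketch with invented figures, not anyone's real costs), you can compare the standing cost of a DR setup against the expected loss implied by its RTO and RPO:

    # Every input here is a made-up estimate; substitute your own.
    def yearly_dr_cost(standing_cost, outages_per_year, rto_hours, rpo_hours,
                       loss_per_downtime_hour, loss_per_lost_data_hour):
        expected_loss = outages_per_year * (rto_hours * loss_per_downtime_hour
                                            + rpo_hours * loss_per_lost_data_hour)
        return standing_cost + expected_loss

    # Hot standby with one-hour objectives vs. 24-hour restore from backups
    hot = yearly_dr_cost(500_000, 1, 1, 1, 10_000, 5_000)     # 515,000
    cold = yearly_dr_cost(50_000, 1, 24, 24, 10_000, 5_000)   # 410,000
    print(hot, cold)

With these particular guesses the 24-hour option comes out cheaper; change the loss rates and the answer flips, which is exactly the call the business managers end up making.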
I'd guess the higher-ups at Twitter have run the cost-benefit analysis in their heads (and probably many spreadsheets) plenty of times, and in most cases spending your limited resources on disaster recovery preparation just isn't worth it. Their site being down does not qualify as a "disaster" - they'll be back up soon, and we'll all be tweeting away again within minutes.
This is a great point, and this isn't just about Twitter but also about many other sites and services that seem to depend on it. It looks like a lot of people have created a distributed system version of dependency hell for themselves, where they rely on a multitude of third parties not to change behaviour or go down. Additionally, in many cases and perhaps perfectly legitimately from a cost-benefit perspective, the envisaged way to recover from this kind of problem is to assume that people can quickly and frantically hack their way out of it at short notice.
(expanding my reply) No risk assessment in the world will stop a cage monkey from tripping over a pile of 1Us and falling onto the big red button. Figure out what your pain threshold is and live with it.
It is simply a matter of perceived value and cost-benefit. Why would a CIO spend millions on DR when the probability of disaster is so minute that the risk manager cannot even calculate it? OK, there is a risk that a plane will hit the primary data center: .00000002%. And ultimately, will our business grind to a halt? Or can we use a manual workaround until backups are restored to the secondary data center and we recapture the data lost since the last backup? I have a hard time taking this sort of risk seriously unless I'm running dialysis machines and someone's life is at risk.
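The back-of-the-envelope version of that argument, with hypothetical figures just to show its shape:

    # Expected yearly loss from an event too rare for the risk manager to price.
    p_plane_hits_primary_dc = 0.00000002 / 100   # the ".00000002%" above, per year
    cost_if_it_happens = 50_000_000              # assumed total impact of losing the site
    standby_site_cost = 2_000_000                # assumed yearly cost of a mirrored DR site

    expected_loss = p_plane_hits_primary_dc * cost_if_it_happens
    print(expected_loss)       # about a cent per year
    print(standby_site_cost)   # what you'd spend every year to insure against it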
> ... and that there is an opportunity for "consultants" who understand this, kind of like security consultants?
Risk management consultants already exist! Many companies just choose to assess risk internally, or the consultants they hire lack practical experience.
I think that a lot of you guys are confusing "Disaster Recovery" with "Business Continuity".
Disaster Recovery is a reactive approach. It's what you do to get things back up AFTER a system or site has failed.
Business Continuity is a proactive approach. It's what you do to ensure that your critical services will remain viable whenever disaster occurs.
In the cases of Heroku, Amazon, Twitter, and many more, their Disaster Recovery strategies have been successful. The fact that they came back online without major data loss is proof of that. Their business continuity strategies, however, have been found wanting.
We don't actually know whether the outages you cite were caused by a disaster, so maybe even the disaster recovery side is lacking.
I hope they write up a post-mortem on the fallout (hopefully it won't be a post-mortem of Twitter). Those things are always extremely interesting with big infrastructure like this.
Always fun when you're developing against an API and have to perform a frantic investigation to work out whether your latest code change broke everything... or whether it's just the API endpoint itself.
Sure, being upset/getting angry just because of a little bit of Twitter downtime is stupid, but that doesn't take away from the fact that one of the biggest and most important discussion and communication channels the web has is completely down.
That's a good blog post. But "x is down" is newsworthy for a sufficiently large number of users of x and a sufficiently long downtime. That's because consumer experience with an online service influences that service's reputation. A service with few users, or users who have lower expectations, can endure more downtime without loss of reputation than a service with many users who expect it simply always to be available, at least as available as broadcast television or plain-old telephone service.
This deserves to be the top comment. Your one-liner nailed it. Twitter was down long enough for far more people not to notice than did notice. Shit goes down. It always will. Whining about how whoever needs backups or failover protection or distributed networks of servers across the planet, or should use a VPS instead, or a dedicated server instead, or Heroku instead, or EC2 instead, or a combination of all that crap, doesn't make you right. It makes you a speculator. No amount of fallbacks will give you 100% uptime, ever. And calling this a massive failure is also ludicrous. It's just some downtime. It went right back up, so chill.
These posts are so incredibly annoying. We can see for ourselves whether service x is down. That isn't news. I could maybe accept these stories if the link on the front page were a blog post stating not only that service x is down but also why it went down, plus a lesson we can learn from it. Short of that, it's become an easy way for people to build up a trillion karma points. And if you want to tell me you don't care about karma, then you're either lying or you have none. Enough with this crap. We'll find out ourselves. Actually, most of us won't, because we have lives, and by the time we go online to check our favorite wank-off site it'll probably be back up again, like the past fifty times I've seen a story about Heroku/AWS/Twitter being down.
I'm glad this made it to the front page. Is the topic itself newsworthy? Not on its own. Is all the discussion that's flooding into this thread worth having?
Yep. Even the subthread from the person complaining that this isn't newsworthy.
If you're debugging web services that suddenly slow down (timeouts of 10s), this may be the cause if they depend on s.twitter.com, search.twitter.com or api.twitter.com.
As a workaround for those systems, add entries to your /etc/hosts file that map s.twitter.com, search.twitter.com and api.twitter.com to 127.0.0.1.
This obviously breaks Twitter integration, but it also makes sure page loads don't hang waiting on remote resources.
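For reference, the workaround being described amounts to adding lines like these (connections to 127.0.0.1 are refused immediately when nothing is listening locally, so pages stop waiting on the 10-second timeouts):

    # /etc/hosts - temporary workaround while Twitter is unreachable; remove afterwards
    127.0.0.1   s.twitter.com
    127.0.0.1   search.twitter.com
    127.0.0.1   api.twitter.com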
http://www.schneier.com/blog/archives/2009/08/risk_intuition...
Googling "Schneier risk" gets you lots and lots of reading material.
The only question here is whether Arrington is going to write another Amateur Hour post about this.
http://techcrunch.com/2008/04/23/amateur-hour-over-at-twitte...
In almost every disaster situation I've been a part of, the UPSes have failed. Almost every time.
Disaster recovery has a horrible track record.
http://www.theregister.co.uk/2001/09/10/its_bofh_disaster_re...
Twitter is infrastructure for us in media. It's well worth discussion.
Unfortunately there are no details; it just says "there was a cascading bug in one of our infrastructure components".
No news at the status site either, which defeats the purpose of having a dedicated status site.
> Users may be experiencing issues accessing Twitter. Our engineers are currently working to resolve the issue.
Edit: Nope. Just slow. My tweet appeared.