tonylxc | 10 years ago

TL;DR: "We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, ..."

This doesn't sound very good.

jpatokal|10 years ago

No, it sounds good, because it's realistic and then you can build mitigation strategies.

I was recently involved in an outage that occurred because the same datacenter was hit by lightning three times in a row. Everything was redundant up the wazoo and handled the first two hits just fine, but by the time the power went out for the third time within N minutes, there wasn't enough juice left in some of the batteries!

Now, would it be possible to build an automated system that can withstand this? Probably. But would your time & money be better spent worrying about other failure modes? Almost certainly.

jrockway|10 years ago

If your plan to avoid downtime is to prevent power outages, you're going to have downtime. All their sentence says is they can't prevent power outages. That's fine, because the other 1/nth of your servers are on a different power grid in a different state.

tonylxc|10 years ago

I totally share the view that the best way to avoid failure is to embrace it and cope with it.

It is true that their sentence is all about recovery. However, it is disappointing that they didn't mention anything about a redundant datacenter.

otterley|10 years ago

Whose datacenter are they in? This is the second time in less than two weeks that they've suffered a power-related issue. My company is in 4 different sites around the world and we've never lost power. And if one circuit did go out, we'd still be up and running, because all of our servers have redundant power supplies on separate infeed circuits.

theptip|10 years ago

The rest of the sentence is pertinent:

"...but we can take steps to ensure recovery occurs in a fast and reliable manner. We can also take steps to mitigate the negative impact of these events on our users."

The lessons that giants like Netflix have learned about running massive distributed applications show that you cannot avoid failure, and instead must plan for it.

Now, having a single datacenter is not a good plan if you want to give any sort of uptime guarantee, but that's a different point to make.

tonylxc|10 years ago

My point is: they shouldn't ONLY plan on ensuring recovery occurs fast; they should also plan on having multiple data centers, which to me is more important. It's frightening to know that such an important service is only operating in a single data center.

However, their recovery report didn't mention anything about such a plan.

<< Edited: correct a grammar error.