l9i | 4 years ago

Unfortunately I cannot edit the parent comment anymore, but several people pointed out that I didn't back up my claim or provide any credentials, so here they are:

Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.

I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently, part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently of Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course, Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.

While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?

grayfaced|4 years ago

In future news: Waymo outage results in engineers unable to get to data center. Engineers don't even know where their servers are.

shemnon42|4 years ago

Give us the dirt on how Google does its disaster planning exercises please! Do you do these exercises all at once or slowly over the year?

l9i|4 years ago

Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.

Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.

While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams do periodically.

If you are interested in reading more about Google's approach to disaster planning and preparedness, see the "DiRT, or how to get dirty" section of "Shrinking the time to mitigate production incidents—CRE life lessons" (https://cloud.google.com/blog/products/management-tools/shri...) and "Weathering the Unexpected" (https://queue.acm.org/detail.cfm?id=2371516).