(no title)
l9i | 4 years ago
Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.
I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently a part of my job was to ensure that Google is prepared for various disasterous scenarios and that Googlers can coordinate the response independently from Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.
While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?
grayfaced|4 years ago
shemnon42|4 years ago
l9i|4 years ago
Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.
While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty tests conducted througout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams are doing periodically.
If you are interested in reading more about Google's approach to distaster planning and preparedness, you may be interested in reading the DiRT, or how to get dirty section from Shrinking the time to mitigate production incidents—CRE life lessons (https://cloud.google.com/blog/products/management-tools/shri...) and Weathering the Unexpected (https://queue.acm.org/detail.cfm?id=2371516).
donalhunt|4 years ago