Relevant amusing bit from the Amazon FAQ: "S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years."
I think my favorite part of that is "on average", as if you will be making repeated ten-million-year trials of this effectively brand new technology.
The point is that once you get into several nines of reliability, really rare events that are impossible to model start to dominate your risk budget.
Many people have far more than 10k objects though. Based on that math, if you stored 10 billion objects in S3 (certainly well within the realm of possibility), you'd lose an object on average every 10 years, which is a length of time one can think about.
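A quick back-of-envelope check of that scaling (a minimal sketch; the 11-nines figure and the object counts are just the numbers quoted above):

    # 99.999999999% annual durability => ~1e-11 chance of losing any given
    # object in a given year, treating losses as independent events.
    annual_loss_probability = 1e-11

    for object_count in (10_000, 10_000_000_000):
        expected_losses_per_year = object_count * annual_loss_probability
        years_per_expected_loss = 1 / expected_losses_per_year
        print(f"{object_count:>14,} objects: ~{expected_losses_per_year:.2g} "
              f"losses/year (one loss every ~{years_per_expected_loss:,.0f} years)")

That prints one expected loss every ~10,000,000 years for the 10,000-object case and one every ~10 years for 10 billion objects, matching the figures above.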
Your main point (that making high reliability systems more conventionally reliable is building your high wall yet higher still) is definitely valid. But the lossage rate is actually a meaningful number given the extremely large number of objects stored in S3.
This is S3, which isn't comparable to Google's persistent disks. S3 is equivalent to Google Cloud Storage, which quotes the same "99.999999999%" durability, as per https://cloud.google.com/storage/
To accurately compare them you'd need to look at AWS EBS: "Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% - 0.2%, where failure refers to a complete or partial loss of the volume, depending on the size and performance of the volume."
Indeed "on average" covers the entire period and from initial local-node intake of the data it has to then be propergated to another site to increase integrity probability. So if it takes even 5 seconds to propergate to several sites at different locations then the durability scales from when the clock starts.
Heck, "once every 10,000,000 years" events do happen: the odds of winning the US lottery are around one in 175,000,000, yet people partake and somebody wins. So it is always worth looking at odds from another perspective and accepting that an event is impossible, probable, or merely possible; and as we know, the return on investment in durability is logarithmic, with the cost going up while the gain gets smaller.
Still, one of the biggest issues in any datacentre is not just the power but the quality of the power: even the slightest noise can, and does, increase the odds of some electronics going wrong. UPSs (though it has been many years since I looked) put out a modulated square wave, whereas the ideal is a perfect sine wave. It is truly educational to see the quality of many power sources on an oscilloscope. Equipment on the same circuit can also induce noise, so it can vary rack to rack.
Do they spell out what assumptions they're working under? Because, for example, I'm pretty sure that the odds of a civilization-ending asteroid or comet strike in the next year, while quite low, are higher than what's implied by 99.999999999% durability on trillions of objects.
That's one of the reasons that I'm excited about decentralized storage. Data stored on dozens or hundreds of nodes spread across multiple countries and jurisdictions is much more robust to things like earthquakes, storms, government intervention, and companies going out of business or changing their profitability model. It has a much stronger defence against black swan events.
When you are storing many billions of files consuming many millions of disks, you become vulnerable to black swan events, simply because you are rolling the dice so many times.
The irony is that, by that time, Amazon will almost certainly not be around. So expressing it in terms of time is really a different calculation, and one based on wrong assumptions :)
Well, the expectation is different for different players. If Facebook loses 1 photo for 1 customer out of 10M customers, FB wouldn't care; the user may just assume it was a glitch. For a small business, losing 1 photo for 1 customer out of 10,000 customers can be a big deal, but nonetheless not the worst case: offer a sincere apology, do extra data backup/replication if that data is extremely valuable to the customer, and roll out a new service.
Another problem is that any errors in assumptions or omissions made when calculating the odds will be enormously magnified.
I wouldn't trust any of these figures unless they have ongoing efforts to test them empirically. E.g. create distributed databases of 100 trillion objects, mess with them in various ways, and perform correctness checks on them.
Sounds like this wasn't caused by power surges reaching the equipment but rather an effect of repeated power loss to drive arrays not fully designed to handle it. The article is pretty unclear. Still sounds like an infrastructure problem though.
I work on cell sites; grounding system design and repair is a primary design element. Even then, the presumption in the industry is that if a site takes a direct hit - or, for that matter, a nearby strike - the equipment is a total loss.
The surge suppression gear we put in (lead-ins at power feeds, RF feeds, etc.) is mostly there to prevent a fire and to ensure the extra energy goes largely to ground, but it won't prevent dead gear.
Are you saying essentially that "There is no such thing as a surge protector, they don't physically exist. Only surge reducers exist." Because that's what it sounds like to me.
EDIT:
All right, I'll rephrase. According to Google's infobox from nat'l geographic, lightning generates up to 1 billion volts.
-> Are surge protectors at even the highest-end data centers simply not rated to a billion volts of surge protection?
This raises a relevant concern that's been on my mind: what's the best way to back up cloud services? Given that services like S3 and Google Drive have many more nines of durability than any local storage system I could devise, are backups even worth the trouble?
There are a lot of cloud-to-cloud backup services out there, but to me that seems like the blind leading the blind, especially with regard to malicious data destruction. For instance, I've recently been experimenting with Cloudally to automatically back up Google Drive, which seems like a good solution at first, until you consider that Cloudally uses Google accounts for authentication (and doesn't use 2FA for native authentication). In other words, an attacker with access to my primary data (Google Drive) would also have access to my backups. Worse than that, Cloudally actually increases the attack surface, since its lack of 2FA presumably makes it easier to crack than my Google account.
Similarly, I'm guessing a lot of cloud backup services share data centers with the services they are backing up.
If you really care about durability, your best bet is erasure coding + a wide geographic distribution of shards. For example, you could encode 1 TB of data into four shards, each shard containing 500 GB. You distribute these to servers in SF, NYC, Berlin, and Sydney. The key here is that you only need two shards to recover your 1 TB of data, and they can be any two shards. So if lightning strikes Berlin, and the Big One hits SF, your data is still safe. And thanks to erasure coding, you can achieve this with only 2x redundancy (instead of 4x).
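A rough way to see what that buys you, assuming each site independently loses its shard in a given year with some probability p (the 2% below is just an assumed illustrative value), is to compare 2-of-4 erasure coding against plain 2x replication:

    from math import comb

    # Compare annual loss probability: 2-of-4 erasure coding vs. two full copies,
    # assuming each site independently fails with probability p in a year.
    p = 0.02   # assumption: per-site annual failure probability

    # 2-of-4 coding: any 2 shards suffice, so data is lost only if 3+ shards fail.
    p_loss_erasure = sum(comb(4, k) * p**k * (1 - p)**(4 - k) for k in range(3, 5))

    # 2x replication: data is lost only if both copies fail.
    p_loss_replication = p**2

    print(f"2-of-4 erasure coding: {p_loss_erasure:.2e} annual loss probability")
    print(f"2x full replication:   {p_loss_replication:.2e} annual loss probability")

Both layouts consume 2x the raw storage (4 x 500 GB vs. 2 x 1 TB), but under these assumptions the erasure-coded layout comes out roughly an order of magnitude more durable.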
Durability is not the same as "protection". It doesn't cover things like government seizure or hacking. Your own drives, in your physical possession, fill in that gap (somewhat).
So Google said: "...although... the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain. In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk."
I thought the battery is supposed to cover writing the entire write buffer cache to disk in case of power loss. Sounds like they had some badly designed gear which did not account for partial battery charge; it should downsize the cache to the battery's remaining capacity.
Quoting the data lost as a percentage of disk space is both accurate and misleading. It makes the impact sound tiny because only recent writes were affected. Obviously writes that were in flight at the time of the incident are going to be a tiny percentage of overall storage. What they don't tell us is what percentage of persistent disks which were in use at the time were affected. That percentage is likely far higher. If only 0.000001% of volumes in use were affected it would never have made the news.
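As a purely illustrative sketch of how far apart those two percentages can be (the fleet size, volume size, and per-volume loss below are made-up assumptions, not figures from the incident report):

    # How a tiny fraction of bytes lost can still mean a much larger fraction of
    # volumes affected, if each affected volume only loses its in-flight writes.
    # All numbers below are assumptions for illustration.
    volume_count = 1_000_000                # assumed volumes in the zone
    volume_size_bytes = 100 * 10**9         # assumed 100 GB average volume size
    lost_bytes_fraction = 1e-8              # the quoted "less than 0.000001%"
    loss_per_affected_volume = 10**6        # assumed ~1 MB of unflushed writes lost

    total_bytes = volume_count * volume_size_bytes
    lost_bytes = total_bytes * lost_bytes_fraction
    affected_volumes = lost_bytes / loss_per_affected_volume

    print(f"bytes lost: {lost_bytes / 1e9:.1f} GB "
          f"({lost_bytes_fraction:.6%} of capacity)")
    print(f"volumes affected: ~{affected_volumes:,.0f} "
          f"({affected_volumes / volume_count:.1%} of volumes)")

Even with those made-up inputs, a byte-level figure of 0.000001% is compatible with on the order of a thousand volumes losing data, i.e. a volume-level rate many orders of magnitude higher.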
cjensen | 10 years ago
But those nines are only true if Amazon's software is bug-free. I've lost data on Glacier on my home account. They personally called me to apologize.
pjc50 | 10 years ago
(Forgetting about correlation was a big part of the MBS and LTCM financial failures)
cos2pi | 10 years ago
[0] http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/roger...
wodenokoto | 10 years ago
Ever thought about how we can talk about the chance of rain tomorrow? Or the risk that a comet may strike the earth within the next million years?
hyperpape | 10 years ago
So Google just exhausted their 11 9's for centuries to come.
jsprogrammer | 10 years ago
Important detail left out.
If there are 10,000,000 projects with 10,000 objects each, then a system-wide durability of 99.999999999% would be expected to drop about 1 object per year.
cdr | 10 years ago
This is how, for example, you can know whether a wildfire was started by lightning - once a point of origin is determined, simply check the data for strikes.
https://en.wikipedia.org/wiki/Lightning_detection http://www.lightningmaps.org/
chinathrow | 10 years ago
http://www.blitzortung.org/
chatmasta | 10 years ago
Assuming 1 petabyte of total storage at the datacenter, 0.000001% works out to about 10 MB. I wonder how much storage they have there.
tedchs | 10 years ago
"In a very small fraction of cases (less than 0.000001% of PD space in europe-west1-b), there was permanent data loss."
https://status.cloud.google.com/incident/compute/15056#57195...
pliu | 10 years ago
Did anyone else cringe?
djhworld | 10 years ago
I use Google Drive a lot and I don't keep track of what's in my drive. Should I be worried?