top | item 30763945

OVHcloud fire: SBG2 data center had no extinguisher, no power cut-out

420 points | detaro | 4 years ago | datacenterdynamics.com | reply

198 comments

order
[+] ovi256|4 years ago|reply
The report points out (Image 8) design choices that contributed to a raging fire, once the fire started:

* no emergency electricity cut-off device, an "economic strategy choice of the site operator". The electrical room where the fire started was hot, 400C at the door (measured by thermal camera), with meter-long electrical arcing from the door and thundering, deafening sounds, making access for utility technicians "difficult": it took them 2 hours to cut incoming electrical service from the utility. The on-site UPS devices also had no cut-off, so they kept supplying power.

* the emergency water network provided only 70m3/h at the site. A firefighting boat, Europa 1, was called in, supplying a max flow rate of 14.5m3/min.

The freeflowing air cooling design, a good design choice as it saves on operating costs for cooling, contributed to nourishing the fire.

[+] stingraycharles|4 years ago|reply
So basically, my takeaway is:

* Freeflowing air in DCs is good for cooling;

* It’s bad when you have a fire;

* An improved DC design would allow an operator to $somehow stop the freeflowing air (although one could argue that it’s not free flowing anymore if one can control it);

* I’d like to know how much money really was saved by not allowing the UPSes to be cut off.

I’m very curious how the insurance companies respond, and whether they’ll demand e.g. UPS supplies to be able to cut off. Or maybe in general, the fire department should be more aware of these types of trade-offs being made, and give their approval accordingly.

[+] hinkley|4 years ago|reply
> The freeflowing air cooling design, a good design choice as it saves on operating costs for cooling, contributed to nourishing the fire.

The reason you use forced air in your house instead of being built for natural convection is so you don't die in a fire. Fires are all about convection. Infernos doubly so.

There was an early luxury cruise ship tragedy, I think in New York City. A 'freeflowing air cooling design' and all wood paneling. It caught on fire, so they turned around to come back into port... and burned to the waterline, killing a bunch of people.

Building code for passenger ships got changed to require forced air and limit natural (flammable) materials after this.

[+] gunapologist99|4 years ago|reply
It's a great thing for startups that OVH provides servers at such an amazing low cost. Yes, there is always a risk that someone will make a mistake in the building design. There's a chance that eliminating some redundancies increases the possibility of a failure. There's always a chance that something bad will go wrong.

However, this isn't just a matter of Hanlon's razor (incompetence vs malice), but more of a matter of an intelligent guesstimate of risk versus a lack of knowledge in some areas (wasn't this OVH's very first datacenter?), and a strong focus on reducing costs. Perhaps the latter went too far, and definitely some obvious mistakes were made by not having a universal power cut off of some sort, but dealing with the amount of power on tap in a datacenter is always dangerous, even when there is no fire at all.

I'm not saying we need to give OVH a complete pass on this. I'm just saying that there are a lot of extenuating circumstances and, except for the power cut off, it's not clear that OVH made any choices due to extreme negligence or cost-cutting. In other words, they didn't do anything immoral. At worst, it appears that (even from the most anti-OVH party here), this was just a mistake in the design of a new (at the time) style of datacenter, and it did work properly for many years before there was a problem. Making a mistake is not immoral.

[+] mihaaly|4 years ago|reply
Everything works well, until not.

When it has worked for years without problems, but also without correcting the initial sources of risk (learning the business after the clueless first years), that's like discovering while driving that the trunk is full of flammable fluid but driving on because "nothing bad has happened before".

Buildings should be used within their safety margins and be prepared for certain types of extremities, especially fire. We do not put risky operations into a construction that cannot handle or mitigate the potential risks (no electricity cutoff, not enough fire extinguishing material, no way to cut off the intense ventilation). Operating a bakery in a barn without alterations comes to mind.

[+] wk_end|4 years ago|reply
Where fire’s concerned, I do think all mistakes (failure to take reasonable precautions, and it sounds like this is the case) really are either negligent or immoral. The costs you save for yourself and your customers don’t factor in the externalities that will impact third parties in the event of a fire - namely, risk of damage to surrounding property, and risk of injury or death for the people who have to put that fire out.
[+] mabbo|4 years ago|reply
Did OVH make it clear to customers that the saved money was coming from lack of safety? I've never worked with them so I don't know. But it seems to me that if you're advertising yourself as equivalent to a competitor only they have fire suppression and you don't, you are obligated to bring that up.

"We save you money because if anything goes wrong, your servers and all the critical data on them will be melted."

[+] StreamBright|4 years ago|reply
This is why it is silly to compare OVH to AWS. Apples to oranges.
[+] ddaalluu2|4 years ago|reply
"Amazing low cost"? They're one of the most expensive data center operators I know.

Hetzner is about half the price. But neither can have more than 3 HDDs per machine, which is just absurd.

[+] 3boll|4 years ago|reply
For anyone interested, I maintained a server within the affected DC.

OVH provided 3x the price of the service for the downtime. But for the recovery we needed to buy a new server from them, as our backups were only accessible from dedicated machines... In the end we basically received 2x the price of the server when discounting the temp machine. Communication during the downtime was not bad from OVH's side, taking into account the huge number of affected servers. In the end, as a small customer I can't do or ask for much more, as it's not worth either the money or the time. IMHO, 3x is not covering any business losses for anyone. We had our own backups within the OVH network and it took 3 additional days to be able to access them, as the network was a mess after the fire. For some businesses that is going to be a huge loss.

[+] hinkley|4 years ago|reply
This is the illusion of SLAs. These money-back-guarantees only refund the cost of the service. They aren't insurance for loss of business.

If I'm using your service for $5k a month and making less than $5k a month because of it, my finance people might rightfully ask where my head is at. 2x is better than 1x, but in general I think we are looking for higher rates of return for that. This hardware has to pay for my development and really my entire payroll after all.

I also can't trust that you losing $5k an hour motivates you to fix the problem ASAP as much as me losing $5k an hour does, let alone if I'm losing more than that.
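To make the gap concrete, here is a toy back-of-envelope comparison. The $5k/month fee and $5k/hour loss figures come from the comment above; the 2x credit and the 3-day outage length are hypothetical numbers picked for illustration:

```python
# Toy comparison of an SLA credit vs. actual business loss during an
# outage. Fee and hourly-loss figures are from the comment above;
# the 2x multiplier and 72-hour outage are hypothetical.
monthly_fee = 5_000            # $/month paid for the hosting service
sla_multiplier = 2             # a "2x the price" style credit
hourly_revenue_loss = 5_000    # $/hour lost while the service is down
outage_hours = 72              # hypothetical 3-day outage

credit = sla_multiplier * monthly_fee               # what the SLA pays out
business_loss = hourly_revenue_loss * outage_hours  # what the outage costs

print(f"credit: ${credit:,}, loss: ${business_loss:,}")
# credit: $10,000, loss: $360,000
```

Even with generous made-up numbers, the credit is an order of magnitude short of the loss, which is the point: SLA credits refund the service fee, they don't insure the business.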

[+] 3boll|4 years ago|reply
Since then I have changed my backup strategy... Now with diff providers. Ready for the next fire;)
[+] rosndo|4 years ago|reply
With these kinds of providers you’re kind of expected to have your own online backups in order to avoid outages. The price point certainly allows for it.
[+] moralestapia|4 years ago|reply
>no general electrical cut-off switch

This is so weird I can hardly believe it, maybe some details were lost in the writing.

Were they connecting everything directly to the grid? Even the most basic electrical setup goes through a fusebox with switches that turn everything on/off.

Perhaps that box was burning as well, or the fire blocked access to it, idk. If there truly was no way to cut power from the site then, wow, that was just an abysmally stupid decision.

[+] mardifoufs|4 years ago|reply
It wasn't directly plugged into the grid but the report says:

>Aucun organe de coupure externe

Meaning there was no way to cut external power from going into the DC. But the DC also had

>4 niveaux de reprise automatique de courant [very roughly translates to "4 layers of automatic power restart"]

Which I guess kept switching the power back on. So the only way to completely shut down power was by cutting off the building from the grid... with a switch that didn't exist. I don't know anything about data centers, but that does not make a lot of sense to me: why would you want power to come back on automatically after a safety shutdown?

The report also states that the lack of a main switch was due to "economic decisions made by the company", but does not give further details about that multi-layered restart system.

[+] kelp|4 years ago|reply
This does seem crazy to me. I've probably toured 30-40 datacenter facilities in my career, and they ALWAYS have a big red EPO (Emergency Power Off) button at the major exits to each datahall. I've been in plenty of facilities all over the US, Europe and Japan. (I've also had to deal with the fallout from outages caused by someone accidentally pushing that button. I think they thought it would open the door. Later on those buttons were always covered with a clear plastic housing)

Though never in France, and everything I've spent time in was a retail or wholesale provider, i.e. selling space to other companies. Not something owned and operated by a single company.

[+] phire|4 years ago|reply
They said they couldn't access the electrical room due to electrical arcs (and fire?). That's where the switches and fuse boxes would have been located.

What they wanted is a switch outside the building, that could cut power to the whole building without having to get to the electrical room.

[+] darkwater|4 years ago|reply
I think they refer to the absence of a single mains switch to turn off everything in the datacenter. It took the firefighters 3 hours to find all the individual switches and turn them off, according to the report.
[+] malfist|4 years ago|reply
It said there were meter long electrical arcs in the "power room". If I had to guess, I'd guess there was a shutoff in the power room, but not one accessible outside of it.
[+] fmajid|4 years ago|reply
Irresponsible design. It's not just the fire and the damage to businesses, but the report lists concerns about lead from the UPS batteries being spread all the way to Germany as a result of the plume from the fire, as well as in the water from the firemen. Fortunately, in this case, they measured the water and found no significant amounts, nor did the German environmental authorities, but it could easily have been as bad as the Notre-Dame fire where a huge chunk of innermost Paris was contaminated by lead from the destroyed roof.
[+] userbinator|4 years ago|reply
Of all the things to be worried about being released by a fire...?!

No. Lead is way down on the list. I'd be far more concerned about the other carcinogens from stuff burning.

[+] staticelf|4 years ago|reply
At the time of the fire I used nodechef that hosted their services on OVH and seemingly all their backups as well (in the same datacenter). Turns out when extraordinary events happen promises of backups and such aren't always kept.

We lost some data because of that, luckily we had our own backups. A good reminder to make sure you have backups and that they are working correctly. No matter what promises anyone gives you should always have your own backup strategy that's disconnected from the vendor you use.

It was Nodechef's fault, not OVH's, obviously, but perhaps it's interesting for others.

[+] Cthulhu_|4 years ago|reply
Make sure to read the T's and C's and availability / retention rates closely; it's a process that involves decoding the legalese and trying to associate it with RL situations.

Amazon's S3, for example, offers a 99.999999999% data durability guarantee, with other bits implying they can withstand a datacenter going up in flames. But there are two caveats: data availability is lower (so if that datacenter goes up in flames your data may not be lost, but it may also not be directly accessible until they restore their backups), and if they do lose data, what are the consequences to them? It'll be financial compensation at best.
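As a rough sketch of what eleven nines means in practice (the 10-million-object bucket is a made-up example, and this treats the published figure naively as an independent per-object annual loss probability, which is a simplification):

```python
# Back-of-envelope: what a 99.999999999% (eleven nines) annual
# durability figure implies, treating it naively as an independent
# per-object yearly loss probability. The bucket size is hypothetical.
durability = 0.99999999999
loss_prob_per_object_year = 1 - durability   # ~1e-11

objects = 10_000_000                         # hypothetical bucket
expected_losses_per_year = objects * loss_prob_per_object_year
years_per_expected_loss = 1 / expected_losses_per_year

print(f"expected object losses per year: {expected_losses_per_year:.6f}")
print(f"years until one expected loss:   {years_per_expected_loss:,.0f}")
```

Under those assumptions you'd expect to wait on the order of ten thousand years to lose a single object, which is why the availability caveat (can I read it *right now*?) matters far more in an incident like this than durability does.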

[+] mkj|4 years ago|reply

[deleted]

[+] catwell|4 years ago|reply
Disclaimer: I am French and I had servers (well, my employer did, and I was the admin) in that DC.

From the report (not the post): something that really annoyed the firemen is that not only was there no universal electrical cut-off, there were 4 different electrical backups, which they had to figure out how to cut off one by one...

It's the same thing as the self-cooling design: OVH optimized for what typically matters in a DC. You want an energy-efficient design and you never want power to go down.

Well... except when you do. I suppose by now they and other hosting providers have taken that issue into account and are modifying their DC designs accordingly.

[+] bayindirh|4 years ago|reply
A proper datacenter can be both efficient and safe. Add solid blinds to close the chimneys, an oxygen suppressant system (NOVEC, etc.), and motorized switch-fuses. They can all be orchestrated by a PLC and a fire alarm/control system.

Close blinds, release NOVEC, disable power rails to computers. That's all. It might not stop everything, but it can help a lot.
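The sequence described above could be sketched like this, purely as an illustration; the class and its actions are a toy model, not any real PLC or fire-panel interface:

```python
from dataclasses import dataclass, field

@dataclass
class Datahall:
    """Toy model of the fire-response sequence described above."""
    blinds_open: bool = True
    agent_released: bool = False
    rails_energized: bool = True
    log: list = field(default_factory=list)

    def on_fire_alarm(self):
        # Order matters: stop the airflow feeding the fire first,
        # then flood with the suppression agent, then kill power
        # to the racks so nothing keeps arcing.
        self.blinds_open = False
        self.log.append("blinds closed")
        self.agent_released = True
        self.log.append("suppression agent released")
        self.rails_energized = False
        self.log.append("power rails disabled")

hall = Datahall()
hall.on_fire_alarm()
print(hall.log)
```

The point of putting it behind one trigger is exactly what SBG2 lacked: a single action that de-energizes and starves the fire without anyone needing to reach the electrical room.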

[+] Zealotux|4 years ago|reply
I remember back when OVH was smugly mocking anyone who had concerns about their WC system; to be fair, the concerns were kind of ridiculous, but ultimately their system _did_ fail catastrophically.

Glad to have left that company years before the fire, never doing business with them ever again.

[+] belter|4 years ago|reply
Used them a few years ago with no complaints, but it sounds like they need to be a bit more specific on their compliance page: https://www.ovhcloud.com/fr/enterprise/certification-conform...

Edit: To clarify, they are claiming generic compliance with, for example, ISO/IEC 27001, 27017 and 27018. It does not look like it from the incident report. Maybe it's only some of their offers, and that is the detail I am referring to.

[+] rurban|4 years ago|reply
As an architect I don't understand how you can blame OVH here. Local fire code rests with the city council, which relies on the local fire department, and then on the maintainer. Without proper planning and fire code measures you won't get a permit. How did the constructor get a permit at all? This is a wooden building rated F 60?? It must be F90 for starters. Then the electrical planning: how did they get a permit at all?

OVH was only a renter. First I would blame local fire department for not enforcing their fire codes.

[+] tuananh|4 years ago|reply
seems like they have done an exceptionally well-executed, high-availability power design :D
[+] redm|4 years ago|reply
I've spent a lot of time in US datacenters, and I can't remember seeing one without an EPO. I've even seen them pressed a few times accidentally, when people exiting the floor thought they were "exit" buttons.

Are they talking about the lack of an EPO or something more fundamental to the power system? It almost seems like you need something external to the facility in case of a fire where it's unsafe to enter the building.

[+] pid-1|4 years ago|reply
Since we are finding that post facto, I really doubt it. It means their customers did not care about compliance.
[+] richardfey|4 years ago|reply
> I suppose by now them and other hosting providers have taken that issue into account and are modifying their DC designs accordingly.

Love your faith in humanity but I think we'd both be surprised about how little other hosting providers have changed after this incident.

[+] watsocd|4 years ago|reply
I have done work on a power system for a data center in a previous life.

It is not easy to shut power off in a data center. They are designed that way intentionally. Yes, it is fairly easy to shut down utility power. But then you have automatic diesel generators that will start. If you shut them down, then you have battery powered UPS units.

In the building that I knew well in Canada, they had a well guarded and covered button behind the security desk that was labeled 'EPO': Emergency Power Off.

This button would send a signal to all systems (utility switches, diesel generator, and UPS units) to immediately shutdown or don't even start (diesels).
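The fan-out described above can be sketched as a toy model (the device names are made up, and real EPO loops are dedicated hard-wired circuits, not software):

```python
# Toy illustration of an EPO button fanning out to every power source.
# Real EPO loops are hard-wired safety circuits, not software like this.

class PowerSource:
    def __init__(self, name, running=True):
        self.name = name
        self.running = running

    def emergency_off(self):
        # Shut down if running; inhibit start-up if on standby,
        # so e.g. the diesels never even attempt to start.
        self.running = False

sources = [
    PowerSource("utility switchgear"),
    PowerSource("diesel generator", running=False),  # on standby
    PowerSource("UPS string A"),
    PowerSource("UPS string B"),
]

def press_epo(sources):
    for s in sources:
        s.emergency_off()

press_epo(sources)
print([s.running for s in sources])
```

The crucial property is that one signal reaches every layer at once, including the standby generators, so no automatic restart layer can fight the shutdown, which is exactly the failure mode the SBG2 report describes.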

[+] detaro|4 years ago|reply
(Title truncated from original "OVHcloud fire report: SBG2 data center had wooden ceilings, no extinguisher, and no power cut-out" to fit HN length limit)
[+] gcoguiec|4 years ago|reply
As a former OVH employee, I was constantly reminded to avoid "surqualité" [over-engineering] and to try to understand OVH's "bricolocracie" [roughly, rule by DIY tinkering] better. I never could. I hope this shock will incite the company to improve the quality of its infrastructure and products. I only wish OVH the best.
[+] boringg|4 years ago|reply
Unrelated tidbit of knowledge I learned yesterday: the first fire brigade was created by Julius Caesar's partner/general Crassus. He put together a team of about 500 people to put out the fires that happened on a near-daily basis in Rome. The catch: he was a land speculator who would flip burnt homes on the cheap, so his brigade would run to a fire, but until the owner sold to Crassus on the cheap they would just watch the home burn.

Talk about hard-nosed business, or completely unethical leverage. Wow.

[+] nervoustwit|4 years ago|reply
"This meant that some OVH customers found that their servers continued running after the fire started." So it's not ALL bad news.
[+] londons_explore|4 years ago|reply
In a datacenter fire, the risk to human life is very small (has anyone ever died in a datacenter fire, apart from being suffocated by a halon system?).

It's just property risk, data loss and service downtime. Therefore, it's a business decision.

I have worked in a business that, during a fire, prioritized maintaining service uptime over putting the fire out. The end result: They had to buy more new servers, but customer workloads were migrated away within 15 minutes and saw no outage. For them, it was the right decision.

[+] michaelt|4 years ago|reply
Well, most fires involve risk to human life if firefighters have to enter the building to extinguish it.
[+] AnIdiotOnTheNet|4 years ago|reply
I have to imagine that smoke from the burning computing equipment isn't the healthiest thing for anyone downwind of the fire.
[+] taubek|4 years ago|reply
Aren't there some regulations that would prevent such a building from getting an operating permit?
[+] bilekas|4 years ago|reply
I remember this fire, it was definitely one of the biggest disruptions for us at the time.

> it took three hours to cut off the power supply because there was no universal cut-off.

This seems really egregious given the nature of data centers, arguably even more so than the wooden ceilings, which were at least treated to survive an hour-long fire. One would think that if the building had been de-energized, the fire could have been handled much faster.

[+] ge96|4 years ago|reply
I have been using their services for several years now. I remember one time there was a fire and they reimbursed me for the down time.
[+] ksec|4 years ago|reply
And now I am wondering if Hetzner has a similar problem or if their DCs are better designed.

In the old days the two were frequently mentioned together.

[+] annoyingnoob|4 years ago|reply
I deal with a ton of spam/phishing/malware that comes from OVH datacenters - OVH does nothing with complaints. Sometimes you really do reap what you sow.