The report points out (Image 8) design choices that, once the fire started, contributed to it raging out of control:
* No emergency electricity cut-off device, an "economic strategy choice of the site operator". The electrical room where the fire started was hot, 400°C at the door (measured by thermal camera), with meter-long electrical arcs coming from the door and deafening, thundering noise, making access "difficult" for the utility technicians; it took them 2 hours to cut the incoming electrical service from the utility. The on-site UPS devices also had no cut-off, so they kept supplying power.
* The emergency water network provided only 70 m³/h at the site. A firefighting boat, the Europa 1, was called in, supplying a maximum flow rate of 14.5 m³/min (roughly 870 m³/h).
The free-flowing air cooling design, a good design choice as it saves on operating costs for cooling, contributed to feeding the fire.
* Free-flowing air in DCs is good for cooling;
* It’s bad when you have a fire;
* An improved DC design would allow an operator to somehow stop the free-flowing air (although one could argue that it’s not free-flowing anymore if one can control it);
* I’d like to know how much money really was saved by not allowing the UPSes to be cut off.
I’m very curious how the insurance companies will respond, and whether they’ll demand, e.g., that UPS supplies can be cut off. Or maybe, in general, the fire department should be more aware of these kinds of trade-offs being made, and give their approval accordingly.
> The free-flowing air cooling design, a good design choice as it saves on operating costs for cooling, contributed to feeding the fire.
The reason you use forced air in your house instead of being built for natural convection is so you don't die in a fire. Fires are all about convection. Infernos doubly so.
There was an early luxury cruise ship tragedy, I think in New York City. A 'freeflowing air cooling design' and all wood paneling. It caught on fire, so they turned around to come back into port... and burned to the waterline, killing a bunch of people.
Building code for passenger ships got changed to require forced air and limit natural (flammable) materials after this.
It's a great thing for startups that OVH provides servers at such an amazing low cost. Yes, there is always a risk that someone will make a mistake in the building design. There's a chance that eliminating some redundancies increases the possibility of a failure. There's always a chance that something bad will go wrong.
However, this isn't just a matter of Hanlon's razor (incompetence vs malice), but more of a matter of an intelligent guesstimate of risk versus a lack of knowledge in some areas (wasn't this OVH's very first datacenter?), and a strong focus on reducing costs. Perhaps the latter went too far, and definitely some obvious mistakes were made by not having a universal power cut off of some sort, but dealing with the amount of power on tap in a datacenter is always dangerous, even when there is no fire at all.
I'm not saying we need to give OVH a complete pass on this. I'm just saying that there are a lot of extenuating circumstances and, except for the power cut off, it's not clear that OVH made any choices due to extreme negligence or cost-cutting. In other words, they didn't do anything immoral. At worst, it appears that (even from the most anti-OVH party here), this was just a mistake in the design of a new (at the time) style of datacenter, and it did work properly for many years before there was a problem. Making a mistake is not immoral.
When it has worked for years without problems, but also without correcting the initial sources of risk (learning the business after the clueless first years), that's like discovering while driving that the trunk is full of flammable fluid and driving on because "nothing bad has happened before".
Buildings should be used within their safety margins and be prepared for certain types of extreme events, especially fire. We do not put risky operations into a structure that cannot handle or mitigate the potential risks (no electricity cut-off, not enough fire extinguishing material, no way to shut off the intense ventilation). Operating a bakery in a barn without alterations comes to mind.
Where fire’s concerned, I do think all mistakes (failure to take reasonable precautions, and it sounds like this is the case) really are either negligent or immoral. The costs you save for yourself and your customers don’t factor in the externalities that will impact third parties in the event of a fire - namely, risk of damage to surrounding property, and risk of injury or death for the people who have to put that fire out.
Did OVH make it clear to customers that the saved money was coming from lack of safety? I've never worked with them so I don't know. But it seems to me that if you're advertising yourself as equivalent to a competitor only they have fire suppression and you don't, you are obligated to bring that up.
"We save you money because if anything goes wrong, your servers and all the critical data on them will be melted."
For anyone interested, I maintained a server within the affected DC.
OVH provided 3x the price of the service for the downtime. But for the recovery we needed to buy a new server from them, as our backups were only accessible from dedicated machines...
In the end we basically received 2x the price of the server when discounting the temporary machine.
Communication during the downtime was not bad from OVH's side, taking into account the huge number of affected servers.
In the end, as a small customer I can't do or ask for much more, as it's not worth either the money or the time.
IMHO, 3x is not covering any business losses for anyone. We had our own backups within the OVH network and it took 3 additional days to be able to access them, as the network was a mess after the fire. For some businesses that is going to be a huge sum.
This is the illusion of SLAs. These money-back-guarantees only refund the cost of the service. They aren't insurance for loss of business.
If I'm using your service for $5k a month and making less than $5k a month because of it, my finance people might rightfully ask where my head is at. 2x is better than 1x, but in general I think we are looking for higher rates of return for that. This hardware has to pay for my development and really my entire payroll after all.
I also can't trust that you losing $5k an hour motivates you to fix the problem ASAP as me losing $5k an hour, let alone if I'm losing more than that.
With this kind of provider you’re kind of expected to have your own online backups in order to avoid outages. The price point certainly allows for it.
This is so weird I can hardly believe it, maybe some details were lost in the writing.
Were they connecting everything directly to the grid? Even the most basic electrical setup goes through a fusebox with switches that turn everything on/off.
Perhaps that box was burning as well, or the fire blocked access to it, idk. If there truly was no way to cut power from the site then, wow, that was just an abysmally stupid decision.
It wasn't directly plugged into the grid but the report says:
>Aucun organe de coupure externe [no external cut-off device]
Meaning there was no way to cut external power from going into the DC. But the DC also had
>4 niveaux de reprise automatique de courant [very roughly, "4 levels of automatic power restoration"]
Which I guess kept switching the power back on. So the only way to completely shut down power was by cutting off the building from the grid... with a switch that didn't exist. I don't know anything about data centers, but that does not make a lot of sense to me: why would you design the power systems so that they keep switching themselves back on?
The report also states that the lack of a main switch was due to "economic decisions made by the company", but it does not give further details about that multi-layered restoration system.
This does seem crazy to me. I've probably toured 30-40 datacenter facilities in my career, and they ALWAYS have a big red EPO (Emergency Power Off) button at the major exits to each datahall. I've been in plenty of facilities all over the US, Europe and Japan. (I've also had to deal with the fallout from outages caused by someone accidentally pushing that button. I think they thought it would open the door. Later on those buttons were always covered with a clear plastic housing)
Though never in France, and everything I've spent time in was a retail or wholesale provider, e.g. selling space to other companies. Not something owned and operated by a single company.
They said they couldn't access the electrical room due to electrical arcs (and fire?). That's where the switches and fuse boxes would have been located.
What they wanted was a switch outside the building that could cut power to the whole building without having to get to the electrical room.
I think they refer to the absence of one single mains switch to turn off everything in the datacenter. It took 3 hours for the firefighters to find all the individual switches and turn them off, according to the report.
It said there were meter long electrical arcs in the "power room". If I had to guess, I'd guess there was a shutoff in the power room, but not one accessible outside of it.
Irresponsible design. It's not just the fire and the damage to businesses: the report lists concerns about lead from the UPS batteries being spread all the way to Germany by the plume from the fire, as well as in the firefighting water. Fortunately, in this case, they measured the water and found no significant amounts, nor did the German environmental authorities, but it could easily have been as bad as the Notre-Dame fire, where a huge chunk of innermost Paris was contaminated by lead from the destroyed roof.
At the time of the fire I used nodechef that hosted their services on OVH and seemingly all their backups as well (in the same datacenter). Turns out when extraordinary events happen promises of backups and such aren't always kept.
We lost some data because of that, luckily we had our own backups. A good reminder to make sure you have backups and that they are working correctly. No matter what promises anyone gives you should always have your own backup strategy that's disconnected from the vendor you use.
It was Nodechef's fault, not OVH's, obviously, but perhaps it's interesting for others.
Make sure to read the T's and C's and the availability / retention rates closely; it's a process that involves decoding the legalese and trying to map it onto real-life situations.
Amazon's S3, for example, offers a 99.999999999% data durability guarantee, with other bits implying they can withstand a datacenter going up in flames. But there are two caveats there: data availability is lower (so if that datacenter goes up in flames your data may not be lost, but it may also not be directly accessible until they restore their backups), and if they do lose data, what are the consequences to them? It'll be financial compensation at best.
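To put that durability figure in perspective, here's a rough back-of-the-envelope sketch (the fleet size is made up, and it simply treats the 11 nines as an annual per-object durability, which is how AWS frames the number):

```python
# Back-of-the-envelope only: not an AWS-published formula, just an
# expected-value estimate from the stated annual per-object durability.
durability = 0.99999999999                 # 11 nines
annual_loss_probability = 1 - durability   # ~1e-11 per object per year

objects_stored = 10_000_000                # hypothetical fleet size
expected_losses_per_year = objects_stored * annual_loss_probability

print(expected_losses_per_year)            # ~0.0001, i.e. on average one
                                           # object lost every ~10,000 years
```

Durability and availability are different promises, though: the estimate above says nothing about how long the data stays unreachable while a burnt-out site is being recovered.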
Disclaimer: I am French and I had servers (well, my employer did, and I was the admin) in that DC.
From the report (not the post): something that really annoyed the firemen is that not only was there no universal electrical cut-off, there were 4 different electrical backups, which they had to figure out how to cut off one by one...
It's the same thing as the self-cooling design: OVH optimized for what typically matters in a DC. You want an energy-efficient design and you never want power to go down.
Well... except when you do. I suppose by now they and other hosting providers have taken that issue into account and are modifying their DC designs accordingly.
A proper datacenter can be both efficient and safe. Add solid blinds to close the chimneys, a gaseous fire suppression system (NOVEC, etc.), and motorized switch-fuses. They can all be orchestrated by a PLC and a fire alarm/control system.
Close blinds, release NOVEC, disable power rails to computers. That's all. It might not stop everything, but it can help a lot.
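As a rough illustration of what that orchestration could look like (the class names and interfaces are invented for the sketch; a real system would live in a PLC / fire panel with interlocks, redundancy and manual overrides, not in Python):

```python
class Damper:
    """Motorized blind/damper that closes a free-cooling chimney."""
    def close(self):
        print("damper closed")

class Suppression:
    """Clean-agent release (e.g. Novec) for one room."""
    def release(self):
        print("suppressant released")

class PowerRail:
    """Motorized switch-fuse feeding a row of racks."""
    def trip(self):
        print("power rail de-energized")

def on_fire_alarm(dampers, suppression, rails):
    for d in dampers:          # 1. stop the airflow that feeds the fire
        d.close()
    suppression.release()      # 2. flood the affected room
    for r in rails:            # 3. cut power to the IT load in that room
        r.trip()

on_fire_alarm([Damper(), Damper()], Suppression(), [PowerRail(), PowerRail()])
```

Even that simple sequence addresses the three factors the report calls out: the free-flowing air, the lack of extinguishing capability, and the power that couldn't be cut.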
I remember back when OVH was smugly mocking anyone who had concerns about their WC (water-cooling) system; to be fair, the concerns were kind of ridiculous, but ultimately their system _did_ fail catastrophically.
Glad to have left that company years before the fire, never doing business with them ever again.
Edit: To clarify, they are claiming generic compliance with, for example, ISO/IEC 27001, 27017 and 27018. It does not look like it from the incident report. Maybe that only applies to some of their offers, and that is the detail I am referring to.
As an architect I don't understand how you can blame OVH here.
The local fire code is the responsibility of the city council, which works with the local fire department and then with the maintainer.
Without proper planning and fire code measures you won't get a permit.
How did the constructor get a permit at all? This is a wooden building rated F60?? It must be F90 for starters. Then the electrical planning: how did they get a permit at all?
OVH was only a renter. First I would blame the local fire department for not enforcing their fire codes.
I've spent a lot of time in US datacenters, and I can't remember seeing one without an EPO. I've even seen them pressed a few times accidentally, when people exiting the floor thought they were "exit" buttons.
Are they talking about the lack of an EPO or something more fundamental to the power system? It almost seems like you need something external to the facility in case of a fire where it's unsafe to enter the building.
I have done work on a power system for a data center in a previous life.
It is not easy to shut power off in a data center. They are designed that way intentionally. Yes, it is fairly easy to shut down utility power. But then you have automatic diesel generators that will start. If you shut them down, then you have battery powered UPS units.
In the building that I knew well in Canada, they had a well guarded and covered button behind the security desk that was labeled 'EPO': Emergency Power Off.
This button would send a signal to all systems (utility switches, diesel generators, and UPS units) to immediately shut down, or, in the case of the diesels, not start at all.
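A toy sketch of that layering (the source names are invented, purely to illustrate why opening the utility breaker alone isn't enough):

```python
sources = ["utility feed", "diesel generator", "UPS / battery string"]

def cut_only_utility():
    # the backup layers take over automatically and keep the load energized
    return [s for s in sources if s != "utility feed"]

def emergency_power_off():
    # a proper EPO opens the utility breaker, inhibits generator auto-start,
    # and tells the UPS units to shut their inverters down, all at once
    return []

print(cut_only_utility())      # ['diesel generator', 'UPS / battery string']
print(emergency_power_off())   # []
```

Which is presumably why it took the firefighters so long at SBG2: without that single signal, they had to find and disable each backup layer one by one.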
(Title truncated from original "OVHcloud fire report: SBG2 data center had wooden ceilings, no extinguisher, and no power cut-out" to fit HN length limit)
As a former OVH employee, I was constantly reminded to avoid "surqualité" (over-engineering) and to try to better understand OVH's "bricolocracie" (roughly, its culture of rule-by-tinkering). I never could. I hope this shock will prompt the company to improve the quality of its infrastructure and products. I only wish OVH the best.
An unrelated tidbit of knowledge I learned yesterday: the first fire brigade was created by Julius Caesar's partner/general Crassus. He put together a team of about 500 people to put out the fires that happened on a near-daily basis in Rome. The catch: he was a land speculator who flipped burnt homes bought on the cheap, so his brigade would run to a fire but watch the home burn until the owner sold to Crassus at a bargain price.
Talk about hard-nosed business, or completely unethical leverage. Wow.
In a datacenter fire, the risk to human life is very small (has anyone ever died in a datacenter fire, apart from being suffocated by a halon system?).
It's just property risk, data loss and service downtime. Therefore, it's a business decision.
I have worked in a business that, during a fire, prioritized maintaining service uptime over putting the fire out. The end result: They had to buy more new servers, but customer workloads were migrated away within 15 minutes and saw no outage. For them, it was the right decision.
I remember this fire, it was definitely one of the biggest disruptions for us at the time.
> it took three hours to cut off the power supply because there was no universal cut-off.
This seems really egregious given the nature of data centers; more so, I would argue, than the wooden ceilings, which were at least treated to survive an hour-long fire. One would think that if the building had been de-energized, the fire could have been handled much faster.
I deal with a ton of spam/phishing/malware that comes from OVH datacenters - OVH does nothing with complaints. Sometimes you really do reap what you sow.
Hetzner is about half the price. But neither can have more than 3 HDDs per machine, which is just absurd.
> the report lists concerns about lead from the UPS batteries being spread all the way to Germany
For reference, the border to Germany is about 250 m from the building: https://www.openstreetmap.org/relation/11917092
No. Lead is way down on the list. I'd be far more concerned about the other carcinogens from stuff burning.
> I suppose by now they and other hosting providers have taken that issue into account and are modifying their DC designs accordingly.
Love your faith in humanity, but I think we'd both be surprised at how little other hosting providers have changed after this incident.
In the old days the two were frequently mentioned together.