_dan|2 years ago
We ended up having to strategically shut servers down as well, but the question of what's critical, where it is in the racks, and what's next to it was incredibly difficult to answer. And kinda mind-bending - we'd been thinking of these things as completely virtualised resources for years, so suddenly having to consider their physical characteristics as well was a bit of a shock. Just shutting down everything non-critical wasn't enough - there were still now-critical, non-redundant servers sitting next to each other and overheating.
All we had to go on was an outdated racktables install, a readout of the case temperature for each machine, and a map of which machine was connected to which switch port, which loosely related to position in the rack - none of them completely accurate. In the end we got the colo guys to send a photo of the rack front and back and (though not everything was well labelled) we were able to make some decisions and get things stable again.
In the end we got lucky with one server that was critical but that we couldn't get to run cooler - we were able to pull out the server below it and (without shutting it down) have the on-site engineer drop it down enough to crack the lid open and get some cool air into it to keep it running (albeit with no redundancy and on the edge of thermal shutdown).
We came really close to a major outage that day, one that would have cost us dearly. I know it sounds like a total shambles (and it kinda was) but I miss those days.
SonOfLilit|2 years ago
Took me four reads to find an alternative way to read it other than "we asked some guy that doesn't even work for us to throw it on the ground repeatedly until the cover cracks open", like that Zoolander scene.
_dan|2 years ago
In our defence, he offered. It had hit hour 6 of both the primary and the backup aircon being down, on a very hot day - everyone was way beyond blame and the NOC staff were basically up for any creative solution they could find.
macintux|2 years ago
organsnyder|2 years ago
tetha|2 years ago
I'd have considered calling a few friends from the fire brigade or the catastrophe protection there.
It's not an emergency, yes. However, if you want a situation for your trainees to figure out how to ventilate a building with the force of a thousand gasoline-driven fans without anyone complaining and no danger to any person... well, be my guest, because I can't hear you anymore. Those really big fans are loud AF, seriously.
And, on a more serious note, you could show those blokes how a DC works. Where power goes, what components do, how to handle uncontrolled fire in areas. Would be a major benefit to the local fire fighters.
gottorf|2 years ago
That's hilarious (probably for you as well, in hindsight). Do you feel comfortable naming and shaming this DC, so we know to avoid it?
RHSman2|2 years ago