modernpacifist|2 years ago
Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and do my small contribution) that the software/tech industry can one day be an equal in this regard.
And finally, the biggest of kudos to the writer, Kyra Dempsey. What an approachable article, despite being (necessarily) heavy on the engineering content.
WalterBright|2 years ago
Note I wrote when X fails, not if X fails. It's a different way of thinking.
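In code, the "when, not if" mindset means every call to a dependency ships with its failure path already written, as a first-class path rather than an afterthought. A minimal Python sketch (function names here are hypothetical, invented for illustration):

```python
def failing_primary():
    # Stand-in for a dependency that has failed, as it eventually will.
    raise IOError("sensor offline")

def read_sensor(primary, backup):
    """Return a reading; the degraded path is designed up front."""
    try:
        return primary()
    except IOError:
        # Failure is expected, so fall back instead of crashing.
        return backup()

print(read_sensor(failing_primary, lambda: 42.0))  # prints 42.0
```

The point is not the fallback itself but that it was written before the failure, not during the outage.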
arendtio|2 years ago
I like the idea of thinking 'when' instead of 'if', but the standard should be even higher in software engineering, because software has this rare material at its disposal that doesn't degrade over time (see, e.g., the Ariane 5 failure [1]).
[1] https://en.wikipedia.org/wiki/Ariane_5#Notable_launches
WalterBright|2 years ago
On the 757, one set of control cables runs under the floor. The backup set runs in the ceiling.
cedivad|2 years ago
crickets, let's just randomise which sensor we use during boot, that ought to do it!
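For contrast with trusting one randomly chosen sensor, redundant designs typically vote across an odd number of sensors so a single bad reading gets outvoted. A rough Python sketch (illustrative only, not any real avionics code):

```python
import statistics

def voted_reading(readings):
    """Median-vote across redundant sensors so one faulty
    sensor cannot steer the output on its own."""
    if len(readings) < 3:
        raise ValueError("need >= 3 redundant sensors to outvote one fault")
    return statistics.median(readings)

# One wildly wrong angle-of-attack value is simply outvoted:
print(voted_reading([4.9, 5.1, 74.5]))  # prints 5.1
```

With two sensors you can detect disagreement but not resolve it; with three you can keep flying on the majority.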
mzi|2 years ago
Here at HN we want a post mortem for a cloud failure in a matter of hours.
modernpacifist|2 years ago
I'll go one further - I've yet to finish writing a postmortem on one incident before the next one happens. I also have my doubts that folks wanting a PM in O(hours) actually care about its contents/findings/remediations - it's just a tick box in the process of day-to-day ops.
thaumasiotes|2 years ago
And then, I saw an endless stream of aggrieved comments from people who were personally outraged that the outcome, whatever it might be, hadn't been finalized yet at the late, late date of... late February.
mlrtime|2 years ago
They may want a mitigation or RCA in hours, but even AWS gives us NDA-restricted PMs in > 24 hours.
crabmusket|2 years ago
And to be able to reconstruct the chain of events after the components in question have exploded and been scattered throughout south-east Asia is incredible.
nextos|2 years ago
However, I have been told by an insider that supply chain integrity is an underappreciated issue. Someone has been caught selling fake plane parts through an elaborate scheme, and there are other suspicious suppliers, which is a bit unsettling:
"Safran confirmed the fraudulent documentation, launching an investigation that found thousands of parts across at least 126 CFM56 engines were sold without a legitimate airworthiness certificate."
https://www.businessinsider.com/scammer-fooled-us-airlines-b...
EdwardDiego|2 years ago
https://admiralcloudberg.medium.com/riven-by-deceit-the-cras...
bambax|2 years ago
Checklists of course are not the same as detailed post-mortems but they belong to the same way of thinking. And they would cost pretty much nothing to implement.
Also CRM (crew resource management): it's very important to have a culture where underlings feel they can speak up when something doesn't look right -- or when a checklist item is overlooked, for that matter.
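As a sketch of how cheap checklists can be in software: a pre-deploy checklist encoded as data and checked mechanically, so a skipped item fails loudly instead of silently. The item names and the `ctx` dict are made up for illustration:

```python
# Hypothetical pre-deploy checklist: each item pairs a label with a check.
CHECKLIST = [
    ("rollback plan documented", lambda ctx: bool(ctx.get("rollback_doc"))),
    ("migrations applied",       lambda ctx: ctx.get("migrations_ok", False)),
    ("on-call notified",         lambda ctx: ctx.get("oncall_acked", False)),
]

def run_checklist(ctx):
    """Raise loudly if any item is unchecked, instead of relying on memory."""
    failed = [name for name, check in CHECKLIST if not check(ctx)]
    if failed:
        raise RuntimeError(f"unsatisfied checklist items: {failed}")
    return True
```

Wiring this into CI costs minutes; the speak-up culture is the hard part.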
sgarland|2 years ago
I was a submarine nuclear reactor operator, and one of my Commanding Officers once ordered that we stop using checklists during routine operations for precisely this reason. Instead, we had to fully read and parse the source documentation for every step. Before, while we of course had them open, they served as more of a backstop.
His argument – which I to some extent agree with – was that by reading the source documentation every time, we would better engage our critical thinking and assess plant conditions, rather than skimming a simplified version. To be clear, the checklists had been generated and approved by our Engineering Officer, but they were still simplifications.
blauditore|2 years ago
On a side note, that's also why there's all the nonsense security theater at airports.
jstanley|2 years ago
It must have something to do with the number of mistakes, otherwise it's all a waste of time!
It's all well and good responding to mistakes as thoroughly as possible, but if it's not reducing the number of mistakes, what's it all for?
krisoft|2 years ago
Not really. Imagine two systems with the same number of mistakes. (Here the mistakes can be either bugs or operator mistakes.)
One is designed such that every mistake brings the whole system down for a day with millions of dollars of lost revenue each time.
The other is designed such that when a mistake happens it is caught early, and when it is not caught it only impacts some limited parts of the system and recovering from the mistake is fast and reliable.
They both have the same number of mistakes, yet one of these two systems is vastly more reliable.
> if it's not reducing the number of mistakes, what's it all for
For reducing their impact.
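The second design can be sketched as a bulkhead: the mistake rate is unchanged, but each failure is caught where it happens and confined to one partition. Partition names and the toy handler below are invented for illustration:

```python
def serve(partitions, handler):
    """Run the same (buggy) handler per partition; a failure in one
    partition is caught there and cannot take down the others."""
    results = {}
    for name, data in partitions.items():
        try:
            results[name] = handler(data)
        except Exception as exc:
            # The mistake still happens, but only this partition degrades.
            results[name] = f"degraded: {exc}"
    return results

out = serve({"eu": 1, "us": 0, "ap": 2}, lambda d: 10 // d)
# eu and ap still serve; only us degrades when the divide-by-zero bug fires
```

Same bug, same trigger, but the blast radius - and hence the lost revenue - is a fraction of the whole-system outage.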
WalterBright|2 years ago
Not exactly. The idea isn't to never make mistakes; it's what you're going to do about X when (not if) it fails.
mewpmewp2|2 years ago
There's a slight difference in terms of what kind of damage an airplane malfunctioning causes compared to a button on an e-commerce shop rendering improperly for one of the browsers. My point is that the level of investment in reliability and process should be proportional to the potential damage of any incidents.
switch007|2 years ago
I’d love to be an engineer with unlimited time budget to worry about “when, not if, X happens” (to quote a sibling comment).
But people don’t tend to die when we mess up, so we don’t get that budget.
akarve|2 years ago
See the excellent *To Engineer Is Human* on just this topic of analyzed failures in civil engineering.