modernpacifist | 2 years ago

I don't know about others, but I can't help but smile when I read the detailed series of events in aviation postmortems. To be able to zero in on what turned out to be a single faulty part, and then trace the entire provenance and environment that led to that defective part entering service, speaks to the robustness of the industry. I say that sincerely, since mistakes are going to happen; in my view robustness has less to do with the number of mistakes than with how one responds to them.

Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and do my small contribution) that the software/tech industry can one day be an equal in this regard.

And finally, the biggest of kudos to the writer, Kyra Dempsey. What an approachable article despite being (necessarily) heavy on the engineering content.

WalterBright|2 years ago

As a former Boeing engineer, other industries can learn a great deal from how airplanes are designed. The Fukushima and Deepwater Horizon disasters were both "zipper" failures that showed little thought was given to "when X fails, then what?"

Note I wrote when X fails, not if X fails. It's a different way of thinking.
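
To make the "when, not if" framing concrete, here is a minimal sketch (illustrative only; the function, threshold, and units are invented for this comment, not taken from any avionics system) of reading redundant sensors on the assumption that any one of them will eventually fail:

```python
from statistics import median

SENSOR_DISAGREEMENT_LIMIT = 5.0  # max plausible spread between healthy sensors

def read_voted(sensors):
    """Read redundant sensors and vote, assuming any of them can fail.

    Returns (value, degraded). degraded=True means the caller must fall
    back to a contingency mode rather than trust the reading blindly.
    """
    readings = []
    for sensor in sensors:
        try:
            readings.append(sensor())
        except Exception:
            continue  # a sensor failing is expected, not exceptional
    if not readings:
        return None, True  # total loss: caller switches to its backup plan
    if max(readings) - min(readings) > SENSOR_DISAGREEMENT_LIMIT:
        return median(readings), True  # disagreement: best guess, flagged degraded
    return median(readings), False
```

The point is structural: sensor failure and sensor disagreement are ordinary, anticipated return paths the caller must handle, not exceptions that crash it.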

lloeki|2 years ago

When I worked in an industrial context, some coding tasks would seem trivial to today's Joe Random software dev, but we had to be constantly thinking about failure modes: from degraded modes that would keep a plant 100% operative 100% of the time in spite of some component being down, to driving a 10 m high oven that can split airborne water molecules from mere ambient humidity into hydrogen, whose buildup could be dangerously explosive if certain parameters were not kept in check. All of which implies that the code/system has to have a number of contingency plans. "Sane default" suddenly has a very tangible meaning.
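
As a toy illustration of what a "sane default" can mean in such a controller (the names and thresholds here are invented for this comment, not from the actual plant): when a reading is unavailable, fall back to the worst-case assumption, never to zero:

```python
WORST_CASE_HUMIDITY = 100.0  # % relative humidity: assume saturation if we cannot measure

def oven_mode(humidity_reading):
    """Choose an operating mode for the oven.

    A missing sensor reading does not crash the controller, and is never
    treated as "no humidity"; the sane default is the worst case.
    """
    humidity = WORST_CASE_HUMIDITY if humidity_reading is None else humidity_reading
    if humidity > 60.0:
        return "purge"  # degraded mode: vent so hydrogen cannot build up
    return "run"
```

The design choice is that the contingency path (`purge`) is the default you fall into, and the normal path is the one you have to earn with a valid, in-range measurement.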

f1shy|2 years ago

As an engineer I think a lot about tradeoffs of cost vs. other criteria. There is little I can learn from the nuclear or aviation industry, as the cost structure is so completely different. I'm very happy that the costs of safety in aviation are widely accepted, but I understand that few people are willing to pay similar costs for other things like, say, cars.

arendtio|2 years ago

In the context of disasters that happened due to software failures (e.g. Ariane 5 [1]), one of my professors used to tell us that software doesn't break at some point; it is broken from the beginning.

I like the idea of thinking 'when' instead of 'if', but the verdict should be even harsher when it comes to software engineering, because it has a rare material at its disposal that doesn't degrade over time.

[1] https://en.wikipedia.org/wiki/Ariane_5#Notable_launches

WalterBright|2 years ago

An example of zipper failure in the Airbus incident: when a wire bundle gets cut, all the functions of all the wires in that bundle are lost. Having two or more smaller bundles, physically separated, would greatly reduce that risk. Certainly, having the primary and the backup system in the same bundle is a bad idea.

On the 757, one set of control cables runs under the floor. The backup set runs in the ceiling.

laydn|2 years ago

What's fascinating about airplane design for me is not the huge technical complexity, but rather, the way it is designed such that a lot of its subsystems are serviceable by technicians so quickly and reliably, not just in a fully controlled environment like a maintenance hangar, but right on the tarmac, waiting for takeoff.

cedivad|2 years ago

> When my AoA sensor fails, then what?

crickets, let's just randomise which sensor we use during boot, that ought to do it!

asystole|2 years ago

I agree in principle, but I don't think industries should be looking at current-day Boeing's engineering practices, except as an example of how a proud company's culture can rot from the inside out, with fatal consequences.

oefnak|2 years ago

Are you serious in saying that other industries could learn from Boeing?

sylens|2 years ago

I think many of us are so used to working with software, with its constant need for adaptation and modification in order to meet an ever growing list of integration requirements, that we forget the benefits of working with a finalized spec with known constants like melting points, air pressure, and gravity.

abid786|2 years ago

Completely agree; I think it can go one of two ways. Software is more malleable than airplanes are, and the reverse also comes with downsides (like how much time and effort it takes to bring a new plane to market).

WalterBright|2 years ago

Airliners face constantly changing specifications. No two airliners are built the same.

mzi|2 years ago

It took hundreds of subject experts from ten organizations in seven countries almost three years to reach that conclusion.

Here at HN we want a post mortem for a cloud failure in a matter of hours.

modernpacifist|2 years ago

> Here at HN we want a post mortem for a cloud failure in a matter of hours.

I'll go one further: I've yet to finish writing a postmortem on one incident before the next one happens. I also have my doubts that folks wanting a PM in O(hours) actually care about its contents/findings/remediations; it's just a tick box in the process of day-to-day ops.

thaumasiotes|2 years ago

Something similar that struck me was that, in early February, Russia invaded Ukraine.

And then, I saw an endless stream of aggrieved comments from people who were personally outraged that the outcome, whatever it might be, hadn't been finalized yet at the late, late date of... late February.

mlrtime|2 years ago

I work at a mid-tier FAANG; our SLA for post mortems is in the 7-14 day range. Nobody seriously wants a full PM in hours.

They may want a mitigation or RCA in hours, but even AWS gives us NDA-restricted PMs in >24 hours.

crabmusket|2 years ago

> To be able to zero in on what turned out to be a single faulty part and then trace the entire provenance and environment that led to that defective part entering service speaks to the robustness of the industry.

And to be able to reconstruct the chain of events after the components in question have exploded and been scattered across South-East Asia is incredible.

Gare|2 years ago

My impression was that the defective part was still inside the engine when it landed.

nextos|2 years ago

Aviation is great because the industry learns so much after incidents and accidents. There is a culture of trying to improve, rather than merely seeking culprits.

However, I have been told by an insider that supply chain integrity is an underappreciated issue. Someone was caught selling fake plane parts through an elaborate scheme, and there are other suspicious suppliers, which is a bit unsettling:

"Safran confirmed the fraudulent documentation, launching an investigation that found thousands of parts across at least 126 CFM56 engines were sold without a legitimate airworthiness certificate."

https://www.businessinsider.com/scammer-fooled-us-airlines-b...

inglor_cz|2 years ago

I suspect this is precisely what is happening in Russian civil aviation now. No legit parts supplied, so there will be a lot of fake/problematic parts imported through black channels.

bambax|2 years ago

The Checklist Manifesto (2009) is a great short book that shows how using simple checklists would help immensely in many different industries, especially in medicine (the author is a surgeon).

Checklists of course are not the same as detailed post-mortems but they belong to the same way of thinking. And they would cost pretty much nothing to implement.

Also CRM (crew resource management): it's very important to have a culture where underlings feel they can speak up when something doesn't look right -- or when a checklist item is overlooked, for that matter.

sgarland|2 years ago

Yes, but they do have one critical failure mode: that the checklist failed to account for something (or that an expected reaction to a step being performed didn’t occur).

I was a submarine nuclear reactor operator, and one of my Commanding Officers once ordered that we stop using checklists during routine operations for precisely this reason. Instead, we had to fully read and parse the source documentation for every step. Before, while we of course had them open, they served as more of a backstop.

His argument – which I to some extent agree with – was that by reading the source documentation every time, we would better engage our critical thinking and assess plant conditions, rather than skimming a simplified version. To be clear, the checklists had been generated and approved by our Engineering Officer, but they were still simplifications.

jacquesm|2 years ago

Checklists are great if you use them properly: to make sure you remember. Checklists are dangerous when they are used improperly: to replace or shut-down critical thinking.

Simon_ORourke|2 years ago

A colleague of mine came from a major aviation design company before joining tech and was in a state of culture shock at how critical systems were designed and monitored. Even though a billing system has no hard real-time requirements, he was surprised at just how lax tech design patterns tended to be.

Horffupolde|2 years ago

If 200 people died after a db instance crashed, software would be equal in that regard.

girvo|2 years ago

As evidence of this: software that deals with medical systems is somewhat more like aviation software.

mlrtime|2 years ago

Likewise, in "aviation", when the entertainment system completely fails on a 4 hour flight, there is most likely no post mortem at all. They turn it off and on again, just like most of us.

mewpmewp2|2 years ago

Some people who think this is the ideal for any sort of software sound like they would also want a 3-hour post mortem with whoever designed the room after slightly stubbing a toe.

blauditore|2 years ago

This kind of makes sense, but it is only possible because of public pressure/interest. Many people are irrationally emotional about flying (fear, excitement, etc.), which is why articles and documentaries like this post are so popular.

On a side note, that's also why there's all the nonsense security theater at airports.

jstanley|2 years ago

> robustness has less to do with the number of mistakes but how one responds to them

It must have something to do with the number of mistakes, otherwise it's all a waste of time!

It's all well and good responding to mistakes as thoroughly as possible, but if it's not reducing the number of mistakes, what's it all for?

krisoft|2 years ago

> It must have something to do with the number of mistakes, otherwise it's all a waste of time!

Not really. Imagine two systems with the same number of mistakes. (Here the mistakes can be either bugs or operator mistakes.)

One is designed such that every mistake brings the whole system down for a day with millions of dollars of lost revenue each time.

The other is designed such that when a mistake happens it is caught early, and when it is not caught it only impacts some limited parts of the system and recovering from the mistake is fast and reliable.

They both have the same number of mistakes, yet one of these two systems is vastly more reliable.

> if it's not reducing the number of mistakes, what's it all for

For reducing their impact.
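
A toy sketch of the second system's design (illustrative only; the partitioning scheme and names are invented): each unit of work fails independently, so one mistake costs one partition rather than the whole run:

```python
def process_all(batches, handler):
    """Process independently partitioned work so one mistake cannot take
    down the whole system.

    Failures are contained and reported; every other batch still succeeds.
    Returns the list of (key, exception) pairs that need recovery.
    """
    failed = []
    for key, batch in batches.items():
        try:
            handler(batch)
        except Exception as exc:
            failed.append((key, exc))  # contained: only this partition is lost
    return failed  # small, recoverable blast radius
```

Same number of mistakes either way; the difference is that a raised exception here maps to one failed partition and a cheap retry, not a day of downtime.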

colechristensen|2 years ago

Aerospace things have to be like this or they just wouldn’t work at all. There are just too many points of failure and redundancy is capped by physics. When there’s a million things which if they went wrong could cause catastrophic failure, you have to be really good at learning how to not make mistakes.

WalterBright|2 years ago

> you have to be really good at learning how to not make mistakes.

Not exactly. The idea is not not making mistakes, it's whatcha gonna do about X when (not if) it fails.

mewpmewp2|2 years ago

> Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and do my small contribution) that the software/tech industry can one day be an equal in this regard.

There's a slight difference in terms of what kind of damage an airplane malfunctioning causes compared to a button on an e-commerce shop rendering improperly for one of the browsers. My point is that the level of investment in reliability and process should be proportional to the potential damage of any incidents.

solids|2 years ago

I agree, and I also enjoy the attitude. In my profession the postmortem's goal is finding someone to blame; here the attitude is toward preventing it from happening again, no matter what. Or at least that's how I feel.

mewpmewp2|2 years ago

Your profession? Or do you mean your company? Unless it's a very specific profession I don't know about, that would usually imply the company is dysfunctional.

bomewish|2 years ago

Richard Hipp talks a lot about how SQLite adopted testing procedures directly from aviation.

switch007|2 years ago

> I can only hope that the software/tech industry can one day be an equal in this regard

I’d love to be an engineer with unlimited time budget to worry about “when, not if, X happens” (to quote a sibling comment).

But people don’t tend to die when we mess up, so we don’t get that budget.

akarve|2 years ago

Hard agree. Civil & mechanical engineering have a culture and history of blameless analysis of failure. Software engineering could learn from them.

See the excellent To Engineer Is Human on just this topic of analyzed failures in civil engineering.