> In one much-discussed incident, a 2008 Qantas flight over Western Australia fell hundreds of feet twice within 10 minutes, injuring dozens of passengers on board.
This was Qantas Flight 72 [1] and should be of interest to all critical system engineers.
There was a fault, an unidentified cause that corrupted data in ONE of three redundant Air Data Inertial Reference Units (ADIRUs).
The fault might have been a cosmic ray incident, or EM interference, or the usual array of software bugs, data corruption, hardware faults, etc.
The serious design issue was that the spike in one unit was badly handled by the "failsafe" logic that was there to ride out bad juju coming from one of three redundant units.
Shit happens; it's nice to have toilet paper that's effective when things are about to hit a fan.
> the spike in one unit was badly handled by the "failsafe" logic
In other words, this sounds like a reasonably easily detectable bug that was allowed into production due to insufficient system verification, the fault for which lies squarely on Airbus and the approving regulators. Arguably the simplest design for redundancy-based high-availability systems is that when the system enters a non-quorum state, dissenting inputs should be flagged and discarded, and, if recovery is initiated, their subsystem fully reset.
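A minimal sketch of that simplest design, in Python. The threshold and structure here are illustrative assumptions, not anything from the actual avionics:

```python
def vote(readings, tolerance):
    """2-of-3 voter: flag and discard dissenting inputs.

    `readings` holds one value from each of three redundant units.
    A unit is part of the quorum if it agrees with at least one
    other unit to within `tolerance`; anything else is flagged.
    Returns (voted_value, flagged_unit_indices); voted_value is
    None when no quorum exists.
    """
    quorum = [
        i for i in range(3)
        if any(abs(readings[i] - readings[j]) <= tolerance
               for j in range(3) if j != i)
    ]
    flagged = [i for i in range(3) if i not in quorum]
    if len(quorum) < 2:
        return None, flagged          # non-quorum state: no output
    value = sum(readings[i] for i in quorum) / len(quorum)
    return value, flagged
```

With readings like (250.1, 250.3, 510.0) knots and a 1-knot tolerance, the third unit is flagged and the output is the average of the two agreeing units; in the design described above, a flagged unit would then be fully reset before rejoining the quorum.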
A more complicated design would be a mechanism for evaluating the extent to which inputs differ from those anticipated based upon other known state or inputs (last known position, inertia, airspeed, etc.), and for discarding those that are most unlikely or least supported. However, the tiny fraction of potential failure conditions for which this provides an enhanced recovery path is largely outweighed by the greater complexity of state, the processing overhead, the lack of transparency in decision making, and the increased development and testing time (thus system cost).
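A toy version of such a plausibility check. The constant-rate dynamics model and all parameter names here are deliberately crude illustrations, not a real gating scheme:

```python
def plausible(measured, last_value, rate, dt, max_innovation):
    """Model-based plausibility gate for one sensor channel.

    Predict the next value from the last accepted value and its
    known rate of change, then reject readings whose innovation
    (measurement minus prediction) is implausibly large.
    """
    predicted = last_value + rate * dt
    return abs(measured - predicted) <= max_innovation

# An altitude that jumps 400 ft in 0.1 s while climbing at 5 ft/s
# is discarded rather than acted upon.
```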
A better investment of additional system design resources may be in creating a trust metric within the higher-level flight control systems that can reduce risk by avoiding autonomous actions based on subsystems that have entered a low-trust (e.g. non-quorum) state.
And indeed, the report (page 21) states:
> At 0440:26, one of the aircraft’s three air data inertial reference units (ADIRU 1) started providing incorrect data to other aircraft systems. At 0440:28, the autopilot automatically disconnected, and the captain took manual control of the aircraft.
The report reveals they had nominally independent autopilots running on nominally independent computers, with nominally independent flight displays, all of which were of use during incident recovery. However, the number of systems reported to have broken (autotrim, cabin pressure, GNSS/RNAV, autobrake, the third computer) strongly suggests a deep and systemic failure in the core flight control architecture, probably stemming from a failure to isolate and discard bad data from the malreporting subsystem.
A heterogeneous array of redundant subsystems (i.e. from different manufacturers, or with differing dates or places of manufacture) is nominally more likely to survive a fault event. In this event, all the ADIRU units were identical LTN-101 models from Northrop Grumman (who, as a major military avionics contractor, one would have incorrectly assumed understood the value of neutron shielding: https://www.sciencedirect.com/science/article/pii/B978012819...).
It is also worth noting that the ADIRU units are designed to calculate, maintain and report inertial navigation state. Because they carry this state, sensor errors may compound or persist over time.
However, page 41 reveals that while each autopilot runs on an independent computer, Autopilot 1 on FMGEC 1 trusts ADIRU 1 as its "main" source, and likewise for #2. This suggests a "true quorum feed" is not obtained, possibly to avoid making whatever combines the feeds a single point of failure (SPOF).
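One way to use all three feeds without designating a "main" source is mid-value selection: take the median, so a single malreporting unit can never drive the output. A sketch of the idea, not a claim about Airbus's actual logic:

```python
def mid_value_select(a, b, c):
    """Return the median of three redundant inputs.

    A single faulty unit can pull the median toward one of the two
    good values, but can never take the output outside their range.
    """
    return sorted([a, b, c])[1]
```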
It would be interesting to discuss the current design of such systems with an Airbus engineer and to what extent that incident changed their internal test and design processes and sensor data architecture.
People love to talk about cosmic rays, but in the business (of safety-related systems) it is well known that most single-event upsets (SEUs)/soft errors are caused by the microchips (or their packaging) themselves.
> "In the terrestrial environment, the key radiations of concern are alpha particles emitted by trace impurities in the chip materials themselves"
(https://www.ti.com/support-quality/faqs/soft-error-rate-faqs...)
Not saying that makes 'em cosmic ray proof, but my understanding is it can harden you by an order of magnitude or more (and help guard against glitches from other sources).
If I were designing something this life-crucial from silicon up, every single bit of memory (including registers) would have extra bits for this, and checks would occur everywhere it's moved or used (even over buses inside ICs, regardless of whether they shift data, addresses, control logic, etc.). My code would go to great lengths to verify it hasn't been corrupted. The culture of redundancy might be akin to that seen in the Space Shuttle control systems.
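To illustrate the "extra bits everywhere" idea, here is a Hamming(7,4) single-error-correcting code, the textbook ancestor of memory ECC. A toy sketch; real hardware uses wider SECDED codes that also detect double-bit errors:

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value into a 7-bit Hamming(7,4) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]  # data bits d0..d3
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword positions 1..7: p1 p2 d0 p3 d1 d2 d3
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(code):
    """Decode a 7-bit codeword, correcting any single flipped bit."""
    c = code[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4,5,6,7
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1         # syndrome = 1-based error position
    d = [c[2], c[4], c[5], c[6]]
    return sum(b << i for i, b in enumerate(d))
```

Any single bit flip, whether from an alpha particle or a cosmic ray secondary, is located by the syndrome and corrected; the cost is three check bits per four data bits, which is exactly the kind of silicon-versus-safety trade discussed here.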
Silicon is so damn cheap nowadays it shouldn't be the constraint. That it's hard to source components of such pedigree is disheartening. Basic ECC should be an industry norm for all but the lowest-end chips, and as feature sizes shrink further I hope sound engineering will become a bigger product differentiator.
(Look at how HDDs have been getting more bits of ECC as platters become denser.)
If more die for our buck doesn't equal less die for ourselves then we're doing something wrong.
Also, here's an interesting paper about radiation effects on pacemakers (from the perspective of cancer treatment) that I came across when searching whether radiation-hardening in this field is a thing:
https://aapm.onlinelibrary.wiley.com/doi/pdf/10.1118/1.59725...
> every single bit of memory (including registers) would have extra bits for this, and checks would occur everywhere it's moved or used (even over buses inside IC's, regardless of whether they shift data, addresses, control logic, etc).
That's one of the two approaches used for automotive silicon. The other, which is a bit more expensive but usually considered safer, is to run the same circuit twice in parallel (usually with a few cycles of offset) and compare the outputs. If those outputs don't match, the system is reset into a fail-safe mode.
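In software terms, the lockstep approach looks roughly like this. Real automotive parts do it in hardware, with two cores offset by a few clock cycles; this sketch only captures the compare-and-fail-safe shape, and all names are illustrative:

```python
def run_in_lockstep(compute, inputs, fail_safe):
    """Run the same computation twice and compare the outputs.

    On a mismatch (e.g. a transient fault corrupted one run) we do
    not try to guess which result is right: with only two copies
    there is no majority, so the only safe move is the fail-safe.
    """
    first = compute(inputs)
    second = compute(inputs)
    if first != second:
        return fail_safe()
    return first
```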
It's been 10 years since I designed them, but back then, no, they didn't use ECC. Silicon is pretty cheap, but power is very expensive. ECC drives up memory power quite a bit; both static and dynamic power are impacted.
In practice what you do is design it to crash easily and recover quickly. Computers are fast, hearts are slow. Your microcontroller can go "huh, that set of readings is bollocks, let's restart and try again" several times between beats.
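The crash-easily/recover-quickly pattern, sketched below. The function names and the retry budget are illustrative assumptions:

```python
def read_with_retries(read, looks_sane, max_attempts=5):
    """Re-read a sensor until the value passes sanity checks.

    Computers are fast, hearts are slow: a microcontroller can
    restart a bogus measurement several times between beats. In
    firmware each failed attempt would be a subsystem reset, not
    just another function call.
    """
    for _ in range(max_attempts):
        value = read()
        if looks_sane(value):
            return value
    return None  # escalate: fall back to fail-safe pacing behaviour
```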
Even better, a pacemaker is not literally constantly driving a patient's heart, continuously. It's giving it a nudge every now and again if it's drifting out of whack. Think PLL rather than master clock ;-)
From the article (https://www.newyorker.com/magazine/2018/12/10/the-friendship...):
> On Sanjay’s monitor, a thick column of 1s and 0s appeared, each row representing an indexed word. Sanjay pointed: a digit that should have been a 0 was a 1. When Jeff and Sanjay put all the missorted words together, they saw a pattern—the same sort of glitch in every word. Their machines’ memory chips had somehow been corrupted.
> Sanjay looked at Jeff. For months, Google had been experiencing an increasing number of hardware failures. The problem was that, as Google grew, its computing infrastructure also expanded. Computer hardware rarely failed, until you had enough of it—then it failed all the time. Wires wore down, hard drives fell apart, motherboards overheated. Many machines never worked in the first place; some would unaccountably grow slower. Strange environmental factors came into play. When a supernova explodes, the blast wave creates high-energy particles that scatter in every direction; scientists believe there is a minute chance that one of the errant particles, known as a cosmic ray, can hit a computer chip on Earth, flipping a 0 to a 1. The world’s most robust computer systems, at NASA, financial firms, and the like, used special hardware that could tolerate single bit-flips. But Google, which was still operating like a startup, bought cheaper computers that lacked that feature. The company had reached an inflection point. Its computing cluster had grown so big that even unlikely hardware failures were inevitable.
You get to design systems that are supposed to work under a shower of particles. A solar storm during a solar maximum is no joke and requires some serious fault tolerance.
> Can there be a hardware random number generator that solely relies on such cosmic particles etc. hitting it?

Radioactive-decay generators (e.g. Fourmilab's HotBits, quoted below) work on essentially this principle. The generator measures the time between two events (call this T1) and then the time between the next two events (call this T2).
If T1 > T2, you've got a 1. If T1 < T2, you've got a 0. If T1 = T2, throw it out.
As this measures the time between events rather than the rate of events, a decaying source just gives fewer bits over time rather than less randomness.
> The trick I use was dreamed up in a conversation in 1985 with John Nagle, who is doing some fascinating things these days with artificial animals. Since the time of any given decay is random, then the interval between two consecutive decays is also random. What we do, then, is measure a pair of these intervals, and emit a zero or one bit based on the relative length of the two intervals. If we measure the same interval for the two decays, we discard the measurement and try again, to avoid the risk of inducing bias due to the resolution of our clock.
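The interval-comparison trick is easy to simulate. Here the Poisson decay process is faked with exponential inter-arrival times; this is a simulation of the extraction step, obviously not a real entropy source:

```python
import random

def decay_bits(interarrival_times):
    """Turn decay inter-arrival times into random bits.

    Compares successive non-overlapping pairs (T1, T2):
    T1 > T2 -> 1, T1 < T2 -> 0, ties discarded to avoid bias
    from finite clock resolution.
    """
    bits = []
    for i in range(0, len(interarrival_times) - 1, 2):
        t1, t2 = interarrival_times[i], interarrival_times[i + 1]
        if t1 > t2:
            bits.append(1)
        elif t1 < t2:
            bits.append(0)
    return bits

# Simulated decay process: exponential inter-arrival times.
rng = random.Random(42)
intervals = [rng.expovariate(1.0) for _ in range(10000)]
bits = decay_bits(intervals)
```

Because each comparison only asks which of two exchangeable intervals is longer, the 0/1 split stays close to even regardless of the source's decay rate; a weaker source just produces bits more slowly.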
From Wikipedia's description of Random.org (https://en.wikipedia.org/wiki/Random.org):
> A binary digit (bit) can be either 0 or 1. There are several Random.org radios located in Copenhagen, Dublin, and Ballsbridge, each generating 12,000 bits per second from the atmospheric noise picked up. The generators produce a continuous string of random bits which are converted into the form requested (integer, Gaussian distribution, etc.)
As far as I know, the count rate for background radiation you get with cheap compact sensors (say, the kind you find on Geiger counters) is pretty low: a dozen to a hundred counts per minute. That's not a lot of entropy. You could use much more precise detectors, but then you couldn't just plug it into a PCIe slot. At this scale, you probably get more entropy from jitter on the electrical circuit. And it's pointless to have one huge centralized RNG, at least from a security standpoint.
Sure .. although you'll find it will have a (perhaps surprisingly) consistent distribution of energies and timings, and here on Earth it will fluctuate with the density of atmosphere above (height, humidity, temperature) and its relationship to Earth's magnetic flux lines, with a partial coupling to solar output.
Airborne radiometric ground surveys run calibration flights at varying altitude to build an estimate of cosmic activity in order to subtract that from ground events originating from Uranium, Potassium, Thorium, Radon, etc.
Wouldn't there be a more likely explanation for an unintended bit flip than a cosmic ray? Perhaps some random hardware effect like an unintentional 'row hammer' bit flip in another part of the system, a very rare hardware race condition, a very unlikely quantum-tunneling event for the node size, etc.?
Wow, a pacemaker malfunctioning due to a stray cosmic ray is incredible/alarming; I've only heard of these single-event upsets in the context of aerospace applications, not a medical device.
[1] https://en.wikipedia.org/wiki/Qantas_Flight_72
Full Air Safety Report (313 pages): https://www.atsb.gov.au/media/3532398/ao2008070.pdf