Reverse Engineer’s Perspective on the Boeing 787 ‘51 days’ Directive

tjr|3 years ago

One thing to consider when looking at such things is that commercial avionics software systems are full of known limitations. I do not know if this particular 51-day limitation was intentional or not, but in general:

Avionics software starts with writing comprehensive requirements. When the software itself is developed based on those requirements, it is then tested against the requirements, always in a real functioning airplane, but also often in smaller airplane-cockpit-like rigs and in purely simulated environments.

Nobody is going to write a requirement that says "this avionics subsystem will function without error forever". Even if you thought you could make it happen, you can't test it. So there are going to be boundaries. You might say that the subsystem will function for X days. What happens after that? It may well run just fine for X+1 days, or 2X days, or 100X days. But it's only required to run for X days, and it's only tested and certified for running for X days.

I could easily imagine that this particular subsystem was required and certified for some value of X <=51 days, and it just so happened that if the subsystem ran for over 51 days then it started to fail. Or, it could have been a genuine mistake.

But even if the intended X wasn't 51 days, there almost certainly was some intended, finite value for X. We might say, "well, my laptop has run for three years without needing a reboot". Great! Is that a guaranteed, repeatable state of operation that the FAA would certify? Probably not. And besides that, do we really want to have to endure a three-year verification test?

In most software, we are happy to say, "it should run indefinitely". For avionics software, that's insufficient. We instead say "it will run at least for some specific predetermined finite amount of time" and then back up that statement with certifiable evidence.

Aloha|3 years ago

I work in a field that operates under similar development constraints (namely, it's a mature product in a mature field with well-defined requirements). Because of this, I regularly get calls from my customers wondering why their system can't do X or Y in the B way instead of the A way, and I have a similar conversation, in which I have to explain: "No, that wasn't part of your requirements 5 years ago; if you want to change it, you'll need to pay us for more development." That normally eliminates the requirement for whatever it was they wanted pretty quickly.

Also, uptime is a factor. I've seen what Windows looks like when it runs out of GDI objects; it's strange. But once you see it, you can explain to the customer the importance of regular reboots/restarts.

FPGAhacker|3 years ago

Sounds about right. But it’s still a critical failure for a fault of any kind to ever display incorrect information to the pilot.

inferiorhuman|3 years ago

> I do not know if this particular 51-day limitation was intentional or not

I highly doubt it was intentional. Boeing's already had to issue an AD for similar behavior on the 787:

https://www.engadget.com/2015-05-01-boeing-787-dreamliner-so...

If they knew about it there'd be no need for an AD. Boeing tried to become the aviation equivalent of a fabless chip designer with the 787 and it didn't go well at all. Turns out they had little-to-no experience managing external development and manufacturing teams. I don't know anything about the 51-day bug, but the 248-day bug caused critical failures that you really wouldn't want happening in flight.

xattt|3 years ago

> Nobody is going to write a requirement that says "this avionics subsystem will function without error forever".

These time limits could at least be pegged to the real-life intervals at which the system is going to be shut down anyway. If the system continues to be operated past that point, skipped maintenance intervals could be underlined as the cause.

trenchgun|3 years ago

It is in fact possible to write provably correct software for safety-critical applications.

Not by testing, but by using formal methods.

userbinator|3 years ago

> For example, let’s imagine that the timestamp set by the transmitting ES is close to its wrap-around value. After performing the required calculation, the receiving ES obtains a timestamp that has already wrapped-around, so it would look like the message had been received before it was actually sent.

Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today? What's more disappointing is seeing all the other incredible systemic complexity they've added, and yet the plane appears to have no mechanical backup instruments?
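A minimal sketch of the wrap-safe comparison being alluded to, in the style of TCP sequence-number arithmetic. The 32-bit width and the names `signed_diff`, `is_before`, and `message_age` are assumptions chosen for illustration; nothing here is taken from the 787's actual End System implementation.

```python
# Wrap-safe timestamp comparison on a fixed-width counter (serial-number style).
BITS = 32                  # assumed counter width, for illustration only
MOD = 1 << BITS            # the counter wraps back to 0 at this value
HALF = 1 << (BITS - 1)     # largest gap that can be ordered unambiguously


def signed_diff(a: int, b: int) -> int:
    """Return a - b computed modulo 2**BITS, mapped into [-HALF, HALF)."""
    d = (a - b) % MOD
    return d - MOD if d >= HALF else d


def is_before(a: int, b: int) -> bool:
    """True if timestamp a precedes b, assuming they are under HALF ticks apart."""
    return signed_diff(a, b) < 0


def message_age(sent: int, received: int) -> int:
    """Ticks elapsed between send and receive; stays correct across one wrap."""
    return (received - sent) % MOD


sent = MOD - 5         # stamped just before the counter wraps
received = 10          # read just after the wrap

naive_age = received - sent              # negative: "received before it was sent"
safe_age = message_age(sent, received)   # 15 ticks
```

The naive subtraction reproduces the symptom in the quoted passage, while the modular version reports the real age. The catch is the stated assumption: the two timestamps must be less than half the counter range apart, otherwise the ordering is ambiguous.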

steffan|3 years ago

To address the second part:

> and yet the plane appears to have no mechanical backup instruments[?]

This is unlikely in a modern aircraft because mechanical instruments to back up e.g., the artificial horizon / attitude indicator or directional gyro (DG) / heading indicator are:

(1) Mechanically complex - the attitude indicator and DG make use of gyroscopes which rotate at up to 24,000 RPM along with other mechanisms. They are typically powered by vacuum or electric motors which consume relatively more power (or require vacuum lines and a vacuum pump)

(2) Expensive to maintain - see (1) - they need to be serviced somewhat regularly

(3) Heavier than their solid-state counterparts

(4) Have [dramatically] different failure modes - instead of a display going dark, a DG will slowly drift as the gyroscope precesses, giving erroneous values. Same with the artificial horizon. This can lead to catastrophic results under instrument meteorological conditions (IMC) where the pilots rely solely on instruments to maintain essential things such as heading and level flight.

(5) Because of (4) they require additional redundancy to ensure instruments can be cross-checked with one another. This compounds (2) and (3)

Teongot|3 years ago

> Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today

Even in TCP sequence numbers, it can be implemented incorrectly.

https://engineering.skroutz.gr/blog/uncovering-a-24-year-old...

rootusrootus|3 years ago

Fascinating analysis. I know planes get used a lot, but I'm surprised that they go for such a long time without ever being powered down.

Aperocky|3 years ago

51 days seems to be approximately how often my Mac dies in a kernel panic or starts to be bugged by persistent software problems that go away with a restart.

pixelfarmer|3 years ago

I remember articles about the Airbus A350 requiring reboots every N days (150ish or so?). I remember the Patriot missile system required a reboot every 24 hours or so until they fixed the software defect which caused the time counting to drift. And I'm pretty sure there are many more such cases where devices fail if kept on for too long, even in spaces where you are supposed to fill out a lot of "paper"work + jump through a lot of defined processes, as in the avionics, medical, or automotive fields, among a good few others (safety and all that).
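For a sense of how a time count can drift with uptime, here is a rough sketch based on how the Patriot defect is commonly described: a 0.1 s tick stored as a truncated binary fraction. The 23 fractional bits and the 100-hour uptime are assumptions taken from that common account, not from this thread.

```python
from fractions import Fraction

FRACTION_BITS = 23                 # assumed fractional bits kept for the 0.1 s tick
TICKS_PER_SECOND = 10              # the clock counts tenths of a second

# 0.1 has no exact binary representation, so the stored constant is chopped short.
exact_tick = Fraction(1, 10)
stored_tick = Fraction((2 ** FRACTION_BITS) // 10, 2 ** FRACTION_BITS)
error_per_tick = exact_tick - stored_tick      # roughly 9.5e-8 seconds


def drift_seconds(uptime_hours: int) -> float:
    """Accumulated clock error after the given uptime, in seconds."""
    ticks = uptime_hours * 3600 * TICKS_PER_SECOND
    return float(error_per_tick * ticks)


drift_100h = drift_seconds(100)
print(f"{drift_100h:.4f} s")       # 0.3433 s
```

The error per tick is tiny, but it only ever accumulates, which is why a periodic reboot works as a stopgap: it resets the tick count and with it the drift.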

Yizahi|3 years ago

We had a bug years ago that after 50 days of uptime all network sessions dropped on our devices. Apparently it was a session timer overflow in a variable. I think it was unsigned int and time was in milliseconds.

mormegil|3 years ago

149 hours, see https://en.wikipedia.org/wiki/List_of_software_bugs#Transpor...

rcyeh|3 years ago

tl;dr: 51 days is the wraparound point of a signed 6-byte counter running at 33 MHz, used to invalidate stale data from instruments.

saratogacx|3 years ago

When I saw 51 days, my first thought was that it had to be a time rollover, mainly because of this bug from long ago and how close the time spans are.

https://www.cnet.com/culture/windows-may-crash-after-49-7-da...

drewrv|3 years ago

This assumes there is no margin of error baked into the 51 day rule, which surprises me.

taneq|3 years ago

I feel like even 51 minutes might be too long to wait before invalidating stale instrument data on an aeroplane...

Gibbon1|3 years ago

All I have to say is if my firmware barfs after being up for 8.919 million years I won't care.

junar|3 years ago

Please add (2020) to the title.

kreelman|3 years ago

This is a really good analysis of the issue from just the verbiage from the FAA. Well done.

acdanger|3 years ago

Reminds me of the LAX Air Traffic Control Shutdown of 2004: https://m.slashdot.org/story/49885

newsclues|3 years ago

Why?

Was it a cost issue?

Or was there an expectation that a regular maintenance check would occur within this time frame that involved a reboot as part of the maintenance check for diagnostics?

Taniwha|3 years ago

51 days is slightly more than 2^32 milliseconds?

kelnos|3 years ago

If that were the issue, then they'd have to reboot it every 49.7 days, no? Waiting 51 days would trigger the problem they're trying to avoid.
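The rollover figures floated in this thread are easy to check with a little arithmetic. The sketch below assumes a free-running counter that simply wraps when it runs out of bits; the helper name `rollover_days` is made up for illustration, and the 33 MHz clock is taken as exactly 33,000,000 Hz, which is also an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rollover_days(usable_bits: int, ticks_per_second: float) -> float:
    """Days until a free-running counter with `usable_bits` of range wraps."""
    return (2 ** usable_bits) / ticks_per_second / SECONDS_PER_DAY


# Unsigned 32-bit counter of milliseconds (the classic 49.7-day rollover).
ms32 = rollover_days(32, 1_000)

# Signed 6-byte counter (47 usable bits) at an assumed exact 33 MHz.
signed48_at_33mhz = rollover_days(47, 33_000_000)

print(f"{ms32:.2f} days")               # 49.71 days
print(f"{signed48_at_33mhz:.2f} days")  # 49.36 days
```

Under these assumptions both candidates wrap a little before 50 days, not at 51, so neither lines up exactly with the interval in the directive.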

55 comments