One thing to consider when looking at such things is that commercial avionics software systems are full of known limitations. I do not know if this particular 51-day limitation was intentional or not, but in general:
Avionics software starts with writing comprehensive requirements. When the software itself is developed based on those requirements, it is then tested against the requirements, always in a real functioning airplane, but also often in smaller airplane-cockpit-like rigs and in purely simulated environments.
Nobody is going to write a requirement that says "this avionics subsystem will function without error forever". Even if you thought you could make it happen, you can't test it. So there are going to be boundaries. You might say that the subsystem will function for X days. What happens after that? It may well run just fine for X+1 days, or 2X days, or 100X days. But it's only required to run for X days, and it's only tested and certified for running for X days.
I could easily imagine that this particular subsystem was required and certified for some value of X <= 51 days, and it just so happened that if the subsystem ran for more than 51 days, it started to fail. Or it could have been a genuine mistake.
But even if the intended X wasn't 51 days, there almost certainly was some intended, finite value for X. We might say, "well, my laptop has run for three years without needing a reboot". Great! Is that a guaranteed, repeatable state of operation that the FAA would certify? Probably not. And besides that, do we really want to have to endure a three-year verification test?
In most software, we are happy to say, "it should run indefinitely". For avionics software, that's insufficient. We instead say "it will run at least for some specific predetermined finite amount of time" and then back up that statement with certifiable evidence.
I work in a field that operates under similar development constraints (namely, a mature product in a mature field with well-defined requirements). Because of this, I regularly get calls from customers wondering why their system can't do X or Y in the B way instead of the A way, and I have a similar conversation. Once I explain, "no, that wasn't part of your requirements 5 years ago; if you want to change it, you'll need to pay us for more development", the requirement for whatever it was they wanted usually evaporates pretty quickly.
Uptime is also a factor. I've seen what Windows looks like when it runs out of GDI objects; it's strange. But once you see it, you can explain to the customer the importance of regular reboots/restarts.
If they knew about it there'd be no need for an AD. Boeing tried to become the aviation equivalent of a fabless chip designer with the 787 and it didn't go well at all. Turns out they had little-to-no experience managing external development and manufacturing teams. I don't know anything about the 51-day bug, but the 248-day bug caused critical failures that you really wouldn't want happening in flight.
> Nobody is going to write a requirement that says "this avionics subsystem will function without error forever".
These time limits could at least be pegged to real-life intervals at which the system is going to be shut down anyway. If the system continues to be operated past that point, skipped maintenance intervals could then be pointed to as the cause.
For example, let’s imagine that the timestamp set by the transmitting ES is close to its wrap-around value. After performing the required calculation, the receiving ES obtains a timestamp that has already wrapped-around, so it would look like the message had been received before it was actually sent.
Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today? What's more disappointing is seeing all the other incredible systemic complexity they've added, and yet the plane appears to have no mechanical backup instruments?
> and yet the plane appears to have no mechanical backup instruments[?]
This is unlikely in a modern aircraft, because mechanical instruments backing up, e.g., the artificial horizon / attitude indicator or the directional gyro (DG) / heading indicator are:
1) Mechanically complex - the attitude indicator and DG use gyroscopes which rotate at up to 24,000 RPM, along with other mechanisms. They are typically driven by electric motors, which consume relatively more power, or by vacuum, which requires vacuum lines and a vacuum pump.
2) Expensive to maintain - see (1) - they need to be serviced somewhat regularly.
3) Heavier than their solid-state counterparts.
4) Subject to dramatically different failure modes - instead of a display going dark, a DG will slowly drift as its gyroscope precesses, giving erroneous values. Same with the artificial horizon. This can lead to catastrophic results under instrument meteorological conditions (IMC), where the pilots rely solely on instruments to maintain essential things such as heading and level flight.
5) Because of (4), in need of additional redundancy so that instruments can be cross-checked against one another, which compounds (2) and (3).
> Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today
Even in TCP sequence numbers, it can be implemented incorrectly.
51 days seems to be approximately how often my Mac dies in a kernel panic or starts being plagued by persistent software problems that go away with a restart.
I remember articles about the Airbus A350 requiring reboots every N days (150ish or so?). I remember the Patriot missile system required a reboot every 24 hours or so until the software defect that caused its time counter to drift was fixed. And I'm pretty sure there are many more such cases where devices fail if kept on for too long, even in fields where you are supposed to fill out a lot of paperwork and jump through a lot of defined processes, like avionics, medical, or automotive, among a good few others (safety and all that).
We had a bug years ago where, after 50 days of uptime, all network sessions dropped on our devices. Apparently it was a session timer overflow: I think the timer was an unsigned int counting milliseconds.
inferiorhuman|3 years ago
https://www.engadget.com/2015-05-01-boeing-787-dreamliner-so...
trenchgun|3 years ago
Not by testing, but by using formal methods.
Teongot|3 years ago
https://engineering.skroutz.gr/blog/uncovering-a-24-year-old...
saratogacx|3 years ago
https://www.cnet.com/culture/windows-may-crash-after-49-7-da...
newsclues|3 years ago
Was it a cost issue?
Or was there an expectation that a regular maintenance check would occur within this time frame that involved a reboot as part of the maintenance check for diagnostics?