Rather than the Therac-25 debacle, I'd recommend looking into the Toyota 'unintended acceleration' case, and the legal fallout from it, because it is terrifying. Toyota was about as grossly negligent as it is possible to be, and the result? The court said there existed no standards they could be held legally liable for violating. So your self-driving car? It will be developed by junior developers hired as cheaply as possible, driven like slaves by business-oriented managers who care only about meeting schedules, not given the tools or information needed to do an effective job, with testing time cut short and any claimed 'industry standards' for safe coding ignored. The automotive industry lists 90+ coding practices as either 'required' or 'recommended'; Toyota followed 4 of them in its code, and the court said this was OK. Do you think Toyota spent tens or hundreds of millions of dollars rebuilding its entire development infrastructure, hiring more competent software engineers, firing the business managers who got people killed by rushing an unsafe product to market, and putting the engineers in charge of all future decisions about scheduling and release? No, of course not. If anything, they probably saw it as carte blanche to make things worse.
Wasn't the cause of the "unintended acceleration" issue hardware- and driver-error-related rather than software-related? That's what I've read in one book about Toyota management (The Toyota Way), and a Wikipedia entry on the topic also says:
> Toyota also claimed that no defects existed and that the electronic control systems within the vehicles were unable to fail in a way that would result in an acceleration surge. More investigations were made but were unsuccessful in finding any defect until April 2008, when it was discovered that the driver side trim on a 2004 Toyota Sienna could come loose and prevent the accelerator pedal from returning to its fully closed position.[4]
Based on those two sources it seems the issue was hardware related, and Toyota may have tried papering over the mat issue. The faulty mat design doesn't support your claim of shoddy software practices and hiring underpaid junior developers. That may still be the case, but it appears not to have caused the sudden acceleration (SA) issue.
For modern industrial applications, the safety circuit is often (edit: see the child comment's note on safety PLCs) managed by discrete safety relay hardware such as the Allen-Bradley GuardMaster or Pilz PNOZ. There's a good chance these weren't even available at the time of OP's application!
A common configuration involves emergency stops, guard doors, light curtains, etc. being wired in a pair of loops with the relay. The relay continuously monitors both loops (usually with a phased pulse train), and any interruption or crossover will trip the unit. Only when the loop states return to nominal will the relay permit a reset to re-enable the outputs.
The safety relay's outputs are generally connected to dumb hardware interlocks on the various dangerous bits of the machine.
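The dual-loop behavior described above can be sketched in a few lines. This is a toy model only (it ignores the pulse-train crossover detection real relays use, and the class name is made up); the point is the logic: any open or mismatched channel latches a trip, and outputs stay off until both loops are nominal and a deliberate reset is issued.

```python
class DualChannelSafetyRelay:
    """Toy model of a two-channel safety relay: trips on any open
    or mismatched channel and latches until an explicit reset."""

    def __init__(self):
        self.tripped = True  # power up in the safe (tripped) state

    def monitor(self, loop_a_closed: bool, loop_b_closed: bool) -> bool:
        # Any open channel, including a channel disagreement that could
        # indicate a crossover fault, trips the relay. The trip latches.
        if not (loop_a_closed and loop_b_closed):
            self.tripped = True
        return not self.tripped  # True = safety outputs enabled

    def reset(self, loop_a_closed: bool, loop_b_closed: bool) -> bool:
        # A reset is only honored when both loops are back to nominal.
        if loop_a_closed and loop_b_closed:
            self.tripped = False
        return not self.tripped
```

Note that `monitor(True, True)` after a trip still returns disabled outputs: the latch is the whole point, so a transient fault cannot silently re-enable the machine.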
No, most machines, even small ones, now arrive with a kind of "Safety Integrated" right in the PLC. Even the smallest PLCs, like a Siemens S7-1200, are now available with Safety Integrated, so PROFIsafe and ASi Safe are very common. It reduces A LOT of wiring. But of course it brings new problems: for example, RJ45 jacks and plugs sometimes break the connection for a very short time if you touch them, you lose a safety packet over PROFINET, and ... boom ... emergency stop.
Especially if you use a lot of drives in a machine, any kind of Safety Integrated reduces the wiring a lot and makes cabinets much, much smaller.
But on the other hand, for once, yes, I like the Pilz PNOZ. Easy to use ... and I'm pretty sure you'll still be able to buy a PNOZ in 100 years.
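The "lose a safety packet and boom" failure mode follows from how these protocols are designed: safety telegrams travel over a non-safe transport, and the receiver must assume the worst whenever a valid telegram fails to arrive within a watchdog window. A rough sketch of that consumer-side logic (simplified; real PROFIsafe also checks sequence numbers and CRCs, and the class name here is invented):

```python
class SafetyWatchdog:
    """Toy model of a fail-safe fieldbus consumer: if a valid safety
    telegram does not arrive within the watchdog time, command a safe
    stop, and latch it until an operator acknowledges."""

    def __init__(self, watchdog_s: float):
        self.watchdog_s = watchdog_s
        self.last_rx = None
        self.safe_stop = True  # fail-safe until telegrams flow

    def telegram_received(self, t: float) -> None:
        self.last_rx = t  # note: does NOT clear a latched stop

    def poll(self, t: float) -> bool:
        # Returns True if the machine may keep running. A single late
        # telegram (e.g. a bounced RJ45 contact) is enough to trip.
        if self.last_rx is None or t - self.last_rx > self.watchdog_s:
            self.safe_stop = True
        return not self.safe_stop

    def acknowledge(self, t: float) -> None:
        # Operator reset: only leaves the safe state if telegrams
        # are arriving again within the watchdog window.
        if self.last_rx is not None and t - self.last_rx <= self.watchdog_s:
            self.safe_stop = False
```

Because the stop latches, restoring the cable after a brief glitch does not restart the machine by itself, which matches the behavior complained about above.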
As a programmer, I like this approach to human safety for robots. By putting electrical interlocks on the doors that expose humans to the robot you can make it impossible for a software error to hurt a human.
For some applications where you need humans working in the same area as the robot, things get a lot harder. You probably need software involved in enforcing speed limits for the robots, and the compliance engineers I've talked to say getting safety certification for software is quite arduous. In those cases the off-the-shelf solutions the parent and child comments mention become valuable.
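One common shape for that software is speed-and-separation monitoring, in the spirit of collaborative-robot guidelines such as ISO/TS 15066: permitted speed scales with the distance to the nearest human. The thresholds below are purely illustrative, not values from any standard:

```python
def allowed_speed(separation_m: float,
                  stop_dist_m: float = 0.5,
                  full_speed_dist_m: float = 2.0,
                  max_speed_mps: float = 1.0) -> float:
    """Scale the permitted robot speed with the separation distance
    to the nearest detected human: zero inside the protective stop
    distance, a linear ramp up to full speed beyond the outer zone.
    All distances and speeds here are illustrative placeholders."""
    if separation_m <= stop_dist_m:
        return 0.0  # protective stop zone
    if separation_m >= full_speed_dist_m:
        return max_speed_mps
    frac = (separation_m - stop_dist_m) / (full_speed_dist_m - stop_dist_m)
    return max_speed_mps * frac
```

Certifying even this small function means demonstrating the sensor data feeding `separation_m` is trustworthy, which is a large part of why the certification effort is so arduous.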
When I was the CTO & VP of Engineering at Wayport (public Internet access, mostly in hotels), we designed an Ethernet switch that could use HomePNA or Ethernet PHYs. (Later adapted to also offer VDSL to an in-room modem.)
We also designed our own 802.11 access points.
All of our competitors had at least one fire, in a hotel with hundreds of people asleep. It didn't matter whether they used commercial gear or not; every one of them had a fire.
We never had one, but I was obsessed with the possibility of hurting someone because we had missed something.
They might not have paid for a source code licence. Or they did, but they never made sure they had a copy and just left it with the developer. Surprisingly common for companies to get a big binder of paperwork and an installer disk, and consider it done.
Ha! I once wrote a program to calculate commission for the salespeople at our company. I remember the director telling me he would love my numbers to be true, but thought it best I had another look. Fat-fingered decimals could have resulted in some expensive commissions!
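One standard defense for money code like this is to do the arithmetic in decimal with explicit rounding, so neither binary-float artifacts nor an implicit rounding mode can quietly inflate a payout. A minimal sketch (the function, amounts, and rate are invented for illustration):

```python
from decimal import Decimal, ROUND_HALF_UP

def commission(sale_amount: str, rate: str) -> Decimal:
    """Compute a commission in exact decimal arithmetic and round to
    cents explicitly. Amounts enter as strings so no binary floating
    point approximation ever sneaks into the calculation."""
    amount = Decimal(sale_amount) * Decimal(rate)
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

For example, `commission("19999.99", "0.035")` yields `Decimal("700.00")`: the exact product 699.99965 is rounded to cents in one visible, auditable step rather than somewhere inside a float.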
This was an enjoyable read. While I don't often lose sleep over my code (it's my kids who cause that), I do often find that my mind works on coding puzzles in my sleep, as I frequently wake up with a spark of insight.
My own personal "staying awake at night" case is an application that connected to an ancient version of Banner (a big kludge of an ERP system for universities) to handle new applicants' enrollment process and billing. I'm rather skittish when directly handling money, doubly so when the code was all written in Perl (which I had to learn in order to re-implement it in far more readable PHP, the lesser of two evils, natch) and extremely poorly documented. In retrospect, I should not have accepted a job like that.
You make a valid point; pity to see it downvoted. Please keep in mind that many dialects of BASIC had no data types other than 'string' and 'float', and that the original program used floats. Even so, it is definitely possible to make reliable software using floating point; you just have to know exactly what you are doing, and you will spend a lot of time on tests to ensure you do not end up with nonsense output in edge cases.
Floating point is used regularly in avionics and other fields involving critical computations. It is not the floating point data type itself that is problematic; what is problematic is a poor understanding of the underlying implementation details and the limitations they cause.
A good example is trying to count integers with more bits than the underlying precision of the implementation allows. But in this particular case floating point would have been my tool of choice anyway; fixed point would have introduced a lot of complications for very little, if any, gain and would not have been worth it.
So in many cases I would agree with you that rounding errors can cause huge problems, but in this particular case the inputs were in ranges where that could not happen, and the software was tested exhaustively across all input ranges to ensure well-defined behavior.
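The "counting integers with more bits than the precision allows" failure can be shown directly. An IEEE-754 double has a 53-bit significand, so every integer up to 2**53 is represented exactly; the very next integer is not, and a counter stored in a double silently stops incrementing there:

```python
import sys

# A double's significand has mant_dig (53) bits, so integers are
# exact only up to 2**53 on IEEE-754 platforms.
exact_limit = 2 ** sys.float_info.mant_dig  # 9007199254740992

assert float(exact_limit) == exact_limit             # still exact
assert float(exact_limit) + 1 == float(exact_limit)  # the +1 is lost
assert float(exact_limit + 1) == float(exact_limit)  # 2**53 + 1 rounds down
```

This is exactly the kind of limit that is harmless when the inputs are known to stay in range, and catastrophic when nobody checked.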
Please. Don't kneejerk. Any floating point errors will be multiple orders of magnitude smaller than the accuracy of either the fuel gauges or the pumps.
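A back-of-envelope check supports this. The relative rounding error per double-precision operation is bounded by machine epsilon (about 2.2e-16), while a gauge accurate to, say, ±0.5% (an illustrative figure, not a spec) carries a relative error of 5e-3, which is around thirteen orders of magnitude larger:

```python
import sys

machine_eps = sys.float_info.epsilon  # ~2.2e-16 relative error per operation
gauge_rel_error = 0.005               # a gauge good to +/-0.5% (illustrative)

# Even a crude worst-case bound over a million accumulated operations
# stays thousands of times below the instrument's own uncertainty.
accumulated = 1_000_000 * machine_eps
assert accumulated < gauge_rel_error / 1000
```

So for quantities in sane ranges, the instrument, not the float, dominates the error budget.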
Well before tooling is considered, it has to involve people and process. At the highest level, you must have a culture of "blame the process, not the people" or people will do what is natural when things go wrong: try to cover it up and avoid being blamed.
There are procedures in various safety-conscious industries for handling this kind of development. I like that you used the word "systemic" because it is literally a systems issue, not a software, or electronics, or mechanical issue. The entire system has to be considered and analyzed for potential faults.
I spent over a decade writing code for medical devices and while the software aspect of these systems was the most advanced in terms of development process (unlike what many on HN seem to think :-), everything we did had to be considered from a system perspective because even if the individual parts were designed properly, it was possible for the interactions between them to cause problems.
Procedures and documentation seem to work well for the aviation industry. Things will still go wrong, but only very rarely twice in the same way. It makes development a lot more expensive but it does work and probably is the only way that we are aware of right now that will get this done in a way that leads to acceptable outcomes.
This leads to glacial progress but I find that is preferable over the 'move fast and break shit' mentality that pervades the software industry.
otakucode | 8 years ago
elcritch | 8 years ago
snerbles | 8 years ago
_trampeltier | 8 years ago
MarkSweep | 8 years ago
gonzo | 8 years ago
And yes, it kept me up at night.
sleepychu | 8 years ago
cube00 | 8 years ago
Is that a diplomatic way of saying someone lost the source code?
asterius | 8 years ago
rodrigocoelho | 8 years ago
alex_hitchins | 8 years ago
hawktheslayer | 8 years ago
kemonocode | 8 years ago
fnord77 | 8 years ago
Hello, rounding errors. Oh hell no.
jacquesm | 8 years ago
TylerE | 8 years ago
joezydeco | 8 years ago
http://catless.ncl.ac.uk/Risks/
knodi123 | 8 years ago
w_t_payne | 8 years ago
Clearly 'blame' isn't an appropriate response. It has to involve tooling.
HeyLaughingBoy | 8 years ago
jacquesm | 8 years ago
fireismyflag | 8 years ago
draw_down | 8 years ago
[deleted]