Faults, errors, and failures

[+] lucas_membrane|2 years ago|reply

>> Errors are almost always the result of faults. Barring cosmic rays, hardware issues or really unusual race conditions between the application and the operating system, if an error occurs it is because the programmer screwed up and introduced a bug. <<

The 'cosmic ray' category has taken in an awful lot of additional territory since computers started running multiple programs at once and sharing resources between programs and with the world in general. About 60 years ago, perhaps lightheartedly, it was suggested that computer instruction sets should contain a branch-on-chipbox-full instruction. The chipbox, a shared resource, was where the cardpunch disposed of the confetti produced by punching cards. As we were learning that anything that could go wrong would, it was logically inferred that if a full chipbox was ignored long enough while punching cards, the computer would find a way to stop, fail, or burn down the building, and no existing software could prevent that possibility on its own. A comparable situation in this century is that typical computers today allow the operator to 'adjust' the system clock, but a very small fraction of software is written to accommodate all the possible consequences of non-monotonic time, time being a shared resource. And if you do write software to handle said consequences, how do you handle it if the program has no way of telling if it is running on a system on which non-monotonic time is allowed or not-allowed, catastrophic or all in a days work?

[+] sirwhinesalot|2 years ago|reply

I always love seeing these historical anecdotes on hackernews and I couldn't be happier to see one on my own post. Thank you!

[+] catears|2 years ago|reply

I liked the article so I wanted to give you some feedback. Hope it is useful to you!

- I don't think the definitions of error and failure are 100% correct as stated. Looking at the IEEE definition that you reference, I interpret error meaning the difference between the value that is stored in the program, and the correct/intended value. For example if we expect to have a value of 100, but in fact have 110, the error is 10. I don't think that whether the value is observed or not is what categorizes it as either an error or a failure. If I run my program in the debugger and find that a value is off from what it is supposed to be, does that shift it from an error to a failure?

- One point I think you should have leaned more into is how language constructs and tools can help prevent failures, or cause more of them if they are bad. You bring up the point with Haskell and Rust, and how they systematically reduce the number of faults a programmer can make. You also bring up the point of Exceptions introducing a lot of complexity. I think these two examples are great individually. I think putting them together and comparing them would have been powerful. Maybe a section that argues why Rust omitting exceptions makes it a better language. - A side note since I also hate exceptions: did you know that the most common (and accepted?) way to communicate exceptions in C# is via doc comments written manually by humans. Good luck statically analyzing that!

- A lot of the text revolves around the terms error, failure, and fault and how people use these in communication. Often with different ideas of what the words mean. Even the titles (jokingly? "correctingly"?) reference this. Even with the definition at the start, the ambiguity of these terms was not dispelled. I think a major part of that was the text using the terms like you defined them, and also the common "misunderstood" versions of the terms. I think a strategy you could have deployed here is to use less overloaded words throughout the article and sticking to those throughout the article. For example (without saying these are the best terms for the job), instead of fault, error, and failure, using defect, deviation, and detected problem.

- A note on the writing style. Many words are quoted, and many sentences use parenthesis to further explain something. At least to me, these things make the text a bit jumpy when overused. I would try to rewrite sentences that end with a parenthesis by asking myself "what is missing in the sentence so I don't need to resort to parenthesis?". Don't be afraid to break a long sentence into many!

Hope my comments come of as sincere, if not then that's on me! Good luck with your continued writing.

[+] layer8|2 years ago|reply

> A side note since I also hate exceptions: did you know that the most common (and accepted?) way to communicate exceptions in C# is via doc comments written manually by humans. Good luck statically analyzing that!

Java having checked exceptions is the primary reason I’m sticking with that language. Many libraries don’t use them, unfortunately, but an application that embraces them systematically is bliss in terms of error handling, because at any place in the code you always know exactly what can fail for what non-bug reasons.

[+] sirwhinesalot|2 years ago|reply

Constructive feedback is always appreciated!

The only thing I'll comment on is the IEEE stuff. I was taught these terms in a university course on fault tolerance. You'll find slides from various courses using them like this or similar if you search on Google, and that particular IEEE standard was mentioned as the source (I never personally read it). I have read a later standard that rather than defining error specifically, mentions all the various ways in which the term is used.

The thing is, the actual standard is irrelevant, it wasn't meant as an appeal to authority. Rather, it's a source of 3 related terms (fault/error/failure) that can be used to refer to the 3 distinct ideas discussed throughout the post.

Your suggestions for alternative names are just as valuable and just as useless, neither the ones in the standard nor your own are generally agreed upon. My hope was that by using a somewhat common triple I would have avoided pointless discussion on the terms themselves, rather than the ideas discussed in the post.

As this hackernews comment section demonstrates, I was all for naught ;)

[+] marcosdumay|2 years ago|reply

> did you know that the most common (and accepted?) way to communicate exceptions in C# is via doc comments written manually by humans.

Well, the accepted way to communicate them in Python is "we don't". I think C++ follows that same principle, but the ecosystem is extremely disconnected, so YMMV.

Java tried to do a new and very good thing by forcing the documentation of the exceptions. But since generics sucked for the first ~20 years of the language, and nobody still decided to apply them to the exceptions, it got bad results that discouraged anybody else from trying.

[+] d0mine|2 years ago|reply

There is a cost in trying to force the language to find bugs for you. More is not always better. Unlike a linter, ignoring false positives from a compiler requires more work to work around them.

Not having exceptions in the language creates a tradeoff as well. This may lead to either ignoring errors or adding non-linear boilerplate between where the issue is detected and where the code can handle it, negatively impacting readability and refactoring.

[+] unknown|2 years ago|reply

[deleted]

[+] wredue|2 years ago|reply

FWIW, the creators of rust themselves distance themselves from “if it compiles, it works”, because this is obviously not true.

If your definition of “works” ignores behavioural requirements, then I suppose.

[+] unknown|2 years ago|reply

[deleted]

[+] k3vinw|2 years ago|reply

This is why I like the Either monad found in functional programming. You either have your return value or the error. No exception handling nonsense.

[+] rowls66|2 years ago|reply

Is it really that much different? You still need to handle a Left value, and a lot like handling an exception.

[+] bubuche87|2 years ago|reply

Some thoughts. 1/ I think that it's not always possible to modify the domain. For example, I could have a function that takes a name of a file as parameter and returns a CanBeWritten object. Now, I could have a function that open a file in write mode and take an object of this type as parameter.

The issue is that between the moment I acquire this object and the moment I use it, the file could, you in fact, become non-writeable. (There was a post on hn about this idea of using the type system like this https://news.ycombinator.com/item?id=35053118 ).

I think you focus a lot on software issues and neglect the hardware ones. But it's a choice.

Still my thoughts (but at this point you already understood that it was going to be like that the entire post): I think that when a fault is detected (when it becomes a failure if I follow your definitions), an attempt to fix the problem and return to a normal state can actually fail - by incorrectly fixing the issue. Like: you have three times the same integer (redundancy) and one of them have a bit flipped. You decide that the one different from the two other is the incorrect one. You detected a problem, you tried to fix it. But it could be the case that two bitflips occured at the same position.

There is no definitive solution to that, but documenting all the detected problems AND the fixes applied to them would help.

And for the error messages ... Well, my position is that most of the time they are useless for the end user. They can be useful for the developer. For the end user, the best error message (if such a message is required) is something unique enough to be copy-pasteable on Google to find a solution that the user will not understand but will be able to apply.

I used to consider (when I started computer science) that an algorithm is like going from point A to point B on a city map. There is essentially one "good" path and a huge quantity of "wrong" paths were you can get lost. And by trying to find your way, you can make the situation even worse.

[+] sirwhinesalot|2 years ago|reply

Thanks for the input!

1 - Yes, when it comes to things that touch the hardware or the OS it's hard to encode these things at the type system level since they can change from under you. This is a great example where it is useful to handle some faults at the type level (i.e., file might be missing, remember to check) while handling others as failures (file got read-only out of nowhere... better abort what I was doing).

2 - Yup, trying to fix errors often makes it worse, which is why simply restarting is often the best way to go :)

[+] worksonmine|2 years ago|reply

A graceful message instead of letting the entire process crash is a way of handling even unexpected errors, i.e. 5XX on the web. Without anticipating them some backends will completely crash.

[+] wslh|2 years ago|reply

I think the article is wrong. Radiation/Cosmic Rays crashes require BFT (Byzantine Fault Tolerance) because it can send a 1 to a node and a 0 to another node, even if it is not malicious. For example, see [1]. Formally, CFT (Crash Fault Tolerance) does not handle this case.

[1] https://www.usenix.org/system/files/conference/atc12/atc12-f...

[+] sirwhinesalot|2 years ago|reply

Awesome paper you shared but I don't really see how my article is wrong from reading it? The focus of my article is on programmer-introduced faults not hardware failures.

[+] Const-me|2 years ago|reply

> error, which is an unobserved, incorrect internal state

For example, amount of available space on the system drive is not an internal state. However, once the number reaches zero, failures of all software are very likely to happen. The software will fail regardless of static type systems or unit tests coverage.

In my experience, external things like that (not enough disk space, not enough memory, unsupported instructions, broken TCP connections) cause large percentage of failures in the software I’m developing.

[+] svilen_dobrev|2 years ago|reply

i have used the MissingThing approach a few times, incl. as specific pre-existing NullRecord in SQL db to avoid having null-FKs. But also as NotPassed singleton for default func args etc.

In Some cases it worked perfect (like in math: 0 is just another number - as long as you don't divide by it), sometimes it needed extra "arithmetic", but still worked... and Sometimes it did not work.. and was abandoned, as handling nulls was much easier than handling ThisSpecial stuff.

Sometimes it is possible to avoid the zero thing all-together but needs a magnitude more thinking on higher levels of abstraction.

btw: has anyone heard of array-of-negative-size? Like a eating hole (if positives are a peg) .. 5 + -3 == 2, so app-ending such 3-hole to a 5-peg array will shorten it to first 2 only..

[+] nwhnwh|2 years ago|reply

Do you write for yourself or for other people?

[+] sirwhinesalot|2 years ago|reply

They're pure brain dumps on whatever is on my mind at the time. But if other people find my incoherent rambling useful, then it's worth sharing on the interwebs. Hope you got something out of it!

[+] ktpsns|2 years ago|reply

What a great text with useful references and links. Cudos to the author, who is also the OP.

[+] sirwhinesalot|2 years ago|reply

Do let me know if I got anything wrong or missed something.

[+] Dwedit|2 years ago|reply

I hate sites that make things pop up when you select text.

[+] sirwhinesalot|2 years ago|reply

Apologies for that, it's a substack thing... I used to have a self hosted website but wasted more time tweaking the theme than writing :(

[+] ATMLOTTOBEER|2 years ago|reply

I agree with everything in this article. Can we work together lol

[+] sirwhinesalot|2 years ago|reply

If you're asking I'm guessing your current work colleagues don't keep these distinctions in mind. Neither do mine x). They're great folk though!

[+] loeg|2 years ago|reply

(Because the author redefines error to tautologically mean an unhandled condition.)

[+] dang|2 years ago|reply

The article deserves a bit better so I replaced the baity title with a more representative subhead.

[+] sirwhinesalot|2 years ago|reply

Not my definition, the IEEE Standard Glossary of Software Engineering Terminology definition. Yes the title is clickbait (I mention that's intentional in the post) because people keep mixing up faults and failures, be it i.e. Elm's approach (all in on faults) vs Erlang's approach (all in on failures).

Without properly defined terminology understanding the difference in focus leads to unproductive discussion.

EDIT: In your defense the later standards appear to have made "error" a uselessly wishy-washy term again, so eh. The terms fault/error/failure as defined in the post are still used in the study of fault tolerance.

61 comments