How “let it fail” leads to simpler code

[+] mewse|3 years ago|reply

I was a believer in the "my code should never crash, no matter what" school of thought until I shipped a Dreamcast game with an out-of-date opening cutscene.

It was an in-engine opening cutscene which was very nearly final; the file we shipped was about two or three weeks out of date compared against the version that should have gone on the disc (It had one missing shape key on a character's face at the end of a shot, and a couple other missing elements). My code was wrangling the whole animation; doing all the stuff which our at-the-time-primitive animation system couldn't do itself (animating texture coordinates and etc). And my code was just silently handling all the errors it ran into so that we never even noticed that anything was wrong.

The difference was subtle enough that in the twelve years since the game was released, nobody but the original animator has ever noticed and mentioned it to me (and that, years after release). But that one experience and knowing how much worse it could have been was enough to convince me that "crashes early and crashes loudly with as much detail as possible" is by far the better strategy. At least for entertainment products. And doubly so for entertainment products which can't be patched after release.

(for clarity, this screw-up was 100% my fault. The animators had made the final changes to the cutscene data files in plenty of time for inclusion in the final build, I just somehow didn't import the changed data files into the game when I made the matching changes to the code side, and then my code didn't throw any errors to tell me or anyone else on the project that anything was wrong.)

[+] mike_hock|3 years ago|reply

Or even better, it should be the latter during development and the former in the released version.

You don't want your released game to crash in level 11 if the player happens to look behind the wrong lightpole because a texture is missing, but you do want to notice that in development.

[+] angarg12|3 years ago|reply

One of the pieces of software I'm most proud of is a service to manage the dynamic part of our infrastructure. It uses control theory and let it fail to great effect.

The service reads the state of the system, and applies change to converge to a configured policy. If it encounter an error, it doesn't try to handle or fix it, it just fails and logs a fine grained metric, plus a general error metric.

The system fails all the time at this scale, but heals itself pretty quickly. In over 1 year of operation it hasn't caused a single incident, and it has survived all outages.

[+] tsimionescu|3 years ago|reply

This is exactly why I think all the discussions about the importance of error handling paths (and the aversion drive have to exceptions) are usually overblown.

The most successful, and common, error handling strategy is to log and abandon the whole operation, cleaning up everything the operation left around. If you have one process per operation, this is often very well captured by doing exit() at the place of the error. If you don't, then exceptions are the best approximation of this pattern - much better than result types or error codes, which litter your code with irrelevant error handling details.

[+] arunaugustine|3 years ago|reply

Yes, please give us more info about using control theory and how one might think about building such a system please..

[+] fire|3 years ago|reply

this is the type of thing I'd love to see code / a post about implementing

[+] kortilla|3 years ago|reply

This is just kubernetes right? Declarative desired state model. Containers created and destroyed to get there. Crashes happen, metrics are incremented, load balancers route around the crashing pod until they recover (or are replaced), etc.

[+] secondcoming|3 years ago|reply

> fails and logs

What if your logging code was written with the same philosophy?

[+] ramraj07|3 years ago|reply

A corollary or generalized interpretation of this approach (and someone please specify if there’s a formal term for this) is: “fail locally, and immediately.”

What I mean is that once something unexpected happens your code should ideally fail in that step itself.

The simplest most common example I’ve seen with python programmers is when they pass around dicts as arguments in complex code bases. Methods expect various keys to be present, and often methods also have fail safe defaults if some keys are absent. The defaults are written for the specification, sure, but often they also tolerate unexpected exceptions that happened upstream.

Now when an unexpected exception happens, your program fails somewhere else and the stacktrace is useless. The only way to figure out what went wrong is to debug it line by line.

With python there’s still no elegant solution. I’m now trying to ensure all my methods are typed and use dataclasses and pydantic classes to type and group these parameters but there’s still opportunities for these “fail later” errors. Solutions and suggestions would be appreciated!

[+] MonkeyMalarky|3 years ago|reply

>Solutions and suggestions would be appreciated!

Ban the usage of default values or default parameters anywhere outside of top-level / public facing functions. Plus assert everything all the time.

I've gotten into arguments with other developers over it but I'll take the inconvenience in developing now over tearing hair out over bugs later, anytime.

[+] jeshin|3 years ago|reply

dicts are just a little too easy to use. You just smear it down, pass it around, and you're in business. If you really want to shoot yourself in the foot, also modify its structure here and there along the way, it's just so convenient. Who needs all that hassle of declaring a data class for each little thing?

It took me a little too long to realize that a data class represents a contract about the structure of your data, meaning that no matter how many calls deep you are passing it around, you will always know its structure without having to trace it back to the origin, and that's a powerful thing.

[+] strogonoff|3 years ago|reply

Coming back from TypeScript to Python, I found that most recent (3.10+) typing annotation shorthands are pretty succinct, and running mypy at all times really helps cut down on runtime errors.

My recipe:

— annotate variables, attributes, arguments and return values;

— run a good type linter (we use mypy) at all times;

— never pass around generic dictionaries: use dataclasses[0], TypedDicts, etc. instead.

That way you define a subclass inheriting from, say, TypedDict and declare that your function only takes that subclass. After that, you’ll get a loud error if you pass any dictionary that doesn’t match the spec (missing keys, wrong values, etc.)—ideally, right in your IDE.

(To reiterate, this would be a pointless exercise if you don’t lint all the time; most IDEs support this.)

[0] You can additionally use them with Pydantic, which can validate data at runtime at a cost of some performance overhead.

[+] bzxcvbn|3 years ago|reply

>Solutions and suggestions would be appreciated!

Use a language with strong typing?

[+] BurningFrog|3 years ago|reply

In general, I think using real classes with a single central definition instead of raw manually created dicts is the solution.

A thorough test suite is also needed, of course.

[+] packetlost|3 years ago|reply

As someone working with an extremely large Python codebase, early on we made the call to never allow dictionaries as arguments to functions (with exceptions for if the dictionary is truly arbitrary and only gets logged/persisted for human reading). We rely heavily on type annotations and dataclasses. Type system weaknesses aside, the system is rather maintainable despite its size, complexity, and domain.

[+] trav4225|3 years ago|reply

I've seen a lot of new developers shocked by this approach, which surprises me a little. They seem to think that it's up to the application to handle all errors, even those of the programmer(s). This, of course, is unreasonable since it would essentially require knowing all the bugs in advance. :-)

[+] allenu|3 years ago|reply

I'm a big fan of the "crash early" strategy. I write in Swift primarily, and if I suspect a state is impossible to reach, I'll add a fatalError() so that in development, if it turns out I'm wrong, I spot it right away. (Something I learned from another dev I worked with, who was very productive.)

Unfortunately, a lot of other devs hate to see that your code may actually crash and start asking questions about what scenario could cause it and asking if maybe there's a more gentle way to get out of the error. So, I'll often back down and start having softer error-handling, but on the whole it does complicate things further as the errors cascade and now you have to reason about handling combination of errors that have low likelihood of happening. So, to me, just having an early crash is way better.

[+] jillesvangurp|3 years ago|reply

It's a common mistake in code written by junior developers to only code the happy path. It leads to a very brittle system. A good example is a web application that needs a websocket open. What happens if you run such an application on a mobile phone and you temporarily lose connectivity and this happens multiple times as people walk around town because real world connectivity just isn't perfect? And also, they put their phone in their pocket and it goes to sleep. These are not user errors but expected, normal behavior.

Basically the happy path is that this simply never happens. You open a websocket and listen for incoming messages and process them. The actual situation is that you open a websocket and some time later it dies and then you simply attempt to reopen it until it succeeds and resume processing messages. The app has several states: connected, connecting, and not connected and should transition from one to the other depending on what happens.

Our frontend people struggled a lot with this exact issue. They only thought of the happy path and simply ignored any form of expected failure. So the first version of the app worked great for a while until it just stopped working. The fix: "just reload the app" was of course not really acceptable. All that was needed was a little defensive coding: assume this call will sometimes fail and simply try again when that happens. Then also handle the case where retrying will also fail because actually the request is wrong (input validation) and the error is the system telling you that it is wrong. If you don't have any code that handles that, you are going to have a very flaky UX.

[+] xdennis|3 years ago|reply

I was on a team for a short while (Java programmers) and their frontend code was really overly "careful". For example, they would always check if a method existed, before calling it.

    var o = new SomeObject();

    if (o.computeSomething != null && o.computeSomething != undefined) {
       o.computeSomething(...);
    }

Their reasoning was that in JavaScript (with the old syntax) you just add functions to the prototype, so you could forget to do it or mistype it.

    SomeObject.prototype.computeSomethinnn = function () ...

I was sort of tripping over myself in objections to what they were doing:

* you shouldn't check for null or undefined, but rather do `o.computeSomething instanceof Function`

* there's no need to do `!= null` and `!= undefined` because `!=` (as opposed to `!==`) actually checks for both

* you shouldn't do the check at all because if you actually mistype the function name all you're doing is hiding the error. Failing sooner is better.

* a missing method should be picked up in the unit tests (but they didn't have any tests at all because "our system is too complex to be tested automatically")

* probably some others...

That team really hated JavaScript and their code showed it.

BTW, the indentation above is not wrong... they did indent by 3 spaces. I read a story about 3 space indents on thedailywtf.com and thought that it was clearly made up... after this team I believe it.

[+] vardump|3 years ago|reply

Well, there's software that can cause some degree of harm. For example through servos controlling something physical. While you still probably can't catch all of the issues, you damn better try as hard as you can within reason.

I'd also wish for similar rigor from people developing whatever filesystens my data is on. :-)

Fail fast is generally a good idea, if you can do it safely.

[+] smackeyacky|3 years ago|reply

I don't agree with this approach. Say you have a network service that relies on other network services. It is not difficult to write those such that they know to back off / retry when something disappears.

It's extremely useful in a lot of situations: if you do work on a laptop that gets regularly unplugged, having running test services that know to reconnect makes your life easier. In production, having things automatically reconnect means a lot less restarting of services once whatever root cause problem is corrected. Just shrugging and giving up ends up being a lot more work in the end.

I like to tell junior developers to catch everything they can, and handle it or die as nicely as possible. Of course you can't plan for everything, but you can write around network and disk issues and issue warnings in a way that makes the root cause more obvious. That involves catching errors.

[+] TameAntelope|3 years ago|reply

What you're describing are "known" states; the idea behind "let it fail" is that you shouldn't write code that exhaustively handles every single potential outcome, just the ones that are part of your code's path in general use.

Definitely write code to handle network issues. Don't write code to handle random bitflips, ways to handle garbage coming back from the service you're connecting to, or try to handle OOM errors. Just let those fail.

Do not catch everything you can. That's the whole point of "let it fail". An app crashing is totally fine and expected behavior, in a lot of cases (of course if it's not fine, e.g. someone dies, don't do that but if you're working on that kind of software and taking advice from me, you're super duper screwed).

[+] verdagon|3 years ago|reply

I'd imagine retrying/reconnecting is compatible with the general "let it fail" approach. If you just sent a message/request to an actor/server and it still hasn't responded after 5 seconds, you can send another.

It wouldn't matter whether that actor/server died from a regular error or a "let it fail" error, the retrying would still work the same.

[+] akdor1154|3 years ago|reply

While the article focuses on the programming side, the other side is BEAM's Links and Supervisors which is what really allow this.

Letting BEAM handle that stuff like it is designed to could probably do a better job than your junior devs, and of course then free them up to be writing useful stuff instead.

[+] toast0|3 years ago|reply

Retrying is ideally handled at a single place though.

If the original client is going to retry for failure, including timeout, any intermediate retries are likely to result in signficant multiplication during outages, and that makes for a more difficult recovery.

It's also easy to miss reporting on intermediate retries and your system is running poorly and you didn't know.

Having things automatically reconnect is separate from automatic retries of individual requests.

[+] joshuamorton|3 years ago|reply

I've said this often specifically in the context of golang, but while you're right that retries and similar are a common case, they are fairly similar to the 'expected' error case in the article, and can almost always be handled at precisely the place where you raise the error.

In python this is

    @retry.retry(exceptions=[RpcException], tries=5, backoff=2, jitter=1)
    def my_external_rpc_call(...):
        ....

And RpcException will only be raised beyond this if the backend is unreachable for ~30 seconds.

Similarly, rpc services can abstract over this entirely, grpc (and presumably others) allow you to configure the retry policy per rpc service or method, and have it reflected everywhere that is used, without writing wrappers[0].

Which really all is to say, once you have solid libraries that handle retries of operations that are known to be error prone (file IO, network IO, things that could lock/block, etc.) you pretty quickly get into "any error implies we're totally boned".

[0]: https://github.com/grpc/grpc-go/blob/f601dfac73c9/examples/f...

[+] at_a_remove|3 years ago|reply

I struggle to find the correct descriptor for a counter-example, wherein You Really Want Success for the process as a whole, but it is acceptable for a sliver of it to fail, in the the context of ETL.

I have an ETL I am told (I switched jobs) that is still working, from 2008. It was built to be a tank, and I also did another forbidden thing: Pokemon Exception Handling. It's a guideline, not a law of physics, and it is fine to resort to a general error catch when you really don't know every possible error (and let's be honest, if you have enough libraries in the mix, some surprises will happen) and you want the other 99.999% of the data to go through. Yes, this one little thing didn't load, and let's log that, let's examine that and figure out how to prevent that going forward, but overall, the rest of the program must continue.

How did it get so tanklike? Every time a little bit failed and it got logged, I figured out what went wrong, fixed it, and then tried to generalize a class of similar errors. After a while, I got into Things I Was Told Would Never Happen in the data we ingested and programmed for when never happened. Reader, never came a little sooner than expected.

Anyway, I largely agree with the idea but there are places where you want the exact opposite, and I think it is important to look for those places lest this heuristic become so stiff it can lose utility.

[+] Mr_P|3 years ago|reply

For all the hate that Java tends to get, the language natively supports this distinction between:

* Expected errors - Checked Exceptions

* Unexpected errors - Unchecked Exceptions

Idiomatic Java also makes heavy use of asserts, e.g. using the Guava Preconditions library.

[+] WalterBright|3 years ago|reply

I learned from working on aviation systems is that when a system enters an unknown state, it must be disabled and locked out.

In software, this is known as an assertion failure. When the assert trips, the program is, by definition, in an unknown state. A program cannot reasonably be allowed to continue in an unknown state - it may launch nuclear missiles. The only thing to be done is exit directly, do not pass Go, do not collect $200.

[+] roeles|3 years ago|reply

Thanks for posting this. I have worked on non critical flight software and thought that this philosophy might work well.

I wonder how easy the certification is for such software? For work I might have to write Do178 code in the future.

[+] glouwbug|3 years ago|reply

What if the plane is mid flight?

[+] mmcnl|3 years ago|reply

I like this mindset.

[+] verdagon|3 years ago|reply

I think that the "let it fail" approach is often inevitable, even when we try to use Result<T, E>.

Often, we see an "unknown" variant in the error enum, as a catch-all for a library's unexpected errors. Then, anyone who calls them must also have an "unknown" enum. And anyone who calls them, and so on.

In the end, this "unknown" variant is similar to a panic, in that there's very few reasonable reactions to it: Log it, cancel the request, return error 500, perhaps retry.

For this reason, I often recommend people to just use assertions and panics.

[+] hyperhopper|3 years ago|reply

While everything you said is correct, there are still significant advantages to the 'result' method.

Sometimes you want to return 200 even if most of the backends fail. Sometimes one part may want to retry based on any error.

Even aside from this, disallowing exceptions leads to a very predictable control flow, and makes program state able to be expressed in the type system, which is useful for many reasons on it's own.

While yes, it's often just like an exception or panic, I'll take that over exceptions in my code any day

[+] kgeist|3 years ago|reply

>ignore unexpected exceptions

Isn't it the way it already is in practice, not something specific to Erlang? If an exception is unexpected, usually there won't be an exception handler for it, otherwise a developer pretty much expected it. Developers are generally lazy so in my practice the default is usually to let it fail, and there's usually going to be an exception handler that does something other than logging and quitting only if there's a serious reason to do so.

Maybe a more useful distinction could rather be "business logic errors" vs. everything else ("infrastructure errors", "programming errors" and "input validation errors"). Business logic should clearly define what should be done when an error happens, to avoid inconsistent state. But infrastructure-level errors or programming errors, you can't do much about them, other than log and/or retry.

[+] gabeio|3 years ago|reply

> If an exception is unexpected, usually there won't be an exception handler for it

That honestly depends on how the language and program are written, python is a great and horrible example of where you can handle any exception even ones that were just created by the program:

    try:
        crash_hard_here()
    except: # by default (and unfortunately) will catch *everything*
        pass # and this is one of the worst offenders inside an except, to just outright ignore the exception and continue as if nothing happened and to not even log it.

I can not tell you the amount of production code where I've seen catch all exceptions, and they are the lazy way to know something will not "crash" even though much worse things can happen now.

[+] czei002|3 years ago|reply

Other frameworks like express (nodejs) or actix (Rust) also don't crash if you "throw" in an request handler so this doesn't sounds very exciting to me. The interesting question for me is how retries are handled after an error occurred? For example, if the error happens in an http request handler, does the request still fails with 500 or is it magically retried by Erlang while keeping the request hanging? For internal service calls how are retries working? i.e. how can I configure that a request is retried after a failure? I guess Erlang does this and this is the power behind it?

The example of a missing file seems not very good since its a problem that is probably not solved by waiting. A better example is probably a busy DB that is temporary not reachable?

[+] rektide|3 years ago|reply

I spent a good part of this week overhauling a microservice where most fucntions were a giant try/catch & would maybe throw a new error. Just getting rid of the try catches & letting the code fail has been a huge help in seeing what is going wrong as the code executes.

I also am delighted to see the idea of expected errors here. Another thing I've been doing for a long time is tagging erros with an expected = true property when it's something we expect to see, like, oh, we went to get this oauth token but the credentials were wrong. Expectedness shows up in thr logs now & we can see mych more clearly where there are real problems.

[+] robocat|3 years ago|reply

The article doesn’t seem to look at how resources are cleaned up when a BEAM process crashes. https://elixirforum.com/t/understanding-the-advantages-of-le... says “All resources are owned by a process in Erlang, and the VM guarantees clean-up of resources once the process dies”. My Google-fu failed me when I searched for more details about Erlang process cleanup of resources, or how to register cleanup actions (e.g. delete some temporary file on crash).

[+] sergiotapia|3 years ago|reply

For our liveview project, a lot of the bugs we find are edge cases in the pattern match. We find the bug in appsignal, build another arity match and go on with our day. It's pretty cool.

I've been working in Elixir exclusively since 2016. I do think a lot of the Let It Fail is just marketing from Elixir (and BEAM) but there is a lot of truth in it. In reality you will most definitely not write everything under an explicit supervisor. You will just see errors in function clause matches and add another arity.

[+] rad_gruchalski|3 years ago|reply

The origin of Let it Crash dates to „Making reliable distributed systems in the presence of sodware errors”, Joe Armstrong, 2003: https://erlang.org/download/armstrong_thesis_2003.pdf, section 4.4.

[+] toast0|3 years ago|reply

> We find the bug in appsignal, build another arity match and go on with our day.

Failing (as crashing is now termed ;) immediately when the data didn't match the pattern is exactly the let it fail approach. If the data doesn't meet the expectations, there's nothing to do but crash. Maybe you've got a nice supervision tree, maybe not, but crashing immediately where things didn't match expectations usually gives you the right place to start looking; maybe it was some reasonable data, so you just handle it. Maybe it is unreasonable, so you need to look at where it came from, but usually (not always, of course) you just got the data and are pattern matching it, so you know where it came from too.

[+] stu2b50|3 years ago|reply

> In reality you will most definitely not write everything under an explicit supervisor.

That's the point of the supervisor _tree_. Certainly not every process will have its own supervisor, but all processes should be linked to other processes, which are linked to other processors, and at some point you have a process that is quite fundamental to the application and has a supervisor.

As long as there's a supervisor somewhere on that tree, the whole subtree will be restarted, hopefully in a non-erroneous state, and the application will continue on its merry way.

[+] torginus|3 years ago|reply

The article mentions Erlang, a functional language - which gives an interesting contrast - it is a functional language which are all about mapping out all behavior so that no undefined behavior can exist, and basically force you to consider every possibility (of course that doesn't account for stuff like network errors).

Wouldn't the same scheme be better suited for a procedural language, with deliberately dirty code full of gotchas?

I write a ton of hacky scripts, like last time I needed to rewrite 1000s of xml-s I wrote a crude regex replace for it. It worked 99% of the time, and I fixed up the rest manually. This sort of thinking - that a subprocess might fail due to whatever reason, including sloppy code - but the whole process will keep trucking on would be perfect for the this paradigm.

Additionally this would open the path for stuff like trivial hot code replacement - since in this system, a subprocess that crashes every time - like an invalid program would be handled by the system.

[+] winter_blue|3 years ago|reply

In a new language, I'd like to see exceptions being allowed in pure code, but prohibited in non-pure code. (Non-pure here meaning code with side effects.)

In pure code, an exception could essentially be passed up, and transformed into an error return value at the point where it's called by non-pure code.

[+] akdor1154|3 years ago|reply

I'd be very interested to see non-BEAM approaches to enabling this - i kind of end up in the same pattern thanks to "expected? Return an Error<E>. Unexpected? Throw." However, the supervising part is then difficult.

How do people approach this in Python? NodeJS? Rust? .NET?

[+] wolffiex|3 years ago|reply

This is the way. Exception handling is often one of the worst aspect of a production codebase, especially since it is typically added late. Though error handling strategies benefit from careful design, they are usually added piecemeal. Making errors louder and more problematic is the best way to get them the attention they deserve.

This is not a new concept and it seems to be one of the core components of the Erlang Weltanschauung. It can be generalized further to systems as the principle of "Crash-Only Software," as advanced in this classic paper: https://dslab.epfl.ch/pubs/crashonly.pdf

[+] smallerfish|3 years ago|reply

I've heard this expressed as "write brittle code", and I'm a strong advocate for it. Looking up a user by id, and get no results? Rather than either passing null up the stack, or wrapping null with an optional.empty, throw an exception! It's the client code's problem if it somehow got hold of an id that doesn't exist. (Yes, ymmv depending on the system, e.g. if you're dealing with eventual consistency then maybe do something different.)

As the article says, this of course doesn't mean that you shouldn't handle user errors, or even known system errors.

[+] _benj|3 years ago|reply

I think the distinction between expected and unexpected errors can easily fall through the cracks and writing code in a way that an unexpected error doesn’t break everything is quite powerful.

Golang makes it easy to ignore errors that can be ignored and defer/recover provide a way to implement a way to “let it fail”

There’s even an implementation of supervisor trees for Go [0] :)

[0] https://github.com/thejerf/suture

[+] rad_gruchalski|3 years ago|reply

This would be my go to for anything _supervisor_ in golang: https://github.com/asynkron/protoactor-go#supervision.

163 comments