> The ultimate and very necessary defence of a real time system against arbitrary hardware error or operator error is the organisation of a rapid procedure for restarting the entire system.

"Just pulling out the plug and sticking it back in" is a common way nowadays to get out of an unforeseen state. It has quite some history and goes back at least to the "let it crash" philosophy of Erlang. Of course this still does not work for all kinds of domains, especially when one is closer to the metal. But still, we may have found a sufficient compromise between formally verified software (and thus higher costs) and some kind of fault-tolerant software (increased productivity).

beetwenty|6 years ago

Or, in a word, "disposability". We have a lot of systems that aren't repairable, don't get debugged, don't have things fixed mid-flight.

And...it works, with respect to most existing challenges. Restarting and replacing is easy to scale up and produces clear interface boundaries.

One area where it doesn't work, and where we still fail, is security. Security failures don't appear in most systems as a legible crash or a data loss or corruption, but as an intangible loss of trust, of identity, of privacy, of service quality. We don't know who ultimately uses the data we create, and the business response generally is, "why should you care?" The premise of so many businesses, ever since we became highly connected, is to find profitable ways of ignoring and taking risks with security and to foster platforms that unilaterally determine one's identity and privileges, ensuring themselves a position as ultimate gatekeepers.

triangleman|6 years ago

I'm already starting to get tired of all these frequently-restarting Nomad jobs, and I've only been at it for a few months.