(no title)
slowking2 | 5 months ago
For scientific simulations, I almost always want invalid state to immediately crash the program. Invalid state is usually due to a bug, and it's often the kind of bug that can invalidate any conclusions you'd want to draw from the simulation.
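For instance, a minimal sketch of that fail-fast posture (Python; `advance` and `step` are stand-ins for the real update loop, not anything from a particular codebase):

    import math

    def advance(state, step, n_steps):
        """Advance the simulation, crashing immediately on invalid state."""
        for i in range(n_steps):
            state = step(state)
            # Non-finite values here almost certainly mean a bug upstream,
            # and letting them propagate could silently invalidate results.
            if any(not math.isfinite(x) for x in state):
                raise RuntimeError(f"non-finite state at step {i}: {state!r}")
        return state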
For data analysis, things are looser. I'll split the data into data which has been successfully cleaned, to the point where invalid state is unrepresentable, and dirty data, which I inspect manually to see whether I'm wrong about what's "invalid" or missing a cleaning step.
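Concretely, something like this (a sketch; `is_valid` stands in for whatever cleaning rules apply):

    def partition(records, is_valid):
        """Split records into cleaned data, where invalid state is
        unrepresentable downstream, and dirty data set aside for review."""
        clean, dirty = [], []
        for record in records:
            (clean if is_valid(record) else dirty).append(record)
        return clean, dirty

    # Then inspect `dirty` by hand: am I wrong about what "invalid"
    # means, or am I missing a cleaning step?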
I don't write embedded software (although I've written control algorithms to be deployed on it and have been involved in testing that the design and implementation are equivalent), but while you can't make every invalid state unrepresentable, you definitely don't punch giant holes in your state machine. A good design has clean state machines, never has an uncovered case, and should pretty much only reach a failure state due to outside physical events or hardware failure. Even then, where possible the software should provide enough information to intervene and fix certain physical issues. I've seen devices RMA'd where the root cause was a failed FPU; when your software detects the sort of error that might be hardware failure, sometimes the best you can do is bail out very carefully. But you want these unknown failures to be a once-per-thousands-or-millions-of-device-years event.
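In sketch form (Python for illustration rather than C, and the states and events are invented):

    from enum import Enum, auto

    class State(Enum):
        IDLE = auto()
        RUNNING = auto()
        FAULT = auto()

    TRANSITIONS = {
        (State.IDLE, "start"): State.RUNNING,
        (State.RUNNING, "stop"): State.IDLE,
        (State.RUNNING, "hw_error"): State.FAULT,  # e.g. suspected FPU failure
    }

    def transition(state, event):
        # No uncovered case: any (state, event) pair outside the table is
        # itself treated as a detected failure, so we bail out carefully
        # instead of falling through silently.
        return TRANSITIONS.get((state, event), State.FAULT)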
Sean is writing mostly about distributed systems, where it sounds like it's not a big deal if certain things are wrong, or where there isn't a single well-defined problem being solved. That's very different from the domains I'm used to, so the correct engineering in that situation may more often be to allow invalid state. (EDIT: it also seems very relevant that there may be multiple live systems updated independently, so you can't just force-upgrade everything at once. You have to handle more software incompatibilities gracefully.)
shepherdjerred | 5 months ago
If you have actually made invalid states unrepresentable, then it is _impossible_ for your program to transition into an invalid state at runtime.
Otherwise, you're just talking about failing fast.
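To make the distinction concrete (a sketch of my own, using Python 3.10+ union syntax; the connection example isn't from the article): with types, the invalid combination can't even be constructed, whereas the fail-fast version merely detects it.

    from dataclasses import dataclass

    # Unrepresentable: a Connected value cannot exist without a socket,
    # so there's nothing to check at runtime.
    @dataclass(frozen=True)
    class Disconnected:
        pass

    @dataclass(frozen=True)
    class Connected:
        socket_fd: int

    Connection = Disconnected | Connected

    # Failing fast: the invalid combination (connected=True, socket_fd=None)
    # is representable; we just crash the moment we construct it.
    @dataclass
    class ConnectionFlags:
        connected: bool
        socket_fd: int | None

        def __post_init__(self):
            if self.connected and self.socket_fd is None:
                raise ValueError("connected without a socket")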
cherryteastain | 5 months ago
That's not the case for scientific computing/HPC. HPC codebases will often use numerical schemes which are mathematically proven to 'blow up' (produce infs/nans) under certain conditions even with a perfect implementation; see for instance the CFL condition [1].
The fix is typically to switch to a numerical scheme better suited to your problem, or to tweak the current scheme's parameters (temporal step size, mesh, formula coefficients...). It's not trivial to find the correct settings before starting, and it's not particularly rare to encounter a job that runs fine for 2 days and then suddenly blows up.
[1] https://en.m.wikipedia.org/wiki/Courant%E2%80%93Friedrichs%E...
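A toy example of the kind of blow-up I mean: 1D linear advection with an explicit upwind scheme, run with a Courant number of 2 where the CFL condition requires <= 1. The implementation is perfectly correct; the parameters are what's wrong (all values here are arbitrary for illustration).

    import numpy as np

    def advect(u, c=1.0, dx=0.01, dt=0.02, n_steps=200):
        """Explicit upwind scheme for u_t + c*u_x = 0. Stable only when
        c*dt/dx <= 1; here it is 2, so the solution blows up to inf/nan
        even though the code faithfully implements the scheme."""
        courant = c * dt / dx
        for _ in range(n_steps):
            u = u - courant * (u - np.roll(u, 1))
        return u

    u0 = np.exp(-((np.linspace(0.0, 1.0, 100) - 0.5) ** 2) / 0.01)
    print(np.abs(advect(u0)).max())  # astronomically large, then inf/nan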
neRok | 5 months ago
I don't think the article is referring to that sort of issue, which sounds fundamental to the task at hand (the calculations themselves). To me it's about making the code flexible with regard to future changes/requirements/adaptations/etc. I guess you could consider Y2K an example of this: the problem with 6-digit date codes wasn't their practicality for handling dates in the '80s/'90s, but dates that "spanned" beyond 991231, i.e. 000101.
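The failure mode is easy to demonstrate (a sketch, using plain string comparison on YYMMDD codes):

    # Two-digit years sort correctly right up until the century rollover:
    assert "991230" < "991231"        # fine within the century
    assert not ("991231" < "000101")  # Jan 1 2000 sorts *before* Dec 31 1999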