The name needs to change, but so does the attitude that, as engineers, we can build complex systems and assume everyone knows how to use them. A few worldwide outages I've been part of were caused by a task runner that didn't lint the command and allowed a broken bash one-liner to be executed across every system in parallel.
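To make the linting point concrete, here is a minimal sketch of a pre-flight check a task runner could apply before fanning a command out. It is not the runner from the outages above: `lint_bash`, `dispatch`, and `run_on_host` are hypothetical names, and the only real tool assumed is `bash -n`, which parses a command without executing it. A syntax check like this catches a malformed one-liner, not a well-formed command that does the wrong thing.

```python
import shutil
import subprocess
from typing import Callable, List, Tuple


def lint_bash(command: str) -> Tuple[bool, str]:
    """Syntax-check `command` with `bash -n` (parse only, nothing is executed)."""
    bash = shutil.which("bash")
    if bash is None:
        # Assumption: without a linter we fail closed rather than run unchecked.
        return False, "bash not found; refusing to dispatch an unchecked command"
    result = subprocess.run(
        [bash, "-n", "-c", command],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr.strip()


def dispatch(command: str, hosts: List[str],
             run_on_host: Callable[[str, str], object]) -> list:
    """Hypothetical runner entry point: lint first, then run on each host."""
    ok, error = lint_bash(command)
    if not ok:
        # Reject before any host is touched.
        raise ValueError(f"command rejected by lint: {error}")
    return [run_on_host(host, command) for host in hosts]
```

With this in place, something like `dispatch("if true; then", hosts, run_on_host)` would raise before reaching a single machine, because the one-liner never closes its `if`.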

Yes, it's a simple mistake, but how was a system given access to our global environment without this edge case ever being considered? In many of the meetings, the common issue is communication, even between co-workers on the same team and between internal platform providers. In one case, an outage on the storage backend, we realized after a long meeting that the internal SLA was much greater than we expected (beyond the point at which our systems would time out). It had only worked for so long because storage utilization was extremely low.


eternityforest|4 years ago

That means we need to take a very close look at whether "Real programmers" are actually anyone to emulate.

Programming culture has an almost football-field level of "No time for weakness" attitude.

If I see a possible failure mode of a system and bring it up, someone's going to tell me to stop being a clicky click Windows idiot and learn to be careful.

Preventing human error in software isn't seen as a priority, so nobody does it. People are concerned with the most reliable code rather than the most reliable code-user-hardware-task-schedule-conditions system.

Programmers need to accept software fixes for human and hardware failures. It's a lot easier to add a confirmation dialog than it is to somehow become 100% reliable at not clicking the wrong thing.
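As a rough illustration of how cheap such a fix can be, here is a sketch of a confirmation step for a destructive action. The function name and the "type the target name" rule are my own assumptions, not any particular tool's behavior; the prompt function is a parameter so the check can be exercised without a terminal.

```python
from typing import Callable


def confirm_destructive(action: str, target: str,
                        ask: Callable[[str], str] = input) -> bool:
    """Return True only if the user retypes the exact target name.

    Retyping the name (rather than pressing "y") is an assumed design choice:
    it makes a reflexive keypress insufficient to confirm.
    """
    reply = ask(f"About to {action} '{target}'. Type the target name to confirm: ")
    return reply.strip() == target


def delete_volume(name: str, ask: Callable[[str], str] = input) -> str:
    """Hypothetical destructive operation guarded by the confirmation."""
    if not confirm_destructive("delete", name, ask):
        return "aborted"
    # The real deletion would happen here; omitted in this sketch.
    return "deleted"
```

A habitual "y" gets "aborted" here; only typing the volume name goes through.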