jjtang1 | 11 months ago

>Most of the software I wrote requires mostly no interference or fix-ups, unless of course the requirements have changed

Bug-free software is great, but changing requirements are precisely the reason on-call needs to exist.

Even if you have a whole team of engineers who write bug-free software like this guy, you'll still have failures, because the world is constantly slipping out from under your assumptions.

Customers never stop changing their usage patterns. They add load at different rates, come up with unexpected requests of all shapes and sizes, and invent new use cases that fly in the face of the original project requirements.

Even if you have created a software system with no bugs that perfectly meets both the functional and non-functional requirements of the project, changes in customer behavior will come along and change what counts as a bug. If your system has a blanket 60-second database query timeout, and everything's working fine, then there's no bug. But as soon as a new API usage pattern causes certain queries to run on average 10 times longer than before, each query holds its connection 10 times longer, and now you have connection starvation and an urgent bug to fix.
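
To put rough numbers on the timeout example, here's a back-of-the-envelope sketch using Little's Law (average in-flight queries = arrival rate × average duration). The pool size, query rate, and query durations are all made up for illustration; only the 60-second timeout comes from the scenario above.

```python
# Back-of-the-envelope sketch; every number except the timeout is hypothetical.

TIMEOUT_SECONDS = 60.0  # the blanket query timeout from the example
POOL_SIZE = 50          # hypothetical connection pool size
QPS = 20.0              # hypothetical steady query rate


def connections_needed(queries_per_second: float, avg_query_seconds: float) -> float:
    """Little's Law: average in-flight queries = arrival rate * average duration."""
    return queries_per_second * avg_query_seconds


def is_starved(queries_per_second: float, avg_query_seconds: float, pool_size: int) -> bool:
    """True when average demand for connections exceeds what the pool can supply."""
    return connections_needed(queries_per_second, avg_query_seconds) > pool_size


OLD_AVG_SECONDS = 0.5                    # hypothetical average before the new usage pattern
NEW_AVG_SECONDS = OLD_AVG_SECONDS * 10   # the same queries running 10 times longer

before = connections_needed(QPS, OLD_AVG_SECONDS)
after = connections_needed(QPS, NEW_AVG_SECONDS)

print(f"before: {before:.0f} of {POOL_SIZE} connections busy on average")
print(f"after:  {after:.0f} needed, but only {POOL_SIZE} exist")
print(f"timeout ever fires? {NEW_AVG_SECONDS > TIMEOUT_SECONDS}")
```

With these numbers the slower queries still finish well under the 60-second timeout, so the timeout never kicks in, yet demand for connections goes from a fifth of the pool to double it. No code changed, and the system went from healthy to starved.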

I'm not saying that "timely maintenance and improvement" and "a culture of perpetual ownership" won't have positive effects on reliability. But it's unrealistic to expect that any amount of responsible, careful software development will fully eliminate sudden and unexpected failures. Human on-call, as uncomfortable as it is, will remain a requirement as long as reliability is taken seriously.

FWIW, my perspective is that of someone who runs an on-call/incident management platform (Rootly).
