Staring down the barrel of being primary on-call over Christmas for a dozen k8s clusters running thousands of nodes. How I wish it were true that we could trust computer programs to just keep running.
If your workplace has a long enough history, try comparing incidents on workdays versus weekends or holidays. Typically the incident rate is dramatically lower when no one is making changes.
Totally true, but we host other people's code (PaaS, etc.). We don't get to dictate their working hours.
It also doesn't mean nothing breaks when people aren't making changes. Certificate expiration is the classic example of something breaking _because_ someone hasn't made a change. A slow memory leak is another. There's a whole class of issues that get worse when nothing is redeployed for long enough.
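To make the certificate case concrete, here is a minimal sketch of the kind of check that catches an expiry before a quiet holiday week does. The function names and the 30-day threshold are my own assumptions for illustration, not anything from a particular tool; the only library call is the standard library's `ssl.cert_time_to_seconds`.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days left before a certificate's notAfter timestamp.

    `not_after` uses the format found in certificate dicts,
    e.g. "Jan  1 00:00:00 2030 GMT". `now` is seconds since the epoch
    and defaults to the current time (hypothetical parameter, handy for tests).
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def needs_renewal(not_after: str, warn_days: float = 30,
                  now: Optional[float] = None) -> bool:
    """True if the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) < warn_days
```

In practice the `not_after` string could come from the `notAfter` field of the dict returned by `SSLSocket.getpeercert()`, and something like this would run on a schedule so the alert fires while people are still around to act on it.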
nitwit005|2 months ago
strifey|2 months ago