top | item 46247383

(no title)

strifey | 2 months ago

Staring down the barrel of being primary on-call over Christmas for a dozen k8s clusters running thousands of nodes. How I wish it were true that we could trust computer programs to just keep running.

PagerDuty wouldn't exist if this were true.

nitwit005|2 months ago

If your workplace has a long enough history, try comparing incidents on work days versus weekends or holidays. Typically the incident rate is dramatically lower when no one is making changes.

strifey|2 months ago

Totally true, but we host other people's code (PaaS, etc.). We don't get to dictate their working hours.

It also doesn't mean nothing breaks when people aren't making changes. Certificate expiration is the classic example of something breaking _because_ someone hasn't made a change. So is a slow memory leak. There's a whole class of issues that get worse when nothing is redeployed for long enough.
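
To make the certificate case concrete, here is a rough sketch of the kind of check that catches it before the holiday page does. It assumes you already have the certificate's `notAfter` string in the format Python's `ssl` module reports from `getpeercert()`. The function names and the 30-day threshold are my own placeholders, not anything from a particular tool.

```python
import ssl
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format the ssl module reports,
    e.g. "Jan  1 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / SECONDS_PER_DAY


def needs_renewal(not_after: str, now: datetime, threshold_days: float = 30) -> bool:
    """True when the certificate expires within `threshold_days` (hypothetical default)."""
    return days_until_expiry(not_after, now) < threshold_days


if __name__ == "__main__":
    # Illustrative values only: a cert expiring one week after "now".
    now = datetime(2029, 12, 25, tzinfo=timezone.utc)
    print(needs_renewal("Jan  1 00:00:00 2030 GMT", now))
```

The point is that this runs on a schedule regardless of whether anyone deploys, which is exactly the situation where the expiry would otherwise go unnoticed.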