ivanstojic | 1 year ago
Longer reply:
I have on-call experience for major services (DynamoDB front door, CosmosDB storage, OCI LoadBalancer). Seen a lot of different philosophies. My take:
1. on-call should document their work step by step in tickets and update operational docs as they go: a ticket that just says "manual intervention, resolved" after 3 hours is useless; documenting what's happening is actually your main job; if needed, the work of analyzing/resolving acute issues can be farmed out
2. on-call is the bus driver, shouldn't be tasked with handling long term fixes (or any other tasks beyond being on-call)
3. handover between on-calls is very important: it prevents accidentally dropping the ball on issues with a longer time horizon; hold handover meetings
Probably the most controversial one: a separate rotation (with a longer window, e.g. 2 weeks) should handle tasks that are RCA-related or that drive fixes to prevent recurrence
Managers should not be first tier on any pager rotation: if you wouldn't approve pull requests, you shouldn't be on the rotation (other than as a second-tier escalation). The reverse should also hold: if you have the privilege to bless PRs, you should take your turn in the hot seat.