TL;DR: on-call manages acute issues, documents the steps taken, and possibly farms out immediate work to subject matter experts. Rate on-call based on the traces they leave behind. A separate rotation, drawn from the same population but with a longer window, handles fixes. Rate that rotation based on root cause reoccurrence and general ticket stats trendlines.

Longer reply:

I have on-call experience for major services (DynamoDB front door, CosmosDB storage, OCI LoadBalancer) and have seen a lot of different philosophies. My take:

1. on-call should document their work step by step in tickets and update operational docs as they go. A ticket that just says "manual intervention, resolved" after 3 hours is useless. Documenting what's happening is actually your main job; if needed, the work to analyze/resolve acute issues can be farmed out.

2. on-call is the bus driver and shouldn't be tasked with long-term fixes (or any other tasks beyond being on-call)

3. handover between on-calls is very important: it prevents accidentally dropping the ball on longer time horizon issues. Hold handover meetings.

Probably the most controversial one: a separate rotation (with a longer window, e.g. 2 weeks) should handle tasks that are RCA related or that drive fixes to prevent reoccurrence.
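To make "rate the fix rotation on root cause reoccurrence and ticket trendlines" concrete, here is a rough sketch of how those two numbers could be pulled from a ticket export. Everything in it is an assumption for illustration: the `Ticket` shape, the `root_cause` label (presumed to be assigned during RCA), and both function names are hypothetical, not taken from any real ticketing system.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Ticket:
    opened: date
    root_cause: str  # hypothetical label assigned during RCA


def recurrence_rate(tickets: list[Ticket]) -> float:
    """Fraction of tickets whose root cause already appeared in an earlier ticket."""
    if not tickets:
        return 0.0
    seen: set[str] = set()
    repeats = 0
    for ticket in sorted(tickets, key=lambda t: t.opened):
        if ticket.root_cause in seen:
            repeats += 1
        seen.add(ticket.root_cause)
    return repeats / len(tickets)


def weekly_counts(tickets: list[Ticket]) -> dict[tuple[int, int], int]:
    """Ticket count per ISO (year, week), sorted by week, for a simple trendline."""
    counts = Counter(tuple(t.opened.isocalendar())[:2] for t in tickets)
    return dict(sorted(counts.items()))
```

A falling `recurrence_rate` over successive windows would suggest the fix rotation is actually closing out root causes; the weekly counts are only meant for eyeballing the general ticket trend, not as a score on their own.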

Managers should not be first tier on any pager rotation: if you wouldn't approve pull requests, you shouldn't be on the rotation (other than as a second-tier escalation). The reverse should also hold: if you have the privilege to bless PRs, you should take your turn in the hot seat.
