slap_shot | 2 years ago

I'm surprised how often I speak to technical teams that don't use PagerDuty (or an equivalent). Because PagerDuty integrates with nearly any external system, it separates the collection of telemetry from the incident response lifecycle: What is wrong? Who should be, or already is, looking into it? What did we learn from it? How often does it happen?

Personally, I find notifications in Slack to be an anti-pattern: a lot of teams expect someone to just "pick up" the incident based on their availability or expertise, and _maybe_ the resolution gets documented. Assigning direct responsibility by component and on-call schedule, and appending the RCA to the incident, reduces time-to-resolution and the overall toil of the process.
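To make "responsibility by component and on-call schedule" concrete, here is a rough Python sketch. It is not PagerDuty's API or data model; the `Shift` and `Incident` types, the function names, the component name, and the responders are all made up for illustration. The idea is only that an alert for a component is assigned to whoever is on call at that moment, and the incident cannot be closed without an RCA attached.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Shift:
    """One on-call shift (hypothetical type): responder covers [start, end)."""
    responder: str
    start: datetime
    end: datetime


@dataclass
class Incident:
    """Minimal incident record (hypothetical type)."""
    component: str
    summary: str
    assignee: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None
    rca: Optional[str] = None


def on_call(component: str, at: datetime, schedules: Dict[str, List[Shift]]) -> str:
    """Return the responder whose shift for this component covers `at`."""
    for shift in schedules.get(component, []):
        if shift.start <= at < shift.end:
            return shift.responder
    raise LookupError(f"nobody is on call for {component!r} at {at.isoformat()}")


def open_incident(component: str, summary: str, at: datetime,
                  schedules: Dict[str, List[Shift]]) -> Incident:
    """Open an incident and assign it directly, instead of waiting for a volunteer."""
    return Incident(component, summary, on_call(component, at, schedules), at)


def resolve(incident: Incident, at: datetime, rca: str) -> Incident:
    """Close the incident; an empty RCA is rejected so the write-up is never optional."""
    if not rca.strip():
        raise ValueError("an RCA is required to resolve an incident")
    incident.resolved_at = at
    incident.rca = rca
    return incident


# Example data: component, responders, and dates are invented for this sketch.
utc = timezone.utc
schedules = {
    "payments-api": [
        Shift("alice", datetime(2024, 1, 1, tzinfo=utc), datetime(2024, 1, 8, tzinfo=utc)),
        Shift("bob", datetime(2024, 1, 8, tzinfo=utc), datetime(2024, 1, 15, tzinfo=utc)),
    ],
}

incident = open_incident(
    "payments-api",
    "5xx rate above threshold",
    datetime(2024, 1, 9, 3, 0, tzinfo=utc),
    schedules,
)
resolve(
    incident,
    datetime(2024, 1, 9, 3, 45, tzinfo=utc),
    "Connection pool exhausted after deploy; raised pool size.",
)
time_to_resolution = incident.resolved_at - incident.opened_at
```

Because every incident carries an assignee, timestamps, and an RCA, questions like "how often is this happening?" and "how long did it take?" become queries over records rather than a scroll through chat history.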