top | item 45648078

tudelo | 4 months ago

Alerting has to be a constant iterative process. Some things should be nice-to-know, and some things should be "halt what you are doing and investigate". The latter really need to be decided based on how your SLIs/SLAs are defined, and they need to be high-quality indicators. Whenever one of the halt-and-investigate alerts starts to become less high-signal, it should be downgraded or its threshold should be raised. Like I said, an iterative process. For a system owned by a team, there should be an occasional semi-formal review of current alerting practices, and when someone on-call notices flaky/bad alerting, they should spend time tweaking/fixing it so the next person doesn't have the same churn.
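To make that review loop concrete, here's a minimal sketch of one way to track whether a "halt and investigate" alert is still high-signal: record how often each alert fired vs. how often on-call marked it actionable, and flag low-precision alerts for downgrade or a higher threshold. All names and thresholds here are made up for illustration, not from any specific tool.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    fired: int = 0       # times the alert paged someone
    actionable: int = 0  # times on-call marked it as a real issue

    @property
    def precision(self) -> float:
        # Fraction of fires that were worth investigating.
        return self.actionable / self.fired if self.fired else 1.0


def review(alerts: dict[str, AlertStats], min_precision: float = 0.5) -> list[str]:
    """Return alerts whose signal has degraded enough to revisit.

    Require a handful of fires before judging, so one false positive
    doesn't immediately flag a rarely-firing alert.
    """
    return [name for name, s in alerts.items()
            if s.fired >= 5 and s.precision < min_precision]


stats = {
    "api_error_rate": AlertStats(fired=12, actionable=11),
    "disk_usage_warning": AlertStats(fired=20, actionable=3),
}
print(review(stats))  # ['disk_usage_warning']
```

Nothing fancy, but having even this data on hand turns the semi-formal review from "does anyone remember this alert being noisy?" into a concrete list.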

There isn't a simple way, but having some tooling to go from alert -> relevant dashboards -> remediation steps can help cut down on the process. It takes a lot of time investment to make these things work in a way that saves you time instead of costing more time solving issues. FWIW, I think developers need to be deeply involved in this process and basically own it. Static thresholds would usually just be a warning to look at later; you want more service-level indicators. For example, if you have a streaming system, you probably want to know if one of your consumers is stuck or behind by a certain amount, and also if there is any measurable data loss. If you have automated pushes, you would probably want alerting for a push that is x amount of time stale. For RPC-type systems, you would want recurrent health checks that might warn on CPU/etc. but put higher-severity alerting on whether responses are correct and as expected, or not happening at all.

As a solo dev it might be easier to just do the troubleshooting process every time, but as a team grows it becomes a huge time sink, and troubleshooting production issues is stressful, so the goal is to make it as easy as possible. Especially if downtime == $$.

I don't have good recommendations for tooling because I have used mostly internal tools but generally this is my experience.


yansoki | 4 months ago

This is an incredibly insightful and helpful comment, thank you. You've articulated exactly what I was thinking when writing this post. The phrase that stands out to me is "constant iterative process." It feels like most tools are built to just fire alerts, not to facilitate the crucial, human-in-the-loop review and tweaking process you described. A quick follow-up question, if you don't mind: do you feel that this iterative process of reviewing and tweaking alerts is well supported by your current tools, or is it a manual, high-effort process that relies entirely on team discipline? (This is the exact problem space I'm exploring. If you're ever open to a brief chat, my DMs are open. No pressure at all; your comment has already been immensely helpful, thanks.)