tudelo | 4 months ago
There isn't a simple way, but tooling that takes you from alert -> relevant dashboards -> remediation steps can help cut down on the process. It takes a lot of time investment to make these things work in a way that actually saves time rather than adding to the time spent solving issues. FWIW I think developers need to be deeply involved in this process and basically own it. Static thresholds should usually just be a warning to look at later; what you want is more service level indicators. For example, if you have a streaming system, you probably want to know if one of your consumers is stuck or behind by a certain amount, and also if there is any measurable data loss. If you have automated pushes, you probably want alerting for a push that is x amount of time stale. For RPC-type systems, you want recurring health checks that might warn on CPU etc., but put higher-severity alerting on whether responses are correct and as expected, or not happening at all.
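To make the three examples concrete, here is a rough Python sketch of what those checks could look like. Everything in it is hypothetical: the function names, the `Alert` and `Severity` types, and all the thresholds are made up for illustration, and it assumes you already have some way of collecting offsets, push timestamps, and probe responses. It doesn't cover data loss detection.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Alert:
    name: str
    severity: Severity
    detail: str


def check_consumer_lag(latest_offset: int, committed_offset: int,
                       previous_committed_offset: int, max_lag: int) -> Alert:
    """Flag a streaming consumer that is stuck or too far behind."""
    lag = latest_offset - committed_offset
    # Stuck: there is work waiting, but no progress since the previous check.
    if lag > 0 and committed_offset == previous_committed_offset:
        return Alert("consumer_lag", Severity.CRITICAL, f"stuck, lag={lag}")
    if lag > max_lag:
        return Alert("consumer_lag", Severity.CRITICAL, f"behind, lag={lag}")
    return Alert("consumer_lag", Severity.OK, f"lag={lag}")


def check_push_staleness(last_push_ts: float, now_ts: float,
                         max_age_s: float) -> Alert:
    """Flag an automated push that has not happened for too long."""
    age_s = now_ts - last_push_ts
    if age_s > max_age_s:
        return Alert("push_staleness", Severity.CRITICAL, f"age_s={age_s:.0f}")
    return Alert("push_staleness", Severity.OK, f"age_s={age_s:.0f}")


def check_rpc_health(response: Optional[str], expected: str,
                     cpu_pct: float, cpu_warn_pct: float) -> Alert:
    """Missing or wrong responses are critical; high CPU is only a warning."""
    if response is None:
        return Alert("rpc_health", Severity.CRITICAL, "no response")
    if response != expected:
        return Alert("rpc_health", Severity.CRITICAL,
                     f"unexpected response: {response!r}")
    if cpu_pct > cpu_warn_pct:
        return Alert("rpc_health", Severity.WARNING, f"cpu={cpu_pct:.0f}%")
    return Alert("rpc_health", Severity.OK, "healthy")
```

The point of the ordering in `check_rpc_health` is that the static threshold (CPU) can only ever produce a warning, while the things users actually notice (no response, wrong response) page someone.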
As a solo dev it might be easier to just go through the troubleshooting process every time, but as a team grows it becomes a huge time sink, and troubleshooting production issues is stressful, so the goal is to make it as easy as possible, especially if downtime == $$.
I don't have good recommendations for tooling because I have mostly used internal tools, but generally this is my experience.
yansoki|4 months ago