killertypo | 8 years ago
Being able to know the true health of a service is an absolute godsend.
So many times a service had been dead for hours before anyone noticed (well, our customers noticed, but the report had to funnel up the pipeline from customer, to support, to engineering before we were aware of a real issue).
Nothing says good PR like being dead in the water for half a day with no idea.
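One way to know the "true health" of a service is to expose a health endpoint that checks the service's actual dependencies rather than just returning 200. A minimal sketch (the endpoint path, port, and dependency checks here are hypothetical placeholders, not from the comment):

```python
# Sketch of a deep health check: "up" should mean "actually able to serve",
# so the endpoint verifies dependencies instead of just answering 200.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

def check_database():
    # Placeholder: a real service would ping its database connection here.
    return True

def check_queue():
    # Placeholder: a real service would verify its message queue is reachable.
    return True

def health_status():
    """Return (http_status, payload) summarizing dependency health."""
    checks = {"database": check_database(), "queue": check_queue()}
    healthy = all(checks.values())
    return (200 if healthy else 503), {"healthy": healthy, "checks": checks}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, payload = health_status()
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run locally (blocks forever):
# HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Returning 503 when any dependency fails lets an external monitor page on-call directly, instead of waiting for a customer report to climb the support pipeline.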
ahakanbaba | 8 years ago
This also holds for services with internal clients. In other words, if your output is consumed only by other services in the same company, the same high monitoring standards must apply. Otherwise failure detection becomes very delayed and the productivity of many teams suffers. There is no worse buzzkill than explaining to other service owners what is wrong with their application.
One other important lesson we have learned is that alerts require time to mature. The thresholds need to be trained, and the alert formulation needs to be revised. Our alerts usually give a couple of false positives in the first two weeks after their creation, and during those two weeks we frequently improve the alert conditions.
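"Training" a threshold can be as simple as replacing a hand-picked initial guess with one derived from observed baseline data once a couple of weeks of samples exist. A minimal sketch, assuming a latency alert; the numbers, the mean-plus-stdev rule, and the function names are illustrative assumptions, not the commenter's actual method:

```python
# Sketch of revising an alert threshold from observed baseline data,
# so false positives from an initial hand-picked guess fade out.
import statistics

def revised_threshold(samples, multiplier=3.0):
    """Alert when a value exceeds mean + multiplier * stdev of the baseline."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return mean + multiplier * stdev

def should_alert(value, threshold):
    return value > threshold

# Week 1: a hand-picked threshold fires on ordinary traffic spikes.
baseline_ms = [120, 130, 125, 180, 140, 135, 150, 128, 132, 145]
naive = 150  # initial guess
false_positives = [v for v in baseline_ms if should_alert(v, naive)]

# After two weeks of data, derive the threshold from the baseline instead.
trained = revised_threshold(baseline_ms)
```

With this baseline, the spike to 180 ms trips the naive threshold but sits inside the trained one, which is exactly the kind of false positive that gets tuned away in the first weeks of an alert's life.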