killertypo | 8 years ago
Being able to know the true health of a service is an absolute godsend.
So many times a service had been dead for hours before anyone noticed (well, our customers noticed, but the report had to funnel up the pipeline from customer, to support, to engineering before we were aware of a real issue).
Nothing says good PR like being dead in the water for half a day with no idea.
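One way to know the "true health" of a service is to expose a health endpoint that checks the service's actual dependencies rather than just returning 200. A minimal sketch (the endpoint path, port, and dependency checks here are hypothetical placeholders, not from the comment):

```python
# Sketch of a deep health check: "up" should mean "actually able to serve",
# so the endpoint verifies dependencies instead of just answering 200.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

def check_database():
    # Placeholder: a real service would ping its database connection here.
    return True

def check_queue():
    # Placeholder: a real service would verify its message queue is reachable.
    return True

def health_status():
    """Return (http_status, payload) summarizing dependency health."""
    checks = {"database": check_database(), "queue": check_queue()}
    healthy = all(checks.values())
    return (200 if healthy else 503), {"healthy": healthy, "checks": checks}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, payload = health_status()
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run locally (blocks forever):
# HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Returning 503 when any dependency fails lets an external monitor page on-call directly, instead of waiting for a customer report to climb the support pipeline.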
ahakanbaba | 8 years ago
This also holds for services with internal clients. In other words, if your output is consumed only by other services in the same company, the same high monitoring standards must apply. Otherwise failure detection becomes very delayed and the productivity of many teams suffers. There is no worse buzzkill than explaining to other service owners what is wrong with their application.
One other important lesson we have learned is that alerts require time to mature. The thresholds need to be trained, and the alert formulation needs to be revised. Our alerts usually give a couple of false positives in the first two weeks after their creation, and during those two weeks we frequently improve the alert conditions.
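"Training" a threshold can be as simple as replacing a hand-picked initial guess with one derived from observed baseline data once a couple of weeks of samples exist. A minimal sketch, assuming a latency alert; the numbers, the mean-plus-stdev rule, and the function names are illustrative assumptions, not the commenter's actual method:

```python
# Sketch of revising an alert threshold from observed baseline data,
# so false positives from an initial hand-picked guess fade out.
import statistics

def revised_threshold(samples, multiplier=3.0):
    """Alert when a value exceeds mean + multiplier * stdev of the baseline."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return mean + multiplier * stdev

def should_alert(value, threshold):
    return value > threshold

# Week 1: a hand-picked threshold fires on ordinary traffic spikes.
baseline_ms = [120, 130, 125, 180, 140, 135, 150, 128, 132, 145]
naive = 150  # initial guess
false_positives = [v for v in baseline_ms if should_alert(v, naive)]

# After two weeks of data, derive the threshold from the baseline instead.
trained = revised_threshold(baseline_ms)
```

With this baseline, the spike to 180 ms trips the naive threshold but sits inside the trained one, which is exactly the kind of false positive that gets tuned away in the first weeks of an alert's life.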