Most of the time, checking "typical" infrastructure thresholds yields more noise than signal. By typical thresholds I mean things like CPU usage %, memory consumed and so on. My usual recommendation for clients who want to implement infrastructure monitoring is "Don't bother". In most cases you are better off measuring impact on user-facing services, such as web page response times, task completion times for batch jobs and so on. If a client insists on monitoring their infrastructure, I tell them to monitor different metrics.

For CPU, check for CPU IOWait.
For memory, check for memory swap-in rate.
For disk, check for latency or queue depth.
For network, check for dropped packets.

All you want to check at an infrastructure layer is whether there is a bottleneck and what that bottleneck is. Whether an application is using 10% or 99% of available memory is moot if the application isn't impacted by it. The above metrics are indicators (but not always proof) that a resource is being bottlenecked and needs investigation.
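As a rough illustration of the CPU and memory checks, here is a minimal Python sketch that derives an IOWait percentage and a swap-in rate from two samples of Linux's /proc/stat and /proc/vmstat. It assumes a Linux host; the function names and the idea of sampling twice and diffing the counters are my own choices for the example, not a prescribed method.

```python
def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(stat_before, stat_after):
    """Share of CPU time spent in iowait between two /proc/stat samples."""
    # Counter order: user nice system idle iowait irq softirq steal ...
    # Only the first eight are summed; guest time is already part of user.
    before = parse_cpu_line(stat_before)[:8]
    after = parse_cpu_line(stat_after)[:8]
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def parse_pswpin(vmstat_text):
    """Return the cumulative swapped-in page count from /proc/vmstat text."""
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "pswpin":
            return int(fields[1])
    raise ValueError("no 'pswpin' line found")


def swap_in_rate(vmstat_before, vmstat_after, seconds):
    """Pages swapped in per second between two /proc/vmstat samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (parse_pswpin(vmstat_after) - parse_pswpin(vmstat_before)) / seconds
```

To use it, read both files, sleep for an interval, read them again, and pass the two snapshots in. A sustained non-zero swap-in rate or a high IOWait share is a prompt to investigate, not proof of a bottleneck.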

Monitor further up the application stack: check error code rates over time, implement tracing to the extent that you can for core user journeys, and ignore infrastructure-level monitoring until you have no choice.
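For the "error code rates over time" part, a minimal Python sketch might look like the following. It assumes you already have request records as (unix_timestamp, http_status) pairs, for example parsed from access logs; the function name, the 60-second window, and treating only 5xx responses as errors are assumptions made for the example.

```python
from collections import defaultdict


def error_rate_by_window(requests, window_seconds=60):
    """Map each window start time to the fraction of requests that returned 5xx."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for timestamp, status in requests:
        # Align each request to the start of its fixed-size window.
        bucket = int(timestamp // window_seconds) * window_seconds
        totals[bucket] += 1
        if status >= 500:
            errors[bucket] += 1
    return {bucket: errors[bucket] / totals[bucket] for bucket in sorted(totals)}
```

Alerting on a rise in this ratio, rather than on a raw error count, keeps the signal meaningful as traffic grows or shrinks.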
