top | item 41278373


kevg123 | 1 year ago

> What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?

* Network is another basic that should be there

* Average disk service time

* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page out rates

* TCP retransmits as a warning sign of network/hardware issues

* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.) broken down by incoming and outgoing

* Per-CPU utilization

* Rates of operating system warnings and errors in the kernel log

* Application average/max response time

* Application throughput (both total and broken down by the error rate, e.g. HTTP response code >= 400)

* Application thread pool utilization

* Rates of application warnings and errors in the application log

* Application up/down with heartbeat

* Per-application & per-thread CPU utilization

* Periodic on-CPU sampling for a short window, rendered as a flame graph

* DNS lookup response times/errors
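To make one of these concrete, here is a rough sketch of the TCP retransmit item, assuming a Linux host where `/proc/net/snmp` exposes the `OutSegs` and `RetransSegs` counters. The function names are my own, and any alert threshold on the resulting ratio is something you would have to pick for your own network.

```python
def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp-style text as a dict."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) != 2:
        raise ValueError("expected one Tcp header line and one Tcp value line")
    names, values = tcp_lines
    # Skip the leading "Tcp:" label on both lines.
    return {name: int(value) for name, value in zip(names[1:], values[1:])}


def read_tcp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_tcp_counters(handle.read())


def retransmit_ratio(before, after):
    """Fraction of segments sent between two samples that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0
```

Call `read_tcp_counters()` twice, some seconds apart, and pass both results to `retransmit_ratio`. The counters are cumulative since boot, so only the difference between two samples says anything about current conditions.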

> Do you also keep tabs on network performance, processes, services, or other metrics?

Per-process and over time, yes, which are useful for post-mortem analysis


m3nu|1 year ago

Those are some great ideas for Prometheus alert rules. If they aren't already added here: https://samber.github.io/awesome-prometheus-alerts/

nikisweeting|1 year ago

IO wait time for disks is a great one too for catching IO load; `glances` and `atop` do a good job of surfacing it when it's an issue.
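If you want the same number without an interactive tool, a minimal sketch is to diff the aggregate `cpu` line of `/proc/stat` between two samples, assuming Linux and the usual field order (user, nice, system, idle, iowait, ...). The function names are illustrative, and the kernel's iowait accounting is approximate, so treat the result as a hint rather than a measurement.

```python
IOWAIT_INDEX = 4  # field order: user, nice, system, idle, iowait, irq, softirq, steal


def parse_cpu_times(stat_text):
    """Return the aggregate 'cpu' line of /proc/stat-style text as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_fraction(before, after):
    """Share of CPU time spent in iowait between two samples."""
    deltas = [new - old for old, new in zip(before, after)]
    # Only the first eight fields are summed: guest time is already
    # counted inside user/nice, so adding it would double count.
    total = sum(deltas[:8])
    return deltas[IOWAIT_INDEX] / total if total > 0 else 0.0
```

Read `/proc/stat` twice a few seconds apart, parse each snapshot with `parse_cpu_times`, and feed both into `iowait_fraction`.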

graycat|1 year ago

With all that, might want some good automatic anomaly detection. While at IBM's Watson lab, I worked out something new, gave an invited talk on the work at the NASDAQ server farm, and published it.

With a lot of monitoring someone might be interested.

jstummbillig|1 year ago

How do you identify a mistuned vm.swappiness?

elashri|1 year ago

I rely on a heuristic approach which is to track the rate of change in key metrics like swap usage, disk I/O, and memory pressure over time. The idea is to calculate these rates at regular intervals and use moving averages to smooth out short-term fluctuations.

By observing trends rather than a static value (a data point at a specific time), you can get a better sense of whether your system is underutilizing or overutilizing swap space. For instance, if swap usage rates are consistently low but memory is under pressure, you might have vm.swappiness set too low. Conversely, if swap I/O is high, it could indicate that swappiness is too high.

This is a poor man’s approach, and there are definitely more sophisticated ways to handle this task, but it’s a quick solution if you just need to get some basic insights without too much work.
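A minimal sketch of that rate-plus-moving-average idea, assuming you already collect `(timestamp, cumulative_counter)` samples from somewhere (for swap on Linux, the `pswpout` counter in `/proc/vmstat` would be one candidate). The function names and the window size are placeholders, not a recommendation.

```python
from collections import deque


def counter_rates(samples):
    """Turn (timestamp_seconds, cumulative_counter) samples into per-second rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 > t0:  # skip duplicate or out-of-order timestamps
            rates.append((c1 - c0) / (t1 - t0))
    return rates


def moving_average(values, window):
    """Smooth values with a trailing moving average over the last `window` points."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed
```

For example, `moving_average(counter_rates(samples), window=6)` with one sample every 10 seconds gives a roughly one-minute smoothed rate. A longer window hides short bursts, so the right size depends on what you are trying to catch.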

gorkemcetin|1 year ago

That is a good list, now just need to prioritize (after finding the ICP).

ozim|1 year ago

Before you start adding all of that, make sure you have customers like the parent poster.

For example I monitor disk space, RAM, CPU and that’s it for external tooling.

If any of that goes above thresholds, someone will log into the server and use Windows or Linux tooling to check what is going on.

I mostly monitor our services' health check endpoints, so HTTP calls to our own services. That tells me if the network is down or if the services' response times are shoddy.

So all in all, not much of the servers themselves.
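A health check probe along those lines could look roughly like this, using only the Python standard library. The `check_health` name, the result fields, and the rule that any status below 400 counts as healthy are all assumptions for illustration; the `opener` parameter exists only so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def check_health(url, timeout=5.0, opener=urllib.request.urlopen):
    """Call a health check endpoint and report status code and response time."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        status = exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout: no answer at all.
        status = None
    elapsed = time.monotonic() - start
    healthy = status is not None and 200 <= status < 400
    return {"healthy": healthy, "status": status, "seconds": elapsed}
```

Something like `check_health("https://example.com/health")` (a hypothetical URL) run on a schedule covers both cases: `status` is `None` when the network or service is down, and `seconds` shows when response times get shoddy.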