kevg123 | 1 year ago
* Network is another basic that should be there
* Average disk service time
* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page out rates
* TCP retransmits as a warning sign of network/hardware issues
* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.) broken down by incoming and outgoing
* Per-CPU utilization
* Rates of operating system warnings and errors in the kernel log
* Application average/max response time
* Application throughput (both total and broken down by the error rate, e.g. HTTP response code >= 400)
* Application thread pool utilization
* Rates of application warnings and errors in the application log
* Application up/down with heartbeat
* Per-application & per-thread CPU utilization
* Periodic on-CPU sampling for a short window, then a flame graph of the samples
* DNS lookup response times/errors
> Do you also keep tabs on network performance, processes, services, or other metrics?
Per-process and over time, yes. Both are useful for post-mortem analysis.
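Two of the items in the list above (TCP connection counts by state, and TCP retransmits) can be read straight out of /proc on Linux. What follows is only a minimal sketch, assuming the usual layouts of /proc/net/tcp and /proc/net/snmp; the function names are made up for illustration, and it covers IPv4 TCP only (no UDP, no /proc/net/tcp6, no incoming/outgoing split).

```python
from collections import Counter

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


def tcp_counters(proc_net_snmp_text):
    """Return the Tcp counters from /proc/net/snmp as a name -> int dict."""
    # The file has two "Tcp:" lines: first the counter names, then the values.
    rows = [line.split()[1:] for line in proc_net_snmp_text.splitlines()
            if line.startswith("Tcp:")]
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


def retransmit_ratio(before, after):
    """Fraction of segments sent between two samples that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0
```

To use it, read each file twice some seconds apart, pass the text to tcp_counters(), and feed both results to retransmit_ratio(). The counters are cumulative since boot, so only the difference between two samples says anything about what is happening now.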
m3nu|1 year ago
nikisweeting|1 year ago
graycat|1 year ago
With a lot of monitoring someone might be interested.
jstummbillig|1 year ago
elashri|1 year ago
By observing trends rather than a single static value (a data point at a specific time), you can get a better sense of whether your system is underusing or overusing swap space. For instance, if swap usage rates are consistently low but memory is under pressure, you might have vm.swappiness set too low. Conversely, if swap I/O is high, it could indicate that swappiness is too high.
This is a poor man’s approach, and there are definitely more sophisticated ways to handle this task, but it’s a quick solution if you just need to get some basic insights without too much work.
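For what it's worth, the trend idea above can be sketched in a few lines. This is an illustration only, assuming a Linux /proc/vmstat that exposes the cumulative pswpin and pswpout counters; the function names and the 5-second default interval are my own choices, not anything standard.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat text (one "name value" pair per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, interval_seconds):
    """Pages swapped in and out per second between two vmstat samples."""
    return {
        "swap_in_per_s": (after["pswpin"] - before["pswpin"]) / interval_seconds,
        "swap_out_per_s": (after["pswpout"] - before["pswpout"]) / interval_seconds,
    }


def sample_swap_rates(interval_seconds=5.0, path="/proc/vmstat"):
    """Take two samples interval_seconds apart and return the swap rates."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_rates(before, after, interval_seconds)
```

Logging the output of sample_swap_rates() every minute or so gives you the trend. What counts as "too high" depends on the workload, so I wouldn't hardcode a threshold before looking at a few days of data.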
gorkemcetin|1 year ago
ozim|1 year ago
For example, I monitor disk space, RAM, and CPU, and that’s it for external tooling.
If any of those goes above its threshold, someone will log into the server and use Windows or Linux tooling to check what is going on.
I mostly monitor service health check endpoints, so HTTP calls to our own services. If the network is down or shoddy, it shows up in the response times of the services.
So all in all, not much of the servers themselves.
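A health check probe like that can be very small. Here is a rough sketch using only the Python standard library; check_health and its opener parameter are hypothetical names for illustration (the opener is only there so the function can be exercised without a real network), and treating any status below 400 as healthy is an assumption.

```python
import time
import urllib.error
import urllib.request


def check_health(url, timeout=5.0, opener=urllib.request.urlopen):
    """Call a health endpoint once; report status code and response time."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, etc.
    elapsed = time.monotonic() - start
    return {
        "url": url,
        "status": status,
        "healthy": status is not None and 200 <= status < 400,
        "seconds": elapsed,
    }
```

Run it on a schedule against each service and alert when healthy is false or seconds creeps above whatever is normal for that endpoint. A slow or unreachable network shows up in the same numbers, which is why this covers a lot without watching the servers directly.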