(no title)
jbernsteiniv | 1 year ago
I once had an internal customer open an on-call event to ask why one of their machines was running so slowly. I told them, "It's because one of the DIMMs has thrown about 30,000 correctable errors within the past month." I was able to correlate that by mapping the EDAC label for the DIMM recorded in /var/log/messages and in some gzipped archives of that log file.
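For anyone wanting to do the same tally, here is a minimal sketch of counting correctable errors per DIMM label across the live log and its gzipped rotations. The regex assumes a kernel line shaped like "EDAC MC0: 1 CE memory read error on <label> (channel:1 slot:0 ...)" and labels without spaces; the exact wording varies by kernel version and EDAC driver, so treat the pattern and the function names as assumptions to adjust, not as what I actually ran.

```python
import gzip
import re
from collections import Counter
from pathlib import Path

# Assumed line shape (varies by kernel and EDAC driver):
#   "... kernel: EDAC MC0: 1 CE memory read error on DIMM_A1 (channel:0 slot:0 ...)"
# Group 1 is the error count on that line, group 2 is the DIMM label.
CE_PATTERN = re.compile(r"EDAC MC\d+: (\d+) CE .*? on (\S+) \(")


def open_log(path):
    """Open a plain or gzip-compressed log file as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", errors="replace")
    return open(path, "rt", errors="replace")


def count_correctable_errors(paths):
    """Return a Counter mapping DIMM label -> total correctable errors."""
    totals = Counter()
    for path in paths:
        with open_log(path) as handle:
            for line in handle:
                match = CE_PATTERN.search(line)
                if match:
                    totals[match.group(2)] += int(match.group(1))
    return totals
```

Passing it something like sorted(Path("/var/log").glob("messages*")) and printing totals.most_common() puts the noisiest DIMM at the top. Reading those files usually requires root.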
Of course, I deal with CPUs, memory, motherboards, GPUs, add-on NICs (OCP or PCIe), storage controllers (mostly HBAs, some RAID controllers), and BMCs, and I also have to evaluate the link width of the PCIe bridges interconnecting all the PCIe devices.
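As a rough illustration of the link-width check, here is a sketch that compares the current_link_width and max_link_width attributes Linux exposes per device under /sys/bus/pci/devices. The function names and the "negotiated narrower than maximum" rule are my assumptions for the example. A narrower link is not always a fault (an x8 card in an x16 slot will also show up), so the output is a list of things to look at, not a verdict.

```python
from pathlib import Path


def read_attr(device_dir, name):
    """Return a sysfs attribute as a stripped string, or None if unreadable."""
    try:
        return (device_dir / name).read_text().strip()
    except OSError:
        return None


def find_narrow_links(sysfs_root="/sys/bus/pci/devices"):
    """List (device, current_width, max_width) where the link trained narrower than its maximum."""
    findings = []
    for device_dir in sorted(Path(sysfs_root).iterdir()):
        current = read_attr(device_dir, "current_link_width")
        maximum = read_attr(device_dir, "max_link_width")
        if not current or not maximum:
            continue  # not a PCIe device, or the attribute is missing
        try:
            current_width, max_width = int(current), int(maximum)
        except ValueError:
            continue  # unexpected attribute contents
        # A width of 0 means the link is down, which is a separate problem.
        if 0 < current_width < max_width:
            findings.append((device_dir.name, current_width, max_width))
    return findings
```

The sysfs_root parameter exists only so the function can be pointed at a test directory; on a real host the default applies.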