

poobear22 | 10 months ago

I managed the system administrators for a high-performance computing center. We took a lot of blame for application failures when, in reality, the cause was often poor programming on the developers' part. I got really tired of taking the blame, so I implemented statistical process control to track the mean time between failures of the jobs. I was really just shining a flashlight on production jobs and hoping it could change the culture. It was not my job to fix their code, and the applications were developed by a different group of people with a very different culture. I thought the process control worked really well, and it did take the heat off my team for random blame: I could respond with "your job is failing XX times per year" and, from there, push for a root cause analysis. But pushing against that culture was really hard, and there was a lot of "set the job to complete and I will look at it on Monday". If they do not want to conduct a root cause analysis on the failure modes of their code, I can't do much. So even implementing some type of monitoring can have little effect if the people who need to fix something do not support the culture. And, as I read your post, I'd think people would be looking at these business metrics a little more closely, or developing more sensitive metrics to catch these issues.
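
For anyone wondering what tracking time between failures with statistical process control could look like, here is a minimal sketch. It is not the setup described above, just one common approach: an individuals (XmR) control chart, where limits are the baseline mean plus or minus 2.66 times the average moving range. The function names, the choice of hours as the unit, and the sample timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean


def time_between_failures(failure_times):
    """Return the gaps, in hours, between consecutive failure timestamps."""
    ordered = sorted(failure_times)
    return [(b - a).total_seconds() / 3600 for a, b in zip(ordered, ordered[1:])]


def xmr_limits(baseline):
    """Compute (lower, center, upper) limits for an individuals (XmR) chart.

    Uses the standard XmR constant 2.66 applied to the mean moving range.
    The lower limit is clamped at 0 because a time gap cannot be negative.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline values")
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    center = mean(baseline)
    spread = 2.66 * mean(moving_ranges)
    return max(0.0, center - spread), center, center + spread


def flag_out_of_control(values, limits):
    """Return the indices of values that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical failure history for one job (illustrative data only).
start = datetime(2024, 1, 1)
history = [start + timedelta(hours=h) for h in (0, 100, 210, 300, 405, 500)]
baseline_gaps = time_between_failures(history)
limits = xmr_limits(baseline_gaps)

# New gaps observed after the baseline period; the short one is a candidate
# for a root cause analysis.
new_gaps = [20.0, 98.0, 150.0]
flagged = flag_out_of_control(new_gaps, limits)
```

A gap below the lower limit means the job failed again much sooner than its own history predicts, which is the kind of signal that can back up a request for a root cause analysis. Computing the limits from a baseline period, rather than from all the data, keeps a new burst of failures from widening the limits and hiding itself.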


chipfixer|10 months ago

Yup, no amount or type of anomaly detection can fix the culture. That said, in this case, maybe one reason it was hard is that the devs weren't the ones who owned what the job did in production?