top | item 46755360


nik282000 | 1 month ago

I work at a plant with a site-wide SCADA/HMI (Siemens WinCC) system; every alarm is displayed on every HMI regardless of its proximity to the machine or even the operator's ability to address the issue. In any given minute a hundred or more alarms can be generated, the majority being nuisance messages like "air pressure almost low" or my favorite, " " (no message set), but scattered among them is the occasional "no cooling water - explosion risk".

This plant is operated and designed to the spec of an international corp with more than 20 factories; it's not a mom-and-pop operation. No one seems to think the excessive, useless alarms are an issue, and any damage caused by missed warnings is considered the fault of the operator. When I approach management and engineering about this, the responses range from "it's not in the budget" to "you're maintenance, fix all the problems and the alarms will go away".

The only way for this kind of issue to be resolved is with regulation and safety standards. An operator can't safely operate equipment when alarms are not filtered or sorted in some way. It's like forcing your IT guy to watch web server access logs live to spot vulnerabilities being exploited.
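The filtering and sorting the comment asks for can be sketched in a few lines. A minimal Python sketch, with made-up severity levels and station tags (the real WinCC tag structure is unknown here):

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2

@dataclass
class Alarm:
    severity: Severity
    station: str      # which HMI station / machine area the alarm belongs to
    message: str

def relevant_alarms(alarms, station, min_severity=Severity.WARNING):
    """Alarms for this station at or above min_severity, worst first."""
    keep = [a for a in alarms
            if a.station == station and a.severity >= min_severity]
    return sorted(keep, key=lambda a: a.severity, reverse=True)

alarms = [
    Alarm(Severity.INFO, "press-3", "air pressure almost low"),
    Alarm(Severity.CRITICAL, "press-3", "no cooling water - explosion risk"),
    Alarm(Severity.WARNING, "oven-1", "door ajar"),
]
for a in relevant_alarms(alarms, "press-3"):
    print(a.severity.name, a.message)
# prints: CRITICAL no cooling water - explosion risk
```

The nuisance "almost low" message and the other station's alarm drop out; only the explosion risk reaches the operator at press-3.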


terminalshort|1 month ago

This is a fundamental organizational and societal problem. An engineer would look at the situation and think "what is the best way to get the failure rate below a tolerable limit?" But a lawyer looks at the situation and thinks "how do I minimize liability and bad PR?" and a bureaucrat thinks "how can I be sure the blame doesn't land on me when something goes wrong?" And the answer to both of those questions is to throw an alarm on absolutely everything. So if there is a problem they can always say "our system detected the anomaly in advance and threw an alarm." Overall the system will be less safe and more expensive, but the lawyer's and bureaucrat's problems are solved. Our society is run by lawyers and bureaucrats, so their approach will win out over the engineer's. (And China's society is run by engineers, so it will win out over ours.)

gopher_space|1 month ago

Up to a certain point society is run by actuaries. Finding someone at your insurance company who both understands the problem with excess errors and appreciates how easily enumerable they are would be an interesting "whistleblowing" target.

pstuart|1 month ago

> This is a fundamental organizational and societal problem

Absolutely, and we'd collectively be better served if we had tools to deal with it.

I think of it as "incentive ecology" -- as noted, everybody has their own incentives, which shape their behavior, which causes downstream issues that begin the process anew.

Obviously there's no simple one-shot solution to this, but what if we had ways to simplify and model this "web of responsibility" (some sort of game theory exposed as an easily consumed presentation, with computed outcomes that show the cost/ROI/risk/reward) that could be shared by all stakeholders?

Obscurity and deniability are the weapons wielded in most of these scenarios, so what if we could render them obsolete?

Sure, those in power would not want to yield their advantages, but the overall outcomes should reward everybody: minimizing risks and maximizing rewards for the enterprise means everybody wins.

Yes, I'm looking at this as an engineer and a dreamer, but if such a tool existed, open source and easily accessible, the work could be done by rogue participants who put the results out there so they're undeniable.

bluGill|1 month ago

Courts do accept alarm fatigue as a real phenomenon, and if there is an injury or death and there were many alarms, you can bet that whichever side benefits will bring in experts to explain the issue.

If there are a lot of issues, the lawyers will also ask why they were not corrected first, using that to establish a pattern of bad maintenance.

renewiltord|1 month ago

Is it though? The engineer can optimize on a different manifold, and the company can succeed or fail for different reasons. Getting destroyed in a lawsuit because you didn't place an alarm is small comfort when you did better engineering.

After all, read any post-mortem thread on HN. Many of those people can be hired as experts if you like. They will say "I would have put an alert on it and had testing." You will lose the case.

"Oh, but we are trying to keep the error rate low." Yes, but now your company is dead while the high-error-rate company is alive.

In revealed preferences, most engineers prefer vendors who have CYA. This is obvious from online comments. It's not because they are engineers; it's because most people want to believe that an event is a freak accident.

Building a system around an error budget is not actually easy, even for engineers who say they want it, because when an error happens they immediately say it should not have happened. The counterfactual errors that were blocked, and the fact that the business still exists, are not considered. Every engineer is a genius in hindsight. Every person is a genius in hindsight.

Why do these geniuses never build a failure-proof company? They don't. Who wouldn't pay the same price for 100% reliable tech?

mmooss|1 month ago

The first step in problem solving is to look in the mirror. It's not surprising that in an engineering community, the instinct is to blame outsiders - lawyers, bureaucrats, managers, finance, etc. - because those priorities are more likely to conflict with engineering, because it is harder to understand such different perspectives, and because it is easier to believe caricatures of people we don't know personally.

Those people have valuable input on issues the engineer may not understand and have little experience with. And engineers are just as likely to take the easy way out, like the caricature in the parent comment:

For example, for the manufacturer's engineering team it's much easier, faster and cheaper to slap an alarm on everything than to learn attention management and to think through and create an attention management system that is effective and reliable (and it had better be reliable - imagine if it omits the wrong alarms!). I think anyone with experience can imagine the decision to not delay the project and increase costs for that involved subproject - one that involves every component team, which is a priority for almost none of them, and which many engineers, such as the mechanical engineer working on the robotic arm, won't even understand the need for.

> And China's society is run by engineers, so it will win out over ours.

History has not been kind to engineers who do non-engineering, such as US President Herbert Hoover, who built dams but also bore significant responsibility for the Great Depression. It's not that engineers can't acquire other skills and do well in those fields, but that other skills are needed - they aren't engineering. Those who accept as truth their natural egocentric bias and their professional community's bias toward engineering are unlikely to learn those skills.

anonymousiam|1 month ago

The criticality of the alerts should be classified, and presented with the alert. Users should have the ability to filter non-critical messages on certain platforms.

Unfortunately, some systems either don't track criticality, or some of the alerts are tagged with the wrong level.

(One example of the latter is the Ruckus WAP, which has a warning message tagged at the highest level of criticality, so about two or three times a month, I see the critical alert: "wmi_unified_mgmt_rx_event_handler-1864 : MGMT frame, ia_action 0x0 ia_catageory 0x3 status 0x0", which should be just an informational level alert, with nothing to be done about it. I've reported this bug to Ruckus a few times over the past five years, but they don't seem to care.)
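When the vendor won't fix a mis-tagged alert, the workaround is usually a local override table that re-tags known offenders before filtering. A minimal sketch, with hypothetical level names (nothing here is Ruckus's actual API):

```python
# Numeric ranking for the illustrative severity levels.
LEVELS = {"info": 0, "warning": 1, "critical": 2}

# Local overrides for alerts the vendor mis-classified, keyed on a
# stable substring of the message text.
OVERRIDES = {
    "wmi_unified_mgmt_rx_event_handler": "info",
}

def effective_level(message, tagged_level):
    """Vendor's level, unless a local override matches the message."""
    for needle, level in OVERRIDES.items():
        if needle in message:
            return level
    return tagged_level

def visible(message, tagged_level, platform_min="warning"):
    """Should this alert be shown on a platform with this threshold?"""
    return LEVELS[effective_level(message, tagged_level)] >= LEVELS[platform_min]

# The noisy Ruckus message is demoted to info and suppressed...
assert not visible("wmi_unified_mgmt_rx_event_handler-1864 : MGMT frame", "critical")
# ...while genuine criticals still get through.
assert visible("no cooling water - explosion risk", "critical")
```

The override table is the part that needs discipline: each entry should record why the vendor's tag is wrong, or it becomes its own source of missed warnings.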

varjag|1 month ago

In reality users will keep everything on default.

miki123211|1 month ago

Useless warnings are a great CYA tactic.

The more of them you have, the more likely it is that there's a warning when something happens. Whether the warning is ever noticed is secondary; what matters is that there was a warning and the operator didn't react to it appropriately, which makes the situation the operator's fault.

cucumber3732842|1 month ago

This is partly a problem with our workplace laws.

In the eyes of regulators and courts, individual low-level employees cannot take responsibility. This is the logic by which they fine the company when someone does something on a step ladder that you shouldn't need to be told not to do.

What this means is that low level employees become liability sinks. Show them all the warnings and make them figure it out. Give them all sorts of conflicting rules and let them sort out which ones to follow. Etc, etc.

varjag|1 month ago

I think it's regulated in places; it has certainly been an HMI concern ever since Three Mile Island. Our customer really grills vendors for generating excessive alarms. Generally, for a system to pass commissioning it has to be all green, and if it starts event-bombing afterwards you're going to be chewed out.

nik282000|1 month ago

I have never seen a piece of new equipment reach an all-green state before, during, or after commissioning. I frequently recommend that we not allow the commissioning team to leave until they can get it to that state, but it has yet to happen.

CamperBob2|1 month ago

> The only way for this kind of issue to be resolved is with regulation and safety standards.

Are you sure that's not what caused the problem in the first place? Unqualified and/or captured regulators who come up with safety standards that are out of touch with how the system needs to work in the real world?

AlotOfReading|1 month ago

Do regulators come up with SCADA safety standards? I would have assumed it was IEC.

Regulators coming up with engineering standards is pretty rare in general. Usually they incorporate existing professional standards from organizations like SAE, IEEE, IEC, or ISO.

lostdog|1 month ago

I wonder if you could calculate a "probability of response to a major alert" and make it the inverse of the total of irrelevant alerts. Then you get to ask: "Our probability of major-alert saliency is only 6%. Why have the providers set it at this level, and what can we do to raise it?"
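One way to make that number concrete, assuming the simplest possible model (saliency as the major alerts' share of all alerts — the functional form is an illustration, not a validated human-factors result):

```python
def alert_saliency(major_alerts, irrelevant_alerts):
    """Crude proxy for the chance a major alert stands out: its share of
    the total alert stream. More nuisance alerts -> lower saliency."""
    total = major_alerts + irrelevant_alerts
    if total == 0:
        return 1.0  # nothing to drown a major alert in
    return major_alerts / total

# e.g. 6 major alerts buried in 94 nuisance alerts per shift
print(f"saliency: {alert_saliency(6, 94):.0%}")  # -> saliency: 6%
```

Even a model this crude turns "too many alarms" into a single auditable figure the providers can be asked to defend.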

bsder|1 month ago

The Three Mile Island disaster had similar problems with notifications.

The problem at TMI was that the teletypewriter delivering the alerts wasn't fast enough to finish printing before new alerts came in. As time went on, the information it was emitting fell further and further behind. Even if the operators wanted to make intelligent decisions, they were operating on hours-old data that no longer applied.