jmsduran | 11 years ago

It seems, especially for major corporations, that on-call/pager duty is quickly becoming the norm for software development teams. I do agree that pager duty is a symptom of a fundamental flaw within the system/architecture. I think it would be in a company's best interest to devote time to improving the reliability and stability of their infrastructure, instead of relying on the band-aid approach that pager duty seems to be.

Regarding #8 though, when you are pressured to resolve a complex issue within a short time window, it can absolutely induce a sense of panic in those who do not handle stress well. I believe the remedy for this would be to have two individuals designated as on-call at a time, assuming the team is large enough.

devicenull|11 years ago

> It seems, especially for major corporations, that on-call/pager duty is quickly becoming the norm for software development teams. I do agree that pager duty is a symptom of a fundamental flaw within the system/architecture. I think it would be in a company's best interest to devote time to improving the reliability and stability of their infrastructure, instead of relying on the band-aid approach that pager duty seems to be.

I can't see there ever being a time where there is no on-call requirement. You always need someone standing by in case of some terrible disaster that cannot be handled automatically. Better to have this as a formal responsibility that never gets used than to not have it and end up with an extended downtime because you can't contact anyone.

That being said, if you're getting paged continuously during on-call, then there's a bigger problem that needs to be resolved.

_delirium|11 years ago

> You always need someone standing by in case of some terrible disaster that cannot be handled automatically.

If it's a really terrible disaster, a once-a-decade kind of thing where everything goes haywire and you need as many staff as possible to get online ASAP, then yes. But aren't we talking more about the kinds of "disasters" that happen once a month or so, and can be handled by a few staff (not waking up the whole team)? To me that sounds more like just staffing for normal operations.

At large engineering companies this is typically handled via literally having someone standing by, i.e. formally on duty, rather than having off-duty employees be on pager duty. There'll be at least a bare-bones staff on the after-hours shift (probably not in all offices, but in some kind of 24/7 operations center), enough of a staff that reasonably foreseeable things can be handled. Of course there are some pros and cons to that from an employee perspective. On the one hand the night shift isn't that pleasant, but on the other hand your responsibilities are at least formally limited to 40 hours/wk; if you're on night shift one week, you don't come in during the day, or carry a pager during the day.

TheSwordsman|11 years ago

This seems like a very naive response. We run on hardware whose lifetime is quantified not by whether it will fail, but by when it will fail. You don't know when that is, or how it will fail. The node could completely go away, or degrade enough that it begins to impact performance.

We also run persistent systems across the WAN. And, unfortunately, some of these things require the state to be maintained.

You can't just design these systems to be "better". There are often things outside of your control.

Based on your response, you seem to be the type of person causing pain for those with a pager.

Also, I'm sure the company that can make the Internet work every time, all the time, will make a killing.

taco_emoji|11 years ago

Pager duty is not a band-aid. It CAN be, for poorly-managed companies, but even the most conscientious and knowledgeable company in the world is going to have unexpected failures.