
So You've Been Paged: A Guide to Incident Response

46 points| kawera | 9 years ago |blog.scalyr.com

58 comments

[+] itsmemattchung|9 years ago|reply
> Pager duty is essentially wage theft.

I disagree: it's a part of the job; I dare say it's an important part of the job ... equally important if you are a developer. You write software and you get paged when it breaks ... how is that wage theft? I find no feedback more effective.

What I do consider frustrating, however, is being responsible for alarms/incidents on which I can take no corrective action; this is how friction grows between teams: dev and ops.

[+] TrevorJ|9 years ago|reply
On a company level in the market there are Service Level Agreements (with requisite adjustments in price) for this sort of thing. It's clear that the market assigns a value to different response times. Further, if there were no value in having an employee on pager duty, the company would not do it. If that value is not reflected when compensating the employee, then from an economic perspective, yes, the employee is losing out.
[+] hiou|9 years ago|reply
My quick guide to pager duty.

Step 1: Find a new job.

There are way too many opportunities out there to subject yourself to this nonsense. Pager duty is essentially wage theft.

[+] kyrra|9 years ago|reply
I think this is because most companies do pager duty wrong. I highly recommend the Google SRE book[0] (notes here[1], chapter 11 covers oncall/pager). One thing mentioned in this book is compensation for being oncall. At Google we get fairly decent pay compensation for holding the pager, enough that it can incentivize people to be on the rotation.

(I'm a software engineer at Google who is oncall at this moment)

[0] http://shop.oreilly.com/product/0636920041528.do

[1] http://danluu.com/google-sre-book/

[+] module0000|9 years ago|reply
I agree that it is wage theft the majority of the time. I had a previous gig that paid a 20% salary bonus for months you were on pager duty. If the pager went off, you got a 10% weekly bonus that increased with each incident up to 40% maximum. People fought like cats and dogs to get on pager duty at that place...I wish more businesses took that approach.
[+] jstanley|9 years ago|reply
Completely agree.

If it's unimportant enough that the company can't afford to pay you a large multiple of your normal salary to work on it out-of-hours, it's unimportant enough that it can wait until you're next in the office.

[+] AndrewKemendo|9 years ago|reply
So who should support emergency issues? Nobody?
[+] Chris2048|9 years ago|reply
This only counts if you aren't highly compensated for the unpredictable hours.
[+] Symbiote|9 years ago|reply
Step 0: Agree, on the condition that any alert received outside working hours becomes a priority task for someone the next day -- whether that means them fixing a bug, adjusting the alert, or investigating and explaining why the cause is extremely unlikely to recur.

This is working well for me, has improved the service for users, and has made our monitoring system much more useful. (Alerts used to be about as accurate as "Main website broken", but are now more like "microservice X is taking >10s to respond".)
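To make the difference between the two alert styles concrete, here is a minimal sketch of an alert that names the slow service and the threshold it crossed. The function name, the service name, and the 10-second default are illustrative assumptions based on the example in the comment above, not anyone's actual monitoring setup.

```python
from typing import Optional

# Assumed threshold, taken from the ">10s" example in the comment above.
SLOW_RESPONSE_SECONDS = 10.0


def alert_message(service: str, response_seconds: float,
                  threshold: float = SLOW_RESPONSE_SECONDS) -> Optional[str]:
    """Return a specific alert string, or None if the response was fast enough.

    `service` is a hypothetical service name supplied by whatever probe
    measured the response time.
    """
    if response_seconds > threshold:
        return (f"{service} is taking >{threshold:g}s to respond "
                f"(last: {response_seconds:.1f}s)")
    return None


if __name__ == "__main__":
    print(alert_message("microservice-x", 12.34))
    print(alert_message("microservice-x", 0.8))
```

An alert like this tells whoever is paged where to look first, which is the improvement the comment describes over a generic "Main website broken".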

[+] Theodores|9 years ago|reply
Last time I was on paid pager duty there was only a slim chance of a 3 a.m. call. Therefore we had a strict rota, as it would have been unfair if someone was getting all of the free cash. This only amounted to £300 a month because everyone wanted to be on the rota; however, I am sure that it paid for my Christmas by the end of the year.

Plus the minor incidents that did happen also turned out to be good team moments, everyone would hear that 'you fixed it'. Okay you did reboot-retry on that file server that wasn't responding and everything was fine five minutes later, but a lot of knowledge went into pressing that reboot button and, as far as manager types are concerned, you saved the day. To their non-technical minds that could be voodoo wizardry so they are pleased, feather in cap given.

I also believe that doing some out of hours emergency support is good for one's own education, you are able to think quickly on your feet as it is a heightened urgent situation at hand. Just being placed in this position helps you gain experience of this type of problem solving. Total focus on the task in hand is easily achievable, one is not thinking about lunch or going home and only half focused.

Once you have that experience then find a new job!

[+] codingdave|9 years ago|reply
> Pager duty is essentially wage theft.

It is only reasonable to hold that stance if it goes both ways. If you consider it wage theft to expect you to work on-call hours, then you also must consider it proper work ethic to always work 40 hours a week, never taking an afternoon to... well, do anything. 40 hours, no more, no less. Don't be late in the morning, either.

If that is how you want to work, there are certainly jobs that offer it.

[+] hsod|9 years ago|reply
I don't see how it's wage theft if you were aware of it when you accepted the job.
[+] CalChris|9 years ago|reply
It ranks up there with unpaid internships (which should be illegal).
[+] secretRubyDev|9 years ago|reply
I work remotely on retainer for a company I worked for when I lived back in Oakland. I work 24/7 pager duty effectively. A lot of times I'll be summoned after working 8-10 hours to just look things up for an important client or confirm sales numbers (which always are correct). I work ~9AM-6PM EST hours but folks at the company generally work 12PM-8PM PST so there's really no good way to plan around those sort of support calls.

It wasn't always this way though. The company used to have other developers but they never replaced them when they left. The business unit I work under switched to a maintenance mode effectively where we're just upgrading existing systems for CVEs, supporting the AWS setup, and dealing with important client requests when they rarely come in.

I'd push for more money but there just isn't the budget for it and they've made that clear to me. I even had to fight to stay full time as they wanted me to work fewer days but still be on call: "we need to figure out the best way to utilize our resources" (paraphrased).

I will say it's detrimental to my health. I wake up most mornings and immediately grab my phone out of fear I missed an alert in the night. That whole bit about utilizing their resources hasn't been sitting well with me though so I'll probably move on after I finish the extra documentation of our systems they're now pushing for.

Just putting a counter point to the people saying, "You signed up for this, you should know what you were doing." It's not always that simple. I have a family now and can't just pick up and change everything at a whim.

I'll also say I get nervous going into movie theaters, etc. where I'll be disconnected for a few hours. It's just not a healthy situation at all.

[+] protomyth|9 years ago|reply
> The good news is this: All issues eventually get resolved (unless you just give up and quit. Please don’t do that.)

Well, that isn't exactly true. The issue might continue based on the way work is scheduled in your project. It's amazing how the business and managers often don't schedule the "stop problem from happening" work when it becomes apparent that the support / devops staff can fix production issues themselves. I've been there, and watching the business and your managers reject the time needed to fix the problem during the next iteration / release / sprint is soul-killing. It really doesn't bring as much business value as these new features after all.

[+] Ghostium|9 years ago|reply
Please increase the contrast!
[+] ransom1538|9 years ago|reply
After a few years in management, when it comes to paging there are two groups:

1) People that want to fix their own code. They usually don't have to be told "you have pager duty" -- they just care. They even become upset if they don't know about the page and will even install their own paging systems without your knowledge.

2) People that can't be bothered. They don't answer the phone, don't answer Slack - instead they just watch a re-run of The Mindy Project and eat ice cream while your production system is crashing.

Generally people in 1) fix things for 2). Wage theft is really from group 2) - because they steal from 1). Each time I have fixed something for group 2) - I either fire them personally on Monday or start an all-out campaign to get them fired. Firing is hard. But letting the entire team down just makes it easy. Ironically, if you want fewer pages just fire more of 2).

[+] jayofdoom|9 years ago|reply
There is a third group -- which I would place myself in. I feel a strong responsibility to an environment I work in; but I also have a need for a disconnect and work/life balance. I absolutely am not ok with people in group #2, but I think group #1 is similarly unhealthy -- being constantly aware of issues can lead rapidly to burnout.

This is why you have to have an on-call rotation, with an SLA (i.e. all pages are ack'd within ten minutes) with enforcement for people who regularly miss pages, and keep the life-disruption to one or two team members. Obviously, anyone who's worked on a large software product knows you might get an escalation even if not on-call, but that's a hugely different workload than being attached to a pager and having to respond to them.
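As a rough illustration of the "ack'd within ten minutes" rule with enforcement, the sketch below counts how many pages each person missed. The record layout (engineer, time paged, time acknowledged) and the function names are assumptions made for the example; a real paging tool would supply its own data.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Tuple

# The SLA from the comment above: every page acknowledged within ten minutes.
ACK_SLA = timedelta(minutes=10)

# Hypothetical page record: (engineer, paged_at, acked_at or None if never acked).
Page = Tuple[str, datetime, Optional[datetime]]


def missed_sla(paged_at: datetime, acked_at: Optional[datetime],
               sla: timedelta = ACK_SLA) -> bool:
    """A page is missed if it was never acked or was acked after the SLA window."""
    return acked_at is None or (acked_at - paged_at) > sla


def missed_counts(pages: Iterable[Page]) -> Dict[str, int]:
    """Count missed pages per engineer; engineers with no misses are omitted."""
    counts: Dict[str, int] = {}
    for engineer, paged_at, acked_at in pages:
        if missed_sla(paged_at, acked_at):
            counts[engineer] = counts.get(engineer, 0) + 1
    return counts


if __name__ == "__main__":
    t0 = datetime(2016, 11, 14, 3, 0)
    pages = [
        ("alice", t0, t0 + timedelta(minutes=4)),   # within SLA
        ("bob", t0, t0 + timedelta(minutes=25)),    # acked late
        ("bob", t0, None),                          # never acked
    ]
    print(missed_counts(pages))
```

What counts as "regularly" missing pages is a policy decision the code leaves open; it only produces the counts that such a policy would look at.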

[+] CodeMage|9 years ago|reply
Every time I feel our industry has evolved beyond the "devs vs. managers" culture, something like this comes up to make me despair again.

I find it absolutely horrifying that you don't stop at the false dichotomy of dividing every dev into those two groups, with no shades of grey in between, but actually go one step further and recommend firing everyone who isn't in the first group.

Yes, people care. Yes, they want to fix their own code. They also want to teach their little kids to ride a bike without training wheels. Your system might have problems every week, but that moment when your kid gives you a big grin and says "look, daddy, I'm doing it alone!", that happens once in that kid's life. So unless your system is controlling air traffic or supplying oxygen to hospital patients or something like that, you might want to consider that there's more to people's life than their work.

[+] Grangar|9 years ago|reply
You can't really expect everyone to be available on-call though. If the people in 1) want to fix their own code, why can't they do it more often?

Where I work we have a separate pager duty group, that anyone can join. Why would you? Because a) you care, and/or b) the compensation is great.

If you don't want to, that's great. No one will hold it against you. We are people after all.

[+] greyboy|9 years ago|reply
I'm not sure what industry this was in, but neither group reflects my experience working in Telecom and HIPAA-related environments.

We couldn't push any fixes without full UAT/validation testing. No "install own paging systems" would be tolerated on national phone/video networks or anything HIPAA related. We would've been walked out the door ASAP.

I think it really depends. We had critical SLA and support levels we had to respond to. If a backup was called and there wasn't a very specific reason the primary was not able to respond, there would be consequences.

I hated not being able to leave or do something spontaneously (especially on a weekend, for example), because we always had to be near. Of course, all that would change if they made it attractively worth my while. But almost nobody does. :)