ekimekim | 11 months ago

When I'm in charge of an on-call rotation I always try to make it very clear that this is not the expectation.

In my preferred model of on-call, you have a primary, then after 5min an escalation to secondary, then after 5min an escalation to something drastic (sometimes "everyone", sometimes a manager).

The expectation is that most of the time you should be able to respond within 5 minutes, but if you can't then that's what the secondary role is for - to catch you. This means it's perfectly acceptable to go for a run, go to a movie, etc.

You relax the responsibility on the individual and let a sensible amount of redundancy solve the problem instead. Everyone is less stressed, and sure you get the occasional 5min delay in response but I'm willing to bet that the overall MTTR is lower since people are well rested and happier to be on call to begin with.
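A minimal sketch of the escalation chain described above, to make the timing concrete. The names `Step`, `POLICY`, `paged_at` and `who_has_been_paged` are hypothetical and not tied to any real paging tool; the 5-minute waits come from the comment, and the last step stands in for "everyone" or a manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    target: str        # who gets paged at this step
    wait_minutes: int  # how long to wait for an ack before escalating further


# Hypothetical policy mirroring the comment: primary, then secondary after
# 5 minutes, then a "drastic" step (everyone or a manager) after 5 more.
POLICY = [
    Step("primary", 5),
    Step("secondary", 5),
    Step("everyone-or-manager", 0),
]


def paged_at(policy):
    """Return (target, minutes after the alert fired) for each step."""
    schedule = []
    elapsed = 0
    for step in policy:
        schedule.append((step.target, elapsed))
        elapsed += step.wait_minutes
    return schedule


def who_has_been_paged(policy, minutes_unacked):
    """Return every target paged so far, assuming nobody has acked yet."""
    return [target for target, at in paged_at(policy) if at <= minutes_unacked]
```

Under these assumptions, an alert that nobody acks reaches the secondary at minute 5 and the drastic step at minute 10, which is the worst-case delay the redundancy is meant to bound.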

Anon1096|11 months ago

We have a primary/backup setup and I would be pretty pissed if my primary just started going out for movies or a date night during their shift tbh. My job as a backup is to be there for unexpected events, i.e. they did not wake up or had an accident. Not to be on call effectively 2 weeks in a row just because the primary doesn't take it seriously.

jobs_throwaway|11 months ago

Yeah, going for a run or a dinner where you might be able to ack but not actually be at keys for 10-20 minutes is one thing. Going to a movie or date where you might not even ack and won't be at keys for hours? Not cool at all.

notnaut|11 months ago

I don’t see how this changes the problem. There is still an expected guarantee of a rapid response, except that now two people are expected to be available, and they need to coordinate directly to ensure one person’s swim doesn’t interfere with the other’s WoW raid.

closeparen|11 months ago

That's more or less what my team does. It works well. At least much better than saying you can't go for a swim at all.

hylaride|11 months ago

This is pretty much how it should be done. If the business demands more, they should have a properly manned 24x7 NOC.

You also need *ownership*. There is nothing worse than having to support somebody else's work and not being allowed (either via time or other restrictions) to do things "right" so that you're not always paged for fixable problems. Everywhere I've worked where the techs had ownership (which ranged from ops people being allowed to override the backlog to fix issues, to developers being given enough free rein to fix technical debt), on-call has barely been an issue. At my current gig I often forget I'm even on call at all, and the main issues that do crop up are usually external.

happymellon|11 months ago

Almost all the reliability issues I encounter are due to constraints imposed by people who don't have to deal with on-call.

Things like running in AWS but having to use a custom K8s install so that they aren't dependent on AWS.

Using self managed Kafka so that you aren't dependent on proprietary tech.

It all sucks because they are always less reliable and generate their own errors and noise for on-calls.

If they had to deal with phone calls every time there's a firewall issue that had absolutely nothing to do with the application, they would soon change their tune.

WhyIsItAlwaysHN|11 months ago

So it takes 10 min until you've gone to the drastic solution? With this time-frame it would be risky to go to the bathroom, let alone to a movie. Also even the backup sounds like a primary in this scenario.

ekimekim|11 months ago

Sure, but the assumption here is that primary and backup aren't going to the bathroom at the same time (edit: probably, i.e. they're not coordinating this). It's also based on the idea that alerts are extremely rare to begin with. If you're expecting at least one page every rotation, that's way, way too often. Step one is to get alerts under control, step two is a sane on-call rotation.

andrewaylett|11 months ago

We want to ack within five minutes, and be at a laptop within 30. So long as I'm within mobile signal when the page goes off, it doesn't really matter what I'm doing — an ack is a button press on a push notification. And I can stay within 30 minutes of my laptop and an Internet connection by carrying said laptop and my phone (with "unlimited" data).

If the primary (paid) on-call doesn't catch the notification, the secondary (unpaid) will be paged. And so on, down a couple more steps, to a senior manager. There's no expectation that anyone other than the primary would actually be available to ack the alert.

smitelli|11 months ago

Having the primary/secondary rotation is arguably worse. In that model, from the perspective of any one participant, now they're on-call for two weeks each time around instead of one.

inetknght|11 months ago

> The expectation is that most of the time you should be able to respond within 5 minutes

That's an unreasonable expectation unless it's clearly said in writing and is billable hours.