top | item 18587110

pommers | 7 years ago

My current team's on-call is pretty taxing. Two weeks on, two weeks off (only myself and my tech lead on the team at the moment), but our alerting is pretty good.

The places it falls down are where we interface with other teams who aren't on call for their systems, and for whom a weekend-long outage is "acceptable".

wikibob|7 years ago

This is not sustainable. You will burn out in the long run, and it could take an extended period of time to recover. You are risking your health.

I suggest you look at the on-call chapters in the SRE book, SRE Workbook, and Seeking SRE.

The solution is primarily to include the development team in the on-call rotation (you build it - you run it). This can be very hard to do politically.

michaelt|7 years ago

  The solution is primarily to include the
  development team in the on-call rotation
...and to have a development team that, at any given time, has 4-8 people experienced enough to support every system that team works on.

pommers|7 years ago

Fully aware that it is unsustainable. Being a small team, we chose to maximize the time between on call stints. We will revisit the decision early next year.

We're hiring people with on-call as an explicit part of the position they are taking.

As for the other teams, we're working on the politics to get them to support their systems, and looking at alternatives to using them if they don't.