malwrar | 1 month ago

I’ve been toying around with the idea of using chaos engineering as a method of training new on-call folks. My first ever on-call shift was during a major product launch at a FAANG, and I more or less just hoped that I’d be able to handle whatever broke. I got lucky and it turned out that I can usually fix things when they break, but I’ve also found that throwing people in like that doesn’t work out consistently. I wonder if controlled, limited outages (maybe even as a surprise) would be a less hellish way of doing it. It could be a good way to build instinct under pressure without risking too much.

AtlasBarfed|1 month ago

This sounds perilously close to hazing.

malwrar|1 month ago

Can you expand on that?

Currently we do shadow shifts for a month or two first, but we still eventually drop people into the deep end with whatever experience production gifts them in that time. That experience is almost certainly going to be a subset of the types of issues we see in a year, and the quantity isn’t predictable. Even if the new person drives the recovery during a shadow shift, the experienced on-call is still available for support & assurance, so it isn’t the same as handling it alone. I don’t otherwise have a good solution for getting folks familiar with actually solving real-world problems with our systems, by themselves, under severe time pressure, and I was thinking controlled chaos could help bridge the gap.