raman325 | 4 years ago
I wrote the operating procedures for a backhaul upgrade in preparation for the launch of LTE. The upgrade was being performed on our primary routers, which all traffic traversed. The operation had already been performed successfully several times, but after the most recent run I was told that a couple of the lines in the procedure were wrong (they had extra arguments). I did a find and replace all, shot the document off to the next team, and went about my day.
The technician doing the upgrade called me that afternoon (it was a Sunday) and asked whether any of the initial steps were disruptive, as he wanted to get started on the prep work ahead of the maintenance window. I said no and he proceeded. Mind you, this was also during the World Cup, so I was out at a bar with my friends and not paying attention to my phone. About two hours later I looked at my phone and saw multiple missed calls and voicemails from him, so I immediately ran to the MTSO. It turned out my find and replace had changed some commands I never meant to touch: instead of adding the new VLANs to in-use interfaces, the procedure replaced all of the existing VLANs with the new ones, which weren't carrying any traffic.
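To illustrate how a blanket find and replace can turn "add" into "replace", here is a minimal sketch. It is a toy model under stated assumptions: the command syntax (`allowed-vlans`, `vlan-name`), the VLAN numbers, and the exact replacement string are all hypothetical, since the post does not give the real commands or the edit that was actually run.

```python
# Toy model of the failure described above. The command syntax ("allowed-vlans",
# "vlan-name") and the VLAN numbers are hypothetical; the original post does not
# give the real commands or the exact find-and-replace that was run.

def apply_allowed_vlans(current, command):
    """Return the interface's VLAN set after running a toy 'allowed-vlans' command.

    'allowed-vlans add 300' extends the existing set;
    'allowed-vlans 300' replaces it outright.
    """
    tokens = command.split()
    if tokens[1] == "add":
        return current | {int(v) for v in tokens[2].split(",")}
    return {int(v) for v in tokens[1].split(",")}


def replace_all_with_report(lines, old, new):
    """Blanket find-and-replace that also reports every line it altered."""
    updated = [line.replace(old, new) for line in lines]
    changed = [(n, before, after)
               for n, (before, after) in enumerate(zip(lines, updated), start=1)
               if before != after]
    return updated, changed


procedure = [
    "vlan-name add backhaul-lte",  # hypothetical line with an extra argument ("add")
    "allowed-vlans add 300",       # correct line: it must keep "add"
]

# Intended fix: drop the stray "add" from the first line only.
updated, changed = replace_all_with_report(procedure, " add ", " ")

in_use = {10, 20}  # VLANs already carrying traffic (made-up numbers)
before = apply_allowed_vlans(in_use, procedure[1])  # extends the in-use set
after = apply_allowed_vlans(in_use, updated[1])     # keeps only the new VLAN

for n, old_line, new_line in changed:
    print(f"line {n}: {old_line!r} -> {new_line!r}")
```

In this sketch the report lists two changed lines even though only one was meant to change, which is the kind of discrepancy a review step could catch before the document goes to the next team.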
The fix was simple: we had failover routers, and at some point after he couldn't reach me, the technician contacted one of my peers, who quickly told him to fail over. He should have known to do that, but he was also panicking. I called my boss, told him what had happened, and said that it was my fault. I genuinely believed it was: I wrote the bad procedure AND I approved activities outside of the maintenance window. He told me to be at the office with the technician first thing the next morning.
I couldn't sleep that night. This was early in my career and I thought it was over. But that meeting ended up being one of the most transformative moments of my career. My boss's boss, a director, quoted the FAA's ethos that a good system should have checks and balances, and that if a mistake happens, the system is at fault, not any individual. We walked through the play-by-play and identified multiple opportunities both to avoid the issue in the first place and to resolve it more quickly. We came up with an action plan to make our "system" more mistake-proof for next time, and that was that.
That meeting has always stuck with me, and I remind myself of that conversation whenever things at work don't go as planned. There is almost always something in the system that can be improved to account for human error, and human error should be expected.