greenleafjacob | 6 years ago

If rollbacks are not safe then you have a change management problem.

If you have a good CM system, you should have a timeline of changes that you can correlate against incidents. Most incidents are caused by changes, so you can narrow down most incidents to a handful of changes.

Then the question is, if you have a handful of changes that you could roll back, and rollbacks are risk free, then does it make sense to delay rolling back any particular change until the root cause is understood?
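As a rough illustration of correlating a change timeline against an incident, here is a minimal Python sketch. The `Change` record, the `rollback_candidates` helper, the change IDs, and the 24-hour lookback window are all hypothetical, not taken from any particular CM system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    change_id: str          # hypothetical identifier from the CM system
    deployed_at: datetime   # when the change went live


def rollback_candidates(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes deployed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.deployed_at <= incident_start]
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)


# Made-up timeline, purely for illustration.
changes = [
    Change("cfg-101", datetime(2019, 7, 10, 9, 0)),
    Change("bin-202", datetime(2019, 7, 12, 14, 30)),
    Change("cfg-303", datetime(2019, 7, 12, 16, 45)),
]
incident_start = datetime(2019, 7, 12, 17, 0)

candidates = rollback_candidates(changes, incident_start)
print([c.change_id for c in candidates])
```

Listing the newest change first is only one possible ordering; the point is that the timeline narrows the incident down to a handful of changes to consider rolling back.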

Silhouette|6 years ago

It's not always as simple as that. What if something in a change didn't behave as specified and wound up writing important data in an incorrect but retrievable format? The rolled-back version might not recognise that data properly, and could end up either modifying it further, so that the true data could no longer be retrieved, or causing data loss elsewhere as a consequence.

greenleafjacob|6 years ago

In that case you would probably still roll back to prevent further data corruption and restore the corrupted records from backups.

There are certainly changes that cannot be rolled back in a way that magically fixes the affected users, and I am not suggesting otherwise. In the context of mission-critical systems, mitigation is usually strongly preferred. For example, the Google SRE book says the following:

> Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct!

> Instead, your course of action should be to make the system work as well as it can under the circumstances. This may entail emergency options, such as diverting traffic from a broken cluster to others that are still working, dropping traffic wholesale to prevent a cascading failure, or disabling subsystems to lighten the load. Stopping the bleeding should be your first priority; you aren’t helping your users if the system dies while you’re root-causing. [...] The highest priority is to resolve the issue at hand quickly.

I have seen too many incidents (one in the last 2 days, in fact) that were prolonged because people dismissed the idea of blindly rolling back changes, merely because they did not think those changes were the root cause.

jdhendrickson|6 years ago

I think the 80 / 20 rule applies here.

sb8244|6 years ago

Because people make mistakes. Mistakes get fixed in post mortems, retros, best practices, etc. But mistakes will still happen.