top | item 31830663

mikewang | 3 years ago

I read the blog post twice and have some thoughts. The root cause seems to be: "While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes."

And a dry-run: "a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure."

And a peer review: "Before it was allowed to go out, it was also peer reviewed by multiple engineers."

I wouldn't doubt the expertise of the Cloudflare engineers who reviewed the change. And there was a dry run.

But is it really OK to apply a change to a spine network that would affect 50% of network traffic, with only a peer review and a dry run? No blue/green deployment, no gray release; maybe those are not appropriate for a small change like this. But this "small" change had a big effect. I think it would have been worth it.

And from my limited experience, a dry run never changes anything in the environment. It is a dry run, after all.

And in the end the three offending lines were found. So I wonder: how did this re-ordering happen, and why?

For tiny changes like these, there should be some mechanism to verify their correctness beyond review and a dry run.
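
To make that concrete, here is a rough sketch of the kind of check I mean. It is not Cloudflare's tooling; the policy model (an ordered list of terms where the first match wins and the default is reject), the function names, and the prefixes are all my own assumptions for illustration. The idea is to evaluate both the old and the new policy against a list of prefixes that must stay advertised, and fail the change if any of them would be withdrawn.

```python
import ipaddress

# Hypothetical policy model: an ordered list of (match_network, action) terms.
# The first term whose network contains the prefix decides; default is reject.
def evaluate(policy, prefix):
    net = ipaddress.ip_network(prefix)
    for match, action in policy:
        if net.subnet_of(ipaddress.ip_network(match)):
            return action
    return "reject"

def withdrawn_prefixes(old_policy, new_policy, critical):
    """Return critical prefixes accepted by old_policy but not by new_policy."""
    return [
        p for p in critical
        if evaluate(old_policy, p) == "accept"
        and evaluate(new_policy, p) != "accept"
    ]

# Made-up documentation prefixes (RFC 5737), not real production prefixes.
critical = ["192.0.2.0/24", "198.51.100.0/24"]
old = [
    ("192.0.2.0/24", "accept"),
    ("198.51.100.0/24", "accept"),
    ("0.0.0.0/0", "reject"),
]
# Same terms, but the catch-all reject now sits above one accept term.
new = [
    ("192.0.2.0/24", "accept"),
    ("0.0.0.0/0", "reject"),
    ("198.51.100.0/24", "accept"),
]

print(withdrawn_prefixes(old, new, critical))
```

The diff between `old` and `new` looks harmless because no term is added or removed, but the check still reports 198.51.100.0/24 as withdrawn, which is exactly the kind of thing a reviewer can miss.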

matsur|3 years ago

We use a phased rollout process for all routine changes (like this one). Once a change has passed peer review and the "dry-run", changes are rolled out to progressively larger slices of our production environment, with monitoring systems and engineers watching for adverse effects.

The specific network locations that were impacted by this change were amongst the last to see the change rolled out. One deficiency in our deployment strategy (which we will correct) is that no network locations in the affected "MCP" configuration received the change early in our rollout process. If that had been the case, we would have found the problem much earlier and the incident's impact would have been much reduced.
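
As an illustration of that gap only (this is not our deployment tooling; the site names, the configuration labels other than "MCP", and the function are hypothetical), a rollout plan could be checked so that every configuration type is represented in an early phase:

```python
def configs_missing_from_early_phases(phases, site_config, early=1):
    """Return config types that no site in the first `early` phases covers."""
    all_configs = set(site_config.values())
    covered = {site_config[site] for phase in phases[:early] for site in phase}
    return sorted(all_configs - covered)

# Hypothetical sites and their configuration type.
site_config = {
    "site-a": "standard",
    "site-b": "standard",
    "site-c": "MCP",
    "site-d": "MCP",
}

# MCP sites only appear in the last phase: the problem goes unnoticed early on.
late_mcp_plan = [["site-a"], ["site-b"], ["site-c", "site-d"]]
# One MCP site is included in the first phase.
early_mcp_plan = [["site-a", "site-c"], ["site-b", "site-d"]]

print(configs_missing_from_early_phases(late_mcp_plan, site_config))
print(configs_missing_from_early_phases(early_mcp_plan, site_config))
```

A plan like `late_mcp_plan` would be flagged because "MCP" is missing from the first phase, while `early_mcp_plan` passes.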