

kemals | 3 years ago

This was a rather interesting event. In general, changing an IP address (even the loopback address) shouldn't by itself have caused an outage like this from the BGP perspective. For example, if you change the IP address of a BGP-enabled router that has multiple BGP sessions, all the other routers tear down their sessions to it and withdraw the prefixes learned from it. BGP reconvergence takes time, but less than this event took (90+ minutes, then a few more hours until __full__ recovery).

This looks like one of those events where they changed the IP on Route Reflector routers that were pretty busy, which would cause reconvergence and CPU spikes on every router that had sessions with them. There was also a lot of volatility, with re-advertisements happening continuously. They also attempted a rollback, which performed the reverse operation and triggered yet another reconvergence. The other possible scenario is that the change was made on an SDN controller, which then affected all the other routers.
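
To give a rough feel for why a reset on a busy route reflector is expensive, here is a toy sketch in Python. It is not real BGP and says nothing about what Microsoft's network actually looks like: the `Client` class, the `establish` and `reset_sessions` helpers, and the prefixes are all made up for illustration. It only counts how many routes get flushed and re-sent when every session to a single reflector drops and comes back.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Client:
    """Hypothetical route-reflector client (illustrative only, not a real BGP speaker)."""
    name: str
    originated: set[str]                             # prefixes this router originates
    learned: set[str] = field(default_factory=set)   # prefixes learned via the reflector


def establish(clients: list[Client]) -> int:
    """Bring sessions up: the reflector sends each client every other client's prefixes.

    Returns the number of prefix advertisements sent.
    """
    sent = 0
    for c in clients:
        for other in clients:
            if other is not c:
                new = other.originated - c.learned
                c.learned |= new
                sent += len(new)
    return sent


def reset_sessions(clients: list[Client]) -> int:
    """Reflector address changes: every client drops its session and flushes learned routes.

    Returns the number of routes removed across all clients.
    """
    removed = 0
    for c in clients:
        removed += len(c.learned)
        c.learned.clear()
    return removed


if __name__ == "__main__":
    routers = [
        Client("r1", {"10.0.1.0/24", "10.0.2.0/24"}),
        Client("r2", {"10.0.3.0/24", "10.0.4.0/24"}),
        Client("r3", {"10.0.5.0/24", "10.0.6.0/24"}),
    ]
    print("initial advertisements:", establish(routers))
    # An address change followed by a rollback means two full reset/re-learn cycles.
    for step in ("change", "rollback"):
        print(step, "flushed:", reset_sessions(routers), "re-sent:", establish(routers))
```

In this toy model the churn grows roughly with (number of clients) x (prefixes learned per client), and a rollback repeats the whole cycle, which is consistent with the point above about the rollback triggering another reconvergence.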

More details: https://www.thousandeyes.com/blog/microsoft-outage-analysis-... https://www.thousandeyes.com/resources/na-microsoft-outage-a...
