top | item 34670085


pifm_guy | 3 years ago

Obviously you do everything possible to stop an outage like this happening...

But when it inevitably does, you should be prepared for a full, simultaneous restart of the whole system, i.e. one where no 'bad' signals or data from the old system can affect the new one.

That is the sort of thing you should practice in the staging environment from time to time, just for when it might be needed. It could have taken this outage from many hours down to just many minutes.
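To make the "nothing from the old system can affect the new" part concrete, here is a toy sketch in Python. It is not how any particular company does it. The `Service` class, the `generation` counter and `full_restart` are all made-up names for illustration. The idea is just: stop everything before starting anything, and have every component reject messages stamped with an older generation.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical stand-in for one component of the system."""
    name: str
    generation: int = 0
    running: bool = False
    inbox: list = field(default_factory=list)

    def stop(self):
        self.running = False
        self.inbox.clear()  # drop queued state left over from the old generation

    def start(self, generation):
        self.generation = generation
        self.running = True

    def receive(self, message, sender_generation):
        # Refuse anything not stamped with the current generation,
        # so stragglers from the old system cannot poison the new one.
        if not self.running or sender_generation != self.generation:
            return False
        self.inbox.append(message)
        return True


def full_restart(services, new_generation):
    # Phase 1: stop every service before starting any of them.
    for service in services:
        service.stop()
    if any(service.running for service in services):
        raise RuntimeError("a service is still running; refusing to start")
    # Phase 2: bring everything back up under the new generation.
    for service in services:
        service.start(new_generation)


services = [Service("api"), Service("queue"), Service("worker")]
full_restart(services, new_generation=1)
```

In a real system the "generation" would have to live somewhere every component can see it (config, a lock service, a header on every message), and that is exactly the kind of thing you only find out is missing when you rehearse the restart in staging.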


pifm_guy|3 years ago

You should also design all your code to be rollbackable... But for the very rare case where a rollback won't solve the problem (e.g. an outage caused by changes outside your organisation's control), you also need to be able to do a rapid code change, recompile and push. Many companies can't do this because, for example, their release process involves multiple days' worth of interlocked manual steps.

Don't get yourself in that position.