top | item 38505946

exitheone | 2 years ago

That's still ridiculously slow. I'd expect them to have hundreds of microservices. Each of those should be able to handle a random restart at any point in time, so they should absolutely be able to restart hundreds of servers concurrently without major disruptions. Hell, at Facebook scale, a whole datacenter going down should not cause service disruptions.

Closi|2 years ago

This does assume that nothing is getting broken along the way.

Taking 45 days is probably more about caution and resolving issues systematically than about pushing a big button and hoping you don’t cause issues.

I’d expect them to have thousands of microservices - and you only have to find a way to break one to cause big issues.

exitheone|2 years ago

At Facebook scale, random crashes should be exercised regularly regardless. Not being resilient to that would be very unprofessional.