item 40014566

lucacasonato | 1 year ago

The SLO (service level objective) is 100% uptime. That means we have no error budget to spend on scheduled maintenance or anything of that sort, so we cannot put software in this hot path that would ever require scheduled maintenance (e.g. a relational database that needs periodic downtime for major version upgrades). We additionally minimize risk here: no code written by us sits in the path that targets 100% uptime. I.e. if it breaks, it's due to an upstream failure within Google's web serving infrastructure.
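
A rough sketch of the error-budget arithmetic, assuming a 365-day SLO window (the helpers `error_budget` and `maintenance_fits` are hypothetical names for illustration, not from any real tool). The budget is the window multiplied by (1 - SLO), so a 100% objective leaves exactly zero time for any planned outage:

```python
from datetime import timedelta

# Length of the SLO window; a 365-day year is an assumption for illustration.
YEAR = timedelta(days=365)

def error_budget(slo: float, window: timedelta = YEAR) -> timedelta:
    """Downtime an availability SLO (0 < slo <= 1) permits over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return window * (1.0 - slo)

def maintenance_fits(slo: float, planned_downtime: timedelta) -> bool:
    """True if a planned outage can be paid for out of the error budget."""
    return planned_downtime <= error_budget(slo)

# A 5-minute maintenance window against a 100% objective.
print(error_budget(1.0))
print(maintenance_fits(1.0, timedelta(minutes=5)))
```

With a 99.99% objective the same five-minute window would fit, since the yearly budget there is roughly 52.6 minutes.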

If we were to provide an SLA (an agreement stating the minimum level of service guaranteed to a customer) for this service, it would not be 100%; it would be 99.99%, to limit our risk. But we can still hold ourselves to a higher internal target than the SLA we provide.
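
To make the gap between the two numbers concrete, here is a small sketch, assuming a 365-day year and using hypothetical helper names, that checks the internal SLO is never looser than the published SLA and reports how much yearly downtime headroom separates them:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; ignores leap years

def allowed_downtime_s(target: float) -> float:
    """Seconds of downtime per year permitted by an availability target."""
    return (1.0 - target) * SECONDS_PER_YEAR

def check_targets(slo: float, sla: float) -> float:
    """Ensure the internal SLO is at least as strict as the customer SLA,
    and return the headroom between them in seconds per year."""
    if slo < sla:
        raise ValueError("internal SLO must not be looser than the published SLA")
    return allowed_downtime_s(sla) - allowed_downtime_s(slo)

# Internal 100% objective vs. a published 99.99% agreement.
print(f"{check_targets(1.0, 0.9999) / 60:.1f} minutes/year of headroom")
```

For a 100% SLO against a 99.99% SLA the headroom is about 3,154 seconds, i.e. roughly 52.6 minutes a year that the SLA would tolerate but the internal target does not.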

If every change has to be made without even 8 seconds of downtime a year (i.e. with 0 seconds of downtime), that significantly changes how you design a system and roll out changes.
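
For scale, 8 seconds of downtime over a 365-day year works out to roughly 99.99997% availability, or about 6.6 "nines". A short sketch of that conversion (the function names are illustrative only, and the 365-day year is an assumption):

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; ignores leap years

def availability_from_downtime(downtime_s: float) -> float:
    """Fraction of the year the service is up, given total downtime in seconds."""
    return 1.0 - downtime_s / SECONDS_PER_YEAR

def nines(downtime_s: float) -> float:
    """Number of 'nines' of availability; zero downtime is unbounded."""
    if downtime_s == 0:
        return math.inf
    return -math.log10(downtime_s / SECONDS_PER_YEAR)

for d in (8, 0):
    print(f"{d} s/year -> {availability_from_downtime(d) * 100:.5f}% "
          f"({nines(d):.2f} nines)")
```

The zero-downtime case comes out as infinitely many nines, which is another way of saying there is no budget left over for a risky rollout.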

TLDR: SLA != SLO

lionkor | 1 year ago

Hi, that makes sense, thank you. I didn't realize this was meant in terms of "we have to choose technologies that never, ever need maintenance"; that would have been a better way to put it. Thanks :)