I’ve written before on HN about when my employer hired several ex-FAANG people to manage all things cloud in our company.
Whenever there was an outage they would put up a fight against anyone wanting to update the status page to show the outage. They had so many excuses and reasons not to.
Eventually we figured out that they were planning to use the uptime figures for requesting raises and promos as they did at their FAANG employer, so anything that reduced that uptime number was to be avoided at all costs.
It's because if you automate status page updates, something could (and eventually would) happen to the little script that defines "uptime," and if that script goes down, you're suddenly in violation of your SLA and all of your customers start demanding refunds or credits while everything is actually running fine.
Or let's say your load balancer croaks, triggering a "down" status, but it's 3am, so a single server is handling traffic just fine. In short, defining "down" in an automated way just exposes internal tooling unnecessarily and generates more false positives than false negatives.
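To make the false-positive point concrete, here is a rough Python sketch of one way an automated check could avoid flagging the load balancer case: it judges "down" only from user-facing probe results, not from any single component's status. The class name, the window size, and the failure threshold are all made up for illustration; nothing here is from a real status page tool.

```python
from collections import deque


class OutageDetector:
    """Hypothetical detector: 'down' means user-facing probes keep failing."""

    def __init__(self, window: int = 5, max_failures: int = 3) -> None:
        # Keep only the most recent `window` probe results.
        self.results = deque(maxlen=window)
        self.max_failures = max_failures

    def record(self, probe_ok: bool) -> None:
        # probe_ok is True when an end-to-end request succeeded,
        # regardless of which internal component served it.
        self.results.append(probe_ok)

    def is_down(self) -> bool:
        # Count failed probes in the current window.
        failures = sum(1 for ok in self.results if not ok)
        return failures >= self.max_failures


# Load balancer is dead, but one server still answers every request:
detector = OutageDetector()
for _ in range(5):
    detector.record(True)
lb_case_down = detector.is_down()

# Requests actually start failing for users:
for _ in range(3):
    detector.record(False)
real_outage_down = detector.is_down()
```

Even a check like this still depends on the probe itself staying healthy, which is the original worry about the "little script."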
Lastly, if you are allowed 45 minutes of downtime per year and it takes you an hour to manually update the status page, you just bought yourself an extra hour to figure out how to fix the problem before you have to start issuing refunds/credits.
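The arithmetic behind that can be sketched in a few lines of Python. This assumes, purely for illustration, that the SLA clock only starts once the status page is updated; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(allowed_downtime_min: float) -> float:
    """Fraction of the year the service must be up for a given downtime budget."""
    return 1 - allowed_downtime_min / MINUTES_PER_YEAR


def reported_downtime(actual_min: float, reporting_delay_min: float) -> float:
    """Downtime that counts against the SLA if the clock starts at the status update.

    Assumption: minutes before the status page is updated are never counted.
    """
    return max(0.0, actual_min - reporting_delay_min)


budget = availability(45)              # roughly 0.99991, i.e. about 99.991% uptime
counted = reported_downtime(90, 60)    # a 90 minute outage counts as 30 minutes
hidden = reported_downtime(40, 60)     # a 40 minute outage never shows up at all
```

Under that assumption, a 45 minute annual budget survives a 90 minute outage, which is exactly the incentive being described.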
markild|3 months ago
I don't like it.
Aurornis|3 months ago
mvkel|3 months ago
skywhopper|3 months ago
bnjm|3 months ago
mrgoldenbrown|3 months ago
agos|3 months ago
webdoodle|3 months ago