anomaloustho | 3 months ago

It’s already been said, but most companies already have those instant “alarms” that go off within minutes. 80% of the time, those alarms are red herrings that get triaged. At a lot of companies, they go off constantly.

As a company, you don’t want to declare an outage readily and you definitely don’t want it to be declared frequently. Declaring an outage frequently means:

• Telling your exec team that your department is not running well
• Negative signal to your investors
• Bad reputation with your customers
• Admitting culpability to your customers and partners (inviting lawsuits and refunds)
• Telling your engineering leadership team that your specific team isn’t running well
• Messing up your quarterly goals, bonuses etcetera for outages that aren’t real

So every social and incentive structure along the way basically signals that you don’t want to declare an outage when it isn’t real. You want to make sure you get it right. Therefore, you don’t just want to flip a status page because a few API calls had a timeout.

FinnKuhn | 3 months ago

>So every social and incentive structure along the way basically signals that you don’t want to declare an outage when it isn’t real.

I would argue that every social and incentive structure along the way basically signals that you don't want to declare an outage, even when it is real. You should still do it, though, or the status page becomes meaningless.

Great example of Goodhart's law.

gwbas1c | 3 months ago

Just wanted to chime in that, at my company, we have some policies that impact when we actually update our status page to show that we have an outage. Without going into details, the policies deliberately slow down our reporting of downtime: We (engineering) need to have a clear understanding of what the problem is before we say there is a problem publicly.

I've personally challenged some details in these policies, which I won't discuss publicly. What I generally agree with is that it's important to have a human in the loop, and to be very thoughtful about when to update a status page and what is put there.