fairity | 4 months ago

As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?

greybeard69|4 months ago

To my understanding the main problem is DynamoDB being down, and DynamoDB is what a lot of AWS services use for their eventing systems behind the scenes. So there's probably like 500 billion unprocessed events that'll need to get processed even when they get everything back online. It's gonna be a long one.

jewba|4 months ago

500 billion events. It always blows my mind how many people use AWS.

froobius|4 months ago

Yes, with no prior knowledge the mathematically correct estimate is:

time left = time so far

But as you note, prior knowledge will enable a better guess.

matsemann|4 months ago

Yeah, the Copernican Principle.

> I visited the Berlin Wall. People at the time wondered how long the Wall might last. Was it a temporary aberration, or a permanent fixture of modern Europe? Standing at the Wall in 1969, I made the following argument, using the Copernican principle. I said, Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here. My visit is random in time. So if I divide the Wall’s total history, from the beginning to the end, into four quarters, and I’m located randomly somewhere in there, there’s a fifty-percent chance that I’m in the middle two quarters—that means, not in the first quarter and not in the fourth quarter.

> Let’s suppose that I’m at the beginning of that middle fifty percent. In that case, one-quarter of the Wall’s ultimate history has passed, and there are three-quarters left in the future. In that case, the future’s three times as long as the past. On the other hand, if I’m at the other end, then three-quarters have happened already, and there’s one-quarter left in the future. In that case, the future is one-third as long as the past.

https://www.newyorker.com/magazine/1999/07/12/how-to-predict...
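
A minimal sketch of the argument quoted above, assuming only that the moment of observation falls at a uniformly random fraction of the total duration. The function name `gott_interval` and its parameters are made up for illustration. With confidence c, the elapsed fraction lies between (1 - c)/2 and (1 + c)/2, so the remaining time lies between elapsed * (1 - c)/(1 + c) and elapsed * (1 + c)/(1 - c).

```python
def gott_interval(elapsed, confidence=0.5):
    """Return (low, high) bounds on the remaining duration.

    Assumption: the observation happens at a uniformly random fraction of
    the total lifetime, as in the quoted Berlin Wall argument.
    """
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Ratio of future to past at the two edges of the central interval.
    low_ratio = (1 - confidence) / (1 + confidence)
    high_ratio = (1 + confidence) / (1 - confidence)
    return elapsed * low_ratio, elapsed * high_ratio


if __name__ == "__main__":
    # Hypothetical example: an outage that has lasted 3 hours so far.
    low, high = gott_interval(3.0, confidence=0.5)
    print(f"50% interval for time left: {low:.2f} to {high:.2f} hours")
```

At 50% confidence this reproduces the "one-third as long" to "three times as long" bounds from the quote. As the confidence shrinks toward zero, both bounds collapse onto "time left = time so far", which is the median guess.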

tsimionescu|4 months ago

Note that this is equivalent to saying "there's no way to know". This guess doesn't give any insight; it's just the function that happens to minimize the total expected error for an unknowable duration.

Edit: I should add that, more specifically, this is a property of the uniform distribution: it applies to any event for which EndsAfter(t) is uniformly distributed over all t > 0.

movpasd|4 months ago

I used Claude to get the outage start and ends from the post-event summaries for major historical AWS outages: https://aws.amazon.com/premiumsupport/technology/pes/

The cumulative distribution actually ends up looking pretty exponential, which (I think) means that if you estimate the time left in the outage as the mean remaining duration of all outages that lasted longer than the current one, you end up with a roughly flat value of around 8 hours, if I've done my maths right.

Not a statistician so I'm sure I've committed some statistical crimes there!

Unfortunately I can't find an easy way to upload images of the charts I've made right now, but you can tinker with my data:

    cause,outage_start,outage_duration,incident_duration
    Cell management system bug,2024-07-30T21:45:00.000000+0000,0.2861111111111111,1.4951388888888888
    Latent software defect,2023-06-13T18:49:00.000000+0000,0.08055555555555555,0.15833333333333333
    Automated scaling activity,2021-12-07T15:30:00.000000+0000,0.2861111111111111,0.3736111111111111
    Network device operating system bug,2021-09-01T22:30:00.000000+0000,0.2583333333333333,0.2583333333333333
    Thread count exceeded limit,2020-11-25T13:15:00.000000+0000,0.7138888888888889,0.7194444444444444
    Datacenter cooling system failure,2019-08-23T03:36:00.000000+0000,0.24583333333333332,0.24583333333333332
    Configuration error removed setting,2018-11-21T23:19:00.000000+0000,0.058333333333333334,0.058333333333333334
    Command input error,2017-02-28T17:37:00.000000+0000,0.17847222222222223,0.17847222222222223
    Utility power failure,2016-06-05T05:25:00.000000+0000,0.3993055555555555,0.3993055555555555
    Network disruption triggering bug,2015-09-20T09:19:00.000000+0000,0.20208333333333334,0.20208333333333334
    Transformer failure,2014-08-07T17:41:00.000000+0000,0.13055555555555556,3.4055555555555554
    Power loss to servers,2014-06-14T04:16:00.000000+0000,0.08333333333333333,0.17638888888888887
    Utility power loss,2013-12-18T06:05:00.000000+0000,0.07013888888888889,0.11388888888888889
    Maintenance process error,2012-12-24T20:24:00.000000+0000,0.8270833333333333,0.9868055555555555
    Memory leak in agent,2012-10-22T17:00:00.000000+0000,0.26041666666666663,0.4930555555555555
    Electrical storm causing failures,2012-06-30T02:24:00.000000+0000,0.20902777777777776,0.25416666666666665
    Network configuration change error,2011-04-21T07:47:00.000000+0000,1.4881944444444444,3.592361111111111
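
A rough sketch of the "mean time left among longer outages" estimate, using the outage_duration column above. Two assumptions here that the thread does not state: the durations are fractions of a day (which is at least consistent with the ~8 hour figure), and the helper name `mean_remaining_hours` is invented for this example. With only 17 data points, the estimate gets noisy quickly as the elapsed time grows.

```python
# outage_duration column from the CSV above, in the order listed.
# Assumption: values are in days (the data itself does not give units).
OUTAGE_DURATION_DAYS = [
    0.2861111111111111, 0.08055555555555555, 0.2861111111111111,
    0.2583333333333333, 0.7138888888888889, 0.24583333333333332,
    0.058333333333333334, 0.17847222222222223, 0.3993055555555555,
    0.20208333333333334, 0.13055555555555556, 0.08333333333333333,
    0.07013888888888889, 0.8270833333333333, 0.26041666666666663,
    0.20902777777777776, 1.4881944444444444,
]


def mean_remaining_hours(elapsed_hours, durations_days=OUTAGE_DURATION_DAYS):
    """Mean remaining time, in hours, over past outages longer than elapsed_hours.

    Returns None when no past outage lasted that long.
    """
    remaining = [
        d * 24 - elapsed_hours
        for d in durations_days
        if d * 24 > elapsed_hours
    ]
    if not remaining:
        return None
    return sum(remaining) / len(remaining)


if __name__ == "__main__":
    for t in (0, 2, 4, 6, 8):
        est = mean_remaining_hours(t)
        label = "no longer outages in data" if est is None else f"{est:.1f} h"
        print(f"elapsed {t:>2} h -> estimated time left: {label}")
```

For a truly exponential distribution this estimate would be the same at every elapsed time (memorylessness), so how flat the printed values are is a quick sanity check on the "pretty exponential" observation.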

rwky|4 months ago

Generally, expect issues for the rest of the day. AWS will recover slowly, then anyone who relies on AWS will recover slowly. All the background jobs that are stuck will need processing.

jameshart|4 months ago

Rule of thumb is that the estimated remaining duration of an outage is equal to the current elapsed duration of the outage.

seydor|4 months ago

1440 min