item 38864535

jwestbury | 2 years ago

One of the principal engineers I used to work with at AWS had a saying: "A one-year certificate expiration is an outage you schedule a year in advance." Of course, it's a bit hyperbolic -- but a ten-year expiration is almost certain to result in an outage.

In a similar vein, you should never generate resources which will expire unless some undocumented action is taken. A common one I've seen is self-signed certs which last for n days, and are re-generated whenever an application is deployed or restarted, under the assumption that the application will never run untouched longer than that. (Spoiler: It probably will, at some point, whether due to unexpected change freezes, going into maintenance mode, or -- in my personal favourite -- being deployed to an environment that just isn't updated as regularly.)
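A minimal sketch of that failure mode (the dates, the n = 90 lifetime, and the freeze window are all invented for illustration):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical deploy-time self-signed cert, valid for n = 90 days.
# Every deploy or restart regenerates it, resetting the clock.
N_DAYS = 90
generated_at = datetime(2023, 1, 10, tzinfo=timezone.utc)  # last deploy
not_after = generated_at + timedelta(days=N_DAYS)

def days_remaining(now: datetime) -> int:
    return (not_after - now).days

# As long as deploys happen inside the window, the cert never expires:
print(days_remaining(datetime(2023, 3, 1, tzinfo=timezone.utc)))  # 40

# But a change freeze (or a neglected environment) outlives the window:
print(days_remaining(datetime(2023, 6, 1, tzinfo=timezone.utc)))  # -52
```

The assumption baked into the design ("we always redeploy within n days") is exactly the undocumented action the comment warns about.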

Twirrim | 2 years ago

That Principal Engineer's knowledge came from painful, repeated experience at AWS. When I left AWS in 2016 they were trying to push towards three-monthly cert rotations, and hoping to get it shorter still.

A year-long expiry isn't frequent enough that you build automation, and is long enough that the runbook you have is likely out of date before the next time you execute it. If you make it three-monthly, rotation is more likely to be fully or mostly automated, and you're more likely to remember that certs were recently introduced in a particular service. If you make it monthly, it's pretty much guaranteed to be fully automated.
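The kind of check a short cadence forces you to automate can be sketched roughly like this (the host inventory and the 30-day window are invented for illustration; a real fleet would pull the inventory from config management or a scanner):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical inventory mapping hosts to their certs' notAfter timestamps,
# in the format openssl prints ("%b %d %H:%M:%S %Y GMT").
inventory = {
    "api.internal": "Jun 1 12:00:00 2024 GMT",
    "queue.internal": "Dec 31 23:59:59 2025 GMT",
}

def expiring_soon(not_after: str, now: datetime, window_days: int = 30) -> bool:
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return expiry - now < timedelta(days=window_days)

now = datetime(2024, 5, 15, tzinfo=timezone.utc)
due = sorted(h for h, na in inventory.items() if expiring_soon(na, now))
print(due)  # ['api.internal']
```

Run monthly, a report like this is boring and routine; run yearly, nobody remembers it exists.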

Almost every week in the company-wide AWS ops meetings, one service or another would be talking about an incident caused by some certificate expiring, in a place they'd forgotten they had certificates or had missed when they did the rotation. A number of those failures presented in particularly misleading ways, too, by nature of the role the cert was playing.

dataflow | 2 years ago

Does one actually manage to avoid such outages for 10 years by making the problem recur every month? 'cause I feel like stuff would still break even if you test and run them regularly.

MichaelZuo | 2 years ago

Sounds like they need a system that actually gets remembered and referenced if they want to stick to one-year expiries.

tjoff | 2 years ago

One day I couldn't connect to my (home) server. It turned out the client certificate had expired; I never thought to make note of, or increase, the 10-year default value when I did my test configuration...

m463 | 2 years ago

I remember there being a weird clock-rollover bug that only financial firms would hit (since they never took their machines down, ever).

That was a long time ago. I wonder if technology/the cloud has changed, or whether they still run those same machines.

bluGill | 2 years ago

Thirty years ago, companies were rebooting their mainframes twice a year just to make sure. Before doing that, companies were burned when the mainframe went down accidentally (the backup generator broke during a power outage) and they couldn't get it to start, because someone had changed a setting at runtime but hadn't saved it to the boot scripts -- and then that person retired or found a new job. Rebooting twice a year ensured that someone still remembered what setting had been changed when the system failed to start.

chatmasta | 2 years ago

Financial firms will also hit time-based bugs before most organizations because they often deal with forecasting events 30+ years in the future (e.g. mortgages). For a bank, the 2038 rollover has been relevant since 2008.
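For concreteness, the arithmetic behind that date works out like this (a rough sketch; the maturity date is invented, and `as_int32` just simulates a signed 32-bit field):

```python
from datetime import datetime, timezone

# A signed 32-bit time_t runs out 2**31 - 1 seconds after the Unix epoch.
LIMIT = 2**31 - 1
print(datetime.fromtimestamp(LIMIT, tz=timezone.utc))  # 2038-01-19 03:14:07+00:00

# A 30-year mortgage originated after early 2008 matures past that limit,
# so its maturity timestamp no longer fits in a signed 32-bit field:
maturity = int(datetime(2038, 6, 1, tzinfo=timezone.utc).timestamp())

def as_int32(x: int) -> int:
    """Simulate storing x in a signed 32-bit integer (wraps on overflow)."""
    return (x + 2**31) % 2**32 - 2**31

print(maturity > LIMIT)        # True
print(as_int32(maturity) < 0)  # True: the stored date wraps negative
```

So any system doing 30-year date math in 32-bit epoch seconds started producing wrapped dates three decades before 2038 itself.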

gorkish | 2 years ago

I hit one of these on an EMC VNX array one time; after ~400 days all the controllers crashed at the same time. Didn't help that it happened at 4am on New Year's Day. I do recall other instances of this class of bug, but nothing specific.

nitwit005 | 2 years ago

I had to do a release to fix an outage because someone set up a system that would have an outage every six months if no one ran a release.

Naturally, they didn't document this.