(no title)
Puts|6 months ago
I'm not saying everybody should do this. There are of course a lot of services that can't afford even a minute of downtime. But there are also a lot of companies that would benefit from a simpler setup.
sgarland|6 months ago
In all those years, I’ve had precisely one actual hardware failure: a PSU went out. They’re redundant, so nothing happened, and I replaced it.
Servers are remarkably resilient.
EDIT: 100% uptime modulo power failure. I have a rack UPS, and a generator, but once I discovered the hard way that the UPS batteries couldn’t hold a charge long enough to keep the rack up while I brought the generator online.
whartung|6 months ago
We had a rack in a data center, and we wanted to put a local UPS on the critical machines in the rack.
But the data center went on and on about their awesome power grid (shared with a fire station, so no administrative power loss), on site generators, etc., and wouldn't let us.
Sure enough, one day the entire rack went dark.
It was the power strip on the data center's rack that failed. All the backup grids in the world can't get through a dead power strip.
(FYI, a family member lost their home due to a power strip, so, again, anecdotally, if you have any older power strips (5-7+ years) sitting under your desk at home, you may want to consider swapping them out for new ones.)
ocdtrekkie|6 months ago
I'm not a better engineer, I just have drastically fewer failure modes.
talles|6 months ago
api|6 months ago
Today’s systems don’t fail nearly as often if you use high-quality stuff and don’t beat the absolute hell out of your SSDs. Another trick is to overprovision the SSDs so wear leveling works better and the overall write load goes down.
Do that and a typical box will run years and years with no issues.
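For anyone who wants to put numbers on the overprovisioning trick, here is a rough sketch of the arithmetic. It assumes the common convention of expressing spare area as a fraction of the user-visible capacity, and it assumes the drive actually treats the unpartitioned space as free (never written, or trimmed). The function names and the example sizes are made up for illustration.

```python
def overprovision_ratio(physical_gb: float, usable_gb: float) -> float:
    """Spare area as a fraction of the user-visible capacity.

    Assumption: ratio = (physical - usable) / usable, the convention
    commonly used when talking about SSD over-provisioning.
    """
    if usable_gb <= 0 or usable_gb > physical_gb:
        raise ValueError("usable_gb must be in the range (0, physical_gb]")
    return (physical_gb - usable_gb) / usable_gb


def usable_size_for_target(physical_gb: float, target_ratio: float) -> float:
    """How much of the drive to partition to reach a target spare ratio."""
    if physical_gb <= 0 or target_ratio < 0:
        raise ValueError("physical_gb must be > 0 and target_ratio >= 0")
    return physical_gb / (1.0 + target_ratio)


if __name__ == "__main__":
    # Hypothetical drive: 512 GB of flash, 480 GB exposed to the OS.
    print(f"spare ratio: {overprovision_ratio(512, 480):.1%}")
    # Partition size needed to leave roughly 28% spare on the same drive.
    print(f"partition: {usable_size_for_target(512, 0.28):.0f} GB")
```

In practice that just means leaving part of the drive unpartitioned; how much it helps depends on the controller and the workload.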
motorest|6 months ago
I think you misread OP. "Single point of failure" doesn't mean the only failure modes are hardware failures. It means that if something happens to your nodes, whether it's a hardware failure, a power outage, someone tripping over your power or network cable, or even a single service crashing, you have a major outage on your hands.
These types of outages are trivially avoided with a basic understanding of well-architected frameworks, which explicitly address the risk represented by single points of failure.
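To make the single-point-of-failure argument concrete, here is a back-of-the-envelope sketch. It assumes node failures are fully independent, which is a big assumption: a shared power strip or network cable breaks it. The availability figures and function names are illustrative, not measurements.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of N redundant nodes where any one can serve traffic.

    Assumption: failures are independent, so the system is down only
    when every node is down at the same time.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 - (1.0 - node_availability) ** nodes


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical single-node availability of 99%.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} node(s): {a:.6f} available, "
              f"about {downtime_minutes_per_year(a):.1f} min/year down")
```

The flip side is that every shared component (power, network, deploy pipeline) pulls the real number back toward the single-node case, so the formula is an upper bound rather than a promise.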
fogx|6 months ago
Aeolun|6 months ago
The number of production incidents on our corporate mishmash of lambda, ecs, rds, fargate, ec2, eks etc? It’s a good week when something doesn’t go wrong. Somehow the logging setup is better on the personal stuff too.
talles|6 months ago
jeffrallen|6 months ago
Sigh.
icedchai|6 months ago