My experience after 20 years in the hosting industry is that customers in general have more downtime due to self-inflicted over-engineered replication, or split brain errors than actual hardware failures. One server is the simplest and most reliable setup, and if you have backup and automated provisioning you can just re-deploy your entire environment in less than the time it takes to debug a complex multi-server setup.
I'm not saying everybody should do this. There are of-course a lot of services that can't afford even a minute of downtime. But there is also a lot of companies that would benefit from a simpler setup.
Yep. I know people will say, “it’s just a homelab,” but hear me out: I’ve ran positively ancient Dell R620s in a Proxmox cluster for years. At least five. Other than moving them from TX to NC, the cluster has had 100% uptime. When I’ve needed to do maintenance, I drop one at a time, and it maintains quorum, as expected. I’ll reiterate that this is on circa-2012 hardware.
In all those years, I’ve had precisely one actual hardware failure: a PSU went out. They’re redundant, so nothing happened, and I replaced it.
Servers are remarkably resilient.
EDIT: 100% uptime modulo power failure. I have a rack UPS, and a generator, but once I discovered the hard way that the UPS batteries couldn’t hold a charge long enough to keep the rack up while I brought the generator online.
My single on-premise Exchange server is drastically more reliable than Microsoft's massive globally resilient whatever Exchange Online, and it costs me a couple hours of work on occasion. I probably have half their downtime, and most of mine is scheduled when nobody needs the server anyhow.
I'm not a better engineer, I just have drastically fewer failure modes.
A lot of this attitude comes from the bad old days of 90s and early 2000s spinning disk. Those things failed a lot. It made everyone think you are going to have constant outages if you don’t cluster everything.
Today’s systems don’t fail nearly as often if you use high quality stuff and don’t beat the absolute hell out of SSD. Another trick is to overprovision SSD to allow wear leveling to work better and reduce overall write load.
Do that and a typical box will run years and years with no issues.
> My experience after 20 years in the hosting industry is that customers in general have more downtime due to self-inflicted over-engineered replication, or split brain errors than actual hardware failures.
I think you misread OP. "Single point of failure" doesn't mean the only failure modes are hardware failures. It means that if something happens to your nodes whether it's hardware failure or power outage or someone stumbling on your power/network cable, or even having a single service crashing, this means you have a major outage on your hands.
These types of outages are trivially avoided with a basic understanding of well-architected frameworks, which explicitly address the risk represented by single points of failure.
In my experience, my personal services have gone down exactly zero times. Actually not entirely true, but every time they stopped working the servers had simply run out of disk space.
The number of production incidents on our corporate mishmash of lambda, ecs, rds, fargate, ec2, eks etc? It’s a good week when something doesn’t go wrong. Somehow the logging setup is better on the personal stuff too.
I also have seem the opposite somewhat frenquently: some team screws up the server and unrelated stable services that are running since forever (on the same server) are now affected due messing up the environment.
Puts|6 months ago
I'm not saying everybody should do this. There are of-course a lot of services that can't afford even a minute of downtime. But there is also a lot of companies that would benefit from a simpler setup.
sgarland|6 months ago
In all those years, I’ve had precisely one actual hardware failure: a PSU went out. They’re redundant, so nothing happened, and I replaced it.
Servers are remarkably resilient.
EDIT: 100% uptime modulo power failure. I have a rack UPS, and a generator, but once I discovered the hard way that the UPS batteries couldn’t hold a charge long enough to keep the rack up while I brought the generator online.
ocdtrekkie|6 months ago
I'm not a better engineer, I just have drastically fewer failure modes.
api|6 months ago
Today’s systems don’t fail nearly as often if you use high quality stuff and don’t beat the absolute hell out of SSD. Another trick is to overprovision SSD to allow wear leveling to work better and reduce overall write load.
Do that and a typical box will run years and years with no issues.
motorest|6 months ago
I think you misread OP. "Single point of failure" doesn't mean the only failure modes are hardware failures. It means that if something happens to your nodes whether it's hardware failure or power outage or someone stumbling on your power/network cable, or even having a single service crashing, this means you have a major outage on your hands.
These types of outages are trivially avoided with a basic understanding of well-architected frameworks, which explicitly address the risk represented by single points of failure.
Aeolun|6 months ago
The number of production incidents on our corporate mishmash of lambda, ecs, rds, fargate, ec2, eks etc? It’s a good week when something doesn’t go wrong. Somehow the logging setup is better on the personal stuff too.
talles|6 months ago
jeffrallen|6 months ago
Sigh.
ies7|6 months ago
joek1301|6 months ago
lelanthran|6 months ago
Is that more, less than or about the same as having an AWS/Azure/GCP consultant?
What's the difference in labour per hour?
> the risk of having such single point of failure.
At the prices they charge I can have two hot failovers in two other datacenter and still come out ahead.
wmf|6 months ago
chrisweekly|6 months ago
juped|6 months ago
justmarc|6 months ago