top | item 43524866

(no title)

cullenking | 11 months ago

Enterprise server gear is pretty reliable, and you build your infra to be fully redundant. In our setup, no single machine failure will take us offline. I have 13 machines in a rack running a > 10mm ARR business, and haven't had any significant hardware failures. We have had occasional drive failures, but everything is a RAID1 at a minimum so they are a non issue.

We just replaced our top of rack firewall/proxies that were 11 years old and working just fine. We did it for power and reliability concerns, not because there was a problem. App servers get upgraded more often, but that's because of density and performance improvements.

What does cause a service blip fairly regularly is a single upstream ISP. I will have a second ISP into our rack shortly, which means that whole class of short outage will go away. It's really the only weak spot we've observed. That being said, we are in a nice datacenter that is a critical hub in the pacific northwest. I'm sure a budget datacenter will have a different class of reliability problems that I am not familiar with.

But again, an occasional 15m outage is really not a big deal business wise. Unless you are running a banking service or something, no one cares when something happens for 15m. Heck, all my banks regularly have "maintenance" outages that are unpredictable. I promise, no one relaly cares about five nines of reliability in the strong majority of services.

discuss

Axsuul|11 months ago

Sounds great. Yep, what I mean is you will need to make your systems fully redundant before considering cloud if your business depends on reliability and uptime. That usually requires the business to reach a certain scale first.

sgarland|11 months ago

Sure, but making something redundant is not really that difficult. HAProxy in front N nodes across M racks, ideally in separate DCs, and then a floating IP in front of your HAProxies. Set up hot standby for your DB.

I used to joke that my homelab almost had better reliability than any company I’d been at, save for my ISP’s spotty availability. Now that I have a failover WAN, it literally is more reliable. In the five years of running a rack, I’ve had precisely one catastrophic hardware failure (mobo died on a Supermicro). Even then, I had a standby node, so it was more of an annoyance (the standby ran hotter and louder) than anything.