>Take the recent Lichess downtime, for example. Their main server had a hardware issue that required physical intervention. This meant the site was down for over 10 hours, and there wasn't much they could do except wait for OVH to send a tech.

If you're not an HN person with sysadmin skills, yes. But it is NOT that hard to have an in-house RAID HDD setup with a failover server, or a failover NAT gateway. AWS and other cloud providers are just a rip-off.


lelag|1 year ago

It is hard.

Lichess admins are highly skilled, and I'm sure they already have a well-designed infrastructure. You can see what they use at https://docs.google.com/spreadsheets/d/1Si3PMUJGR9KrpE5lngSk...

The issue was with network equipment that they didn't even manage. You can't load balance when your core network is down. As I understand it, there was nothing they could do.

More details at: https://lichess.org/@/Lichess/blog/post-mortem-of-our-longes...

lossolo|1 year ago

Their architecture is not fault-tolerant. If one server goes down and the whole system goes down, then it was not designed to be fault-tolerant.

I have been running fault-tolerant systems spread across multiple dedicated servers (internally: multiple distributed/replicated/sharded DB/KV stores, Kafka, etc.). If one server experiences a hardware failure, the system automatically recovers within seconds to minutes (depending on which server or part of the service failed) without any data loss.

It's not that hard. You need the knowledge, but it's not rocket science.
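To make the "automatically recover" part concrete, here is a minimal sketch of health-check-driven failover, not anyone's actual setup. Everything in it is a hypothetical illustration: the `Node` type, the `FAILURE_THRESHOLD` of 3 consecutive failed checks, and the priority-ordered node list are assumptions. Real systems also have to handle replication lag, split-brain, and fencing, which this ignores.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical: number of consecutive failed health checks
# before a node is treated as down.
FAILURE_THRESHOLD = 3


@dataclass
class Node:
    name: str
    consecutive_failures: int = 0


def record_check(node: Node, ok: bool) -> None:
    """Update a node's failure counter after one health check."""
    node.consecutive_failures = 0 if ok else node.consecutive_failures + 1


def is_healthy(node: Node) -> bool:
    """A node is healthy until it hits the failure threshold."""
    return node.consecutive_failures < FAILURE_THRESHOLD


def pick_active(nodes: list[Node]) -> Node | None:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


if __name__ == "__main__":
    primary, standby = Node("primary"), Node("standby")
    for _ in range(FAILURE_THRESHOLD):
        record_check(primary, ok=False)  # simulate a hardware failure
    active = pick_active([primary, standby])
    print(active.name if active else "no healthy node")
```

Recovery time in a scheme like this is roughly the check interval times the threshold, which is where "seconds to minutes" tends to come from. Note that it only helps if the standby sits behind network equipment that is still up.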

olieidel|1 year ago

Even something as magical as a RAID won't make a technician instantly teleport to your server, power it down in zero seconds, swap out the hard drive and boot it back up in another zero seconds.

OP's comment is valid - physical servers might incur downtime.

But I do agree with your sentiment. "Downtime" is not an argument which should tilt the discussion towards either physical servers or the cloud. AWS data centers famously also have outages, while physical servers often have uptimes of multiple years. So what's better? It's hard to tell, but at the very least, none of these solutions is downtime-free.
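For a sense of scale, here is a rough back-of-the-envelope calculation of what a single outage does to yearly availability. The 10-hour figure is the lower bound from the quoted text ("over 10 hours"); the one-year window and the function names are my own assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_hours / period_hours


def allowed_downtime_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget (in hours) for a given availability target."""
    return (1.0 - target) * period_hours


if __name__ == "__main__":
    # One 10-hour outage and nothing else for the rest of the year.
    print(f"{availability(10) * 100:.3f}%")  # about 99.886%
    # Yearly downtime budget for "three nines" (99.9%).
    print(f"{allowed_downtime_hours(0.999):.2f} h")  # about 8.76 hours
```

So one such outage alone puts a year slightly below three nines, whether it happens on a rented physical server or in a cloud region.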

rcarmo|1 year ago

No, but if you have backups and DR set up, most hyperscalers will automatically move your workload somewhere else within minutes of a failure (state management complexity notwithstanding; you need to architect for that).