dub | 2 years ago

Typically the price of not having horizontal scaling is felt more by the engineers than the users, at first:

- Data migrations, schema changes, backfills, backups and restores, etc. take so long that they either cause or risk outages, or just waste a ton of engineer time waiting for operations to complete. If you have serious service level objectives for time to restore from backup, that alone can be a forcing function for horizontal sharding: a point-in-time restore of a 40TB database that skips some unwanted DELETE transaction in the transaction log is going to be very slow and cause a long outage.

- The lack of fault isolation means that any rogue user or process making expensive queries hurts performance and availability for all users, whereas with sharding you can limit the unavailability to a single shard.

- When people don't have horizontal scalability, I've seen them normalize things like not using transactions and not using consistent reads even when both would substantially improve developer and end-user experience, with the explanation being a need to protect the primary/write database. It's kind of like being in an abusive relationship: you internalize the fear of overloading the primary/write server and start to think it's normal not to be able to consistently read back the data you just wrote at scale or not to be able to use transactions that span multiple tables or seconds as appropriate.
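
To make the fault isolation point concrete, here is a minimal sketch, assuming users are assigned to shards by a stable hash of their ID. The shard count, the `shard_for` and `affected_users` helpers, and the `user-N` IDs are all hypothetical and chosen only for illustration; real systems often use a lookup table or range-based placement instead.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard using a stable hash.

    sha256 is used instead of the built-in hash(), which is salted per
    process and would route the same user differently across restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def affected_users(user_ids, bad_shard, num_shards=NUM_SHARDS):
    """Return the users who live on the overloaded shard."""
    return [u for u in user_ids if shard_for(u, num_shards) == bad_shard]


users = [f"user-{i}" for i in range(1000)]  # made-up user IDs
rogue = "user-42"                           # the user running expensive queries
bad_shard = shard_for(rogue)
impacted = affected_users(users, bad_shard)
print(f"{len(impacted)} of {len(users)} users share shard {bad_shard} with {rogue}")
```

With a single unsharded database (`num_shards=1`) every user lands in the impacted list, which is the situation described in the second bullet.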

dalyons|2 years ago

To your first point: in these discussions I find the “just buy a bigger server” crowd massively underestimates the operational problems with giant DB servers. They are no fun to babysit, and every change becomes hard and tedious because you have to avoid accidentally bringing the whole thing down. It becomes a massive drain on the velocity and agility of the business.

baobun|2 years ago

Giant sharded DB clusters aren't that much more fun or less precarious... Ever run Cassandra or Clickhouse at scale?

IME vertically scaled replicas/hot standbys are a lot more stable to operate if your requirements allow you to get away with it. OTOH, you'd better already be prepared if/when you hit scaling limits.