dub | 2 years ago
- Data migrations, schema changes, backfills, backups and restores, etc. take so long that they either cause or risk outages, or just waste a ton of engineer time waiting for operations to complete. If you have serious service level objectives for time to restore from backup, that alone could be a forcing function for horizontal sharding: a point-in-time restore of a 40TB database that skips some unwanted bad DELETE transaction in the transaction log is going to be very slow and cause a long outage.
- The lack of fault isolation means that any rogue user or process making expensive queries hurts performance and availability for all users, instead of the unavailability being limited to a single shard.
- When people don't have horizontal scalability, I've seen them normalize things like not using transactions and not using consistent reads, even when both would substantially improve developer and end-user experience, with the explanation being a need to protect the primary/write database. It's kind of like being in an abusive relationship: you internalize the fear of overloading the primary/write server and start to think it's normal that, at scale, you can't consistently read back the data you just wrote or can't use transactions that span multiple tables or several seconds where appropriate.
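
To put rough numbers on the restore-time point in the first bullet, here is a back-of-envelope sketch. The 500 MB/s sustained restore throughput and the 16-shard split are assumptions picked for illustration, not figures from any real system, and `restore_hours` is a hypothetical helper. It also assumes each shard restores in parallel on its own hardware and ignores transaction log replay time.

```python
def restore_hours(total_tb: float, throughput_mb_s: float, shards: int = 1) -> float:
    """Estimate wall-clock hours to restore a database from backup.

    Assumes data is spread evenly across shards, every shard restores in
    parallel, and each shard sustains throughput_mb_s on its own hardware.
    """
    if shards < 1:
        raise ValueError("shards must be >= 1")
    per_shard_mb = total_tb * 1_000_000 / shards  # decimal TB -> MB
    return per_shard_mb / throughput_mb_s / 3600  # seconds -> hours


# Assumed figures: 40 TB total (from the comment), 500 MB/s (illustrative).
single = restore_hours(40, 500)
sharded = restore_hours(40, 500, shards=16)
print(f"single 40TB primary:   {single:.1f} h")   # 22.2 h
print(f"16 shards in parallel: {sharded:.1f} h")  # 1.4 h
```

Under those assumptions the single primary is down for most of a day, while the sharded layout finishes in under two hours, and a bad transaction on one shard only forces a restore of that shard's 2.5 TB.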
dalyons|2 years ago
baobun|2 years ago
IME vertically scaled replicas/hot standbys are a lot more stable to operate, if your requirements allow you to get away with it. OTOH you'd better already be prepared if/when you hit scaling limits.