item 43360005

CSDude | 11 months ago

For years, I just didn't get why replicated databases always stick with EBS and put up with its latency. Replication is already there, so why not be brave and just go with local disks? At my previous org, where we ran Elasticsearch for temporary logs/metrics storage, I proposed we do exactly that, since we didn't even have major reliability requirements. But I couldn't convince them back then, and we ended up on the even worse AWS-managed Elasticsearch.

I get that local disks are finite, yeah, but I think the core/memory/disk ratio would be good enough for most use cases, no? There are plenty of local-disk instance types with different ratios, so a good balance could be found. You could even use the local-HDD ones with 20TB+ of disk to implement hot/cold storage.

Big kudos to the PlanetScale team; they're finally doing what makes sense. I mean, even AWS themselves don't run Elasticsearch on local disks! Imagine running ClickHouse, Cassandra, all of that on local disks.

jiggawatts | 11 months ago

I looked into this with the idea of running SQL Server Availability Groups on the Azure Las_v3 series VMs, which have terabytes of local SSD.

The main issue was that after a stop-start event, the disks are wiped. SQL Server can't handle this automatically: even if the rest of the cluster is fine and replicas are available, it won't auto-repair the node that got reset. The scripting and testing required to work around this would be unsupportable in production for all but the bravest and most competent orgs.
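To give a sense of the kind of scripting involved, here's a minimal sketch of a boot-time guard (the path, filename, and re-seeding step are my assumptions, not anything SQL Server ships): check whether the local-SSD data directory survived the stop-start, and flag the node for re-seeding if it came back empty.

```shell
#!/bin/sh
# Hypothetical boot-time check for a replica whose data lives on
# ephemeral local SSD. Prints "data-present" if the data directory
# still has files, "needs-reseed" if it came back wiped.
check_data_dir() {
    dir="$1"
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "data-present"
    else
        echo "needs-reseed"
        # A real script would then restore from a healthy replica,
        # e.g. RESTORE DATABASE ... WITH NORECOVERY followed by
        # ALTER DATABASE ... SET HADR AVAILABILITY GROUP = <ag>,
        # or lean on availability-group automatic seeding.
    fi
}

DATA_DIR=$(mktemp -d)   # simulate a freshly wiped local SSD mount
STATE=$(check_data_dir "$DATA_DIR")
echo "$STATE"   # prints "needs-reseed"
```

The detection is the easy part; the unsupportable part is automating the re-seed safely across every failure mode, which is exactly the testing burden described above.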

hodgesrm | 11 months ago

There are a number of axes of performance that this [wonderful] article on storage performance doesn't cover. One of them is that EBS allows you to scale the VM up or down to change the amount of CPU and RAM available to process the data on disk. We run several hundred ClickHouse clusters on this model. Rescaling to address performance issues is far more common than failures.

Example: you get a tenant performance issue on Sunday morning US time. The simplest fix is often to rescale to a larger VM for the weekend, then get the A team working on the root cause first thing Monday. The incremental cost is minimal and avoids far more costly staff burnout.
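The weekend rescale can be sketched with the AWS CLI roughly as below (the instance ID and target type are placeholders; the sketch emits the commands rather than running them, so the plan can be reviewed first). It works precisely because the data lives on EBS and survives the stop/start, unlike instance-local disks.

```shell
#!/bin/sh
# Sketch of an EBS-backed instance rescale. Placeholders, not real resources.
INSTANCE_ID="i-0123456789abcdef0"
TARGET_TYPE="m6i.8xlarge"

plan_rescale() {
    # Emit each step instead of executing it.
    echo "aws ec2 stop-instances --instance-ids $INSTANCE_ID"
    echo "aws ec2 wait instance-stopped --instance-ids $INSTANCE_ID"
    echo "aws ec2 modify-instance-attribute --instance-id $INSTANCE_ID --instance-type Value=$TARGET_TYPE"
    echo "aws ec2 start-instances --instance-ids $INSTANCE_ID"
}

PLAN=$(plan_rescale)
echo "$PLAN"
```

Monday's fix then reverses the same steps with the original instance type; no data movement is involved at any point.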