(no title)
mikesun|3 years ago
One of the reasons an LRU cache like dm-cache wasn't feasible was that we saw a higher-than-acceptable bad sector read rate, which would cause a cache like dm-cache to bubble a block-device error up to the database. The database would then shut itself down when it encountered a disk-level error.
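To make the failure mode concrete: a bad sector surfaces through the block layer as EIO to whatever reads it, and that's what the database sees. Here's a minimal way to reproduce that behavior with the stock device-mapper "error" target (names and sizes are arbitrary):

    # Create a 2048-sector device where every I/O fails, standing in for
    # an SSD with bad sectors (the "error" target ships with device-mapper).
    dmsetup create badsectors --table "0 2048 error"

    # Any read gets an I/O error, the same thing a database on top would see:
    dd if=/dev/mapper/badsectors of=/dev/null bs=4096 count=1
    # dd: error reading '/dev/mapper/badsectors': Input/output error

    dmsetup remove badsectors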
> How does md handle a synchronous write in a heterogeneous mirror? Does it wait for both devices to be written?

Yes, md waits for both mirrors to be written.
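Concretely, the array shape being discussed looks something like this (a sketch, with hypothetical device paths): the network disk is flagged write-mostly so reads go to the local SSD, and by default md acknowledges a write only once both members have it. Write-behind, which requires a write-intent bitmap, is the opt-in way to let the slow member lag:

    # Sketch only; device paths are hypothetical.
    # Mirror a fast local SSD with a slow network disk; flag the slow one
    # write-mostly so md steers reads to the SSD.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/local-ssd --write-mostly /dev/persistent-disk

    # By default a write is acknowledged only once BOTH members have it.
    # Alternative: add a write-intent bitmap and write-behind so writes to
    # the write-mostly member may complete asynchronously.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \
        /dev/local-ssd --write-mostly --write-behind=256 /dev/persistent-disk

/proc/mdstat marks the write-mostly member with a "(W)", which is a quick way to verify the flag took effect.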
> I'm also curious how this solution compares to allocating more RAM to the servers and either letting the database software use it for caching, or even creating a ramdisk and putting that in RAID1 with the persistent storage, since the SSDs are being treated as volatile anyway. I assume it would be prohibitively expensive to replicate the entire persistent store into main memory.

Yeah, we're talking many terabytes.
> I'd also be interested to know how this compares with replacing the entire persistent disk / SSD system with ZFS over a few SSDs (which would also allow snapshotting). Of course, it is probably a huge feature to have snapshots integrated into your cloud...

We'd have loved to use ZFS, but Scylla requires XFS.
mikesun|3 years ago
Our MTTF estimates for our larger clusters implied a real risk of multiple nodes stopping simultaneously due to bad sector reads. Remediation in that case would basically require cleaning and rewarming each node's cache, which for large data sets could take an hour or more, and we'd lose quorum availability for that window (once enough replicas of the same data are down at the same time, quorum reads and writes can no longer be served).
> to me your solution sounds absolutely brutal for needing a complete copy of the remote disk on the local "cache" disk at all times for the RAID array to operate at all, meaning it will be much harder to quickly recover from other hardware failures.

In Scylla/Cassandra, you need to run full repairs that scan over all of the data. An LRU cache doesn't work well with that access pattern, since full scans evict the hot set.
baobob|3 years ago
Is it about the ability to dynamically add extra capacity over time or something?
ahepp|3 years ago
Is the bad sector read rate abnormally high? Are GCE's SSDs particularly error-prone? Or is the failure rate typical, but a bad sector read is just incredibly expensive?
I assume you investigated using various RAID levels to make an LRU cache acceptably reliable?
It's also surprising to me that GCE doesn't provide a suitable out-of-the-box storage solution. I thought a major benefit of the cloud was supposed to be not having to worry about things like device failures. I wonder what technical constraints are going on behind the scenes at GCE.
jhgg|3 years ago
Yes, incredibly error-prone. Bad sector reads were observed at an alarming rate over a short period, well beyond what you'd expect if you just bought an enterprise NVMe drive and slapped it into a server.
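For anyone wanting to check their own hosts: the kernel logs failed reads, and for NVMe devices smartctl (from smartmontools) exposes a media-error counter. Illustrative commands; exact log strings vary by kernel and driver:

    # Failed reads show up in the kernel log (messages vary by driver):
    dmesg | grep -iE "medium error|unrecovered read error|i/o error"

    # NVMe devices keep a media-error counter readable via smartctl:
    smartctl -a /dev/nvme0 | grep -i "media and data integrity errors"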