top | item 32477621

(no title)

mikesun | 3 years ago

> Is it really a problem to bubble up that error (and kill the database server) if you can just bring up a new database server with a clean cache (potentially even on the same computer without rebooting it) instantly?

Our estimations for MTTF for our larger clusters would mean there'd be a risk of simultaneous nodes stopping due to bad sector reads. Remediation in that case would basically require cleaning and rewarming the cache, which for large data sets could be on the order of an hour or more, which would mean we'd lose quorum availability during that time.

> to me your solution sounds absolutely brutal for needing a complete copy of the remote disk on the local "cache" disk at all times for the RAID array to operate at all, meaning it will be much harder to quickly recover from other hardware failures.) In Scylla/Cassandra, you need to run full repairs that scan over all of the data. Having an LRU cache doesn't work well with this.

discuss

No comments yet.