item 45760818

stefanha | 4 months ago

@graveland Which Linux interface was used for the userspace block driver (ublk, nbd, tcmu-runner, NVMe-over-TCP, etc)? Why did you choose it?

Also, were existing network or distributed file systems not suitable? This use case sounds like Ceph might fit, for example.

graveland | 4 months ago

There's some secret sauce there I don't know if I'm allowed to talk about yet, so I'll just address the existing tech we didn't use: most options either didn't have a good enough license, cost too much, or would take a TON of ramp-up and expertise we don't currently have to manage and maintain. Generally speaking, building our own lets us fully control it.

Entirely programmable storage has so far let us try a few different approaches to making things efficient and getting the features we want. We've been able to try different dedup methods, copy-on-write styles, different compression methods and types, different sharding strategies... and that's just a start. We can easily and quickly create a new experimental storage backend and see exactly how pg performs with it side-by-side with other backends.
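The "swap backends and compare" workflow described above can be sketched with a minimal pluggable interface. This is a hypothetical illustration, not graveland's actual code: all class and method names here are invented, and real backends would deal with durability, concurrency, and block alignment.

```python
# Hypothetical sketch of a pluggable block-backend interface that makes
# side-by-side storage experiments cheap. Names are illustrative only.
import zlib
from abc import ABC, abstractmethod


class Backend(ABC):
    """Minimal contract: fixed-size blocks addressed by block number."""

    @abstractmethod
    def write_block(self, blockno: int, data: bytes) -> None: ...

    @abstractmethod
    def read_block(self, blockno: int) -> bytes: ...


class PlainBackend(Backend):
    """Stores blocks verbatim."""

    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}

    def write_block(self, blockno: int, data: bytes) -> None:
        self.blocks[blockno] = data

    def read_block(self, blockno: int) -> bytes:
        return self.blocks[blockno]


class CompressedBackend(Backend):
    """Same contract, different storage strategy -- callers don't change."""

    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}

    def write_block(self, blockno: int, data: bytes) -> None:
        self.blocks[blockno] = zlib.compress(data)

    def read_block(self, blockno: int) -> bytes:
        return zlib.decompress(self.blocks[blockno])


# Run the same workload against each backend and compare behavior:
for backend in (PlainBackend(), CompressedBackend()):
    backend.write_block(0, b"A" * 4096)
    assert backend.read_block(0) == b"A" * 4096
```

Because every backend satisfies the same narrow interface, a new experimental strategy (dedup, different compression, different sharding) is just another subclass benchmarked under an identical workload.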

We're a Kubernetes shop, and we have our own CSI plugin, so we can also transparently run a pg HA pair with one pg server using EBS and the other running on our new storage layer, and easily bounce between storage types with nothing but a switchover event.

yencabulator | 3 months ago

> would take a TON of ramp-up and expertise we don't currently have to manage and maintain

But you think you have resources to maintain a distributed strongly-consistent replicating block store?

The edge cases in RBD are literally why Ceph takes expertise to manage! Things like a failure while recovering from a failure, while trying to maintain performance, are inherently tricky.

adsharma | 4 months ago

Ceph is under the LGPL, so cost doesn't seem to be a barrier. It supports k8s through CSI and has observability and documentation.

You can probably hire people to maintain it.

So was it the ramp-up cost or the expertise?

kjetijor | 4 months ago

I was struck by how similar this seems to Ceph/RADOS/RBD. The way they implemented snapshotted block storage on top sounds more or less exactly like how RBD is implemented on top of RADOS in Ceph.
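For readers unfamiliar with the comparison: RBD's core trick is striping a block image across fixed-size RADOS objects (4 MiB by default), so a byte offset in the image maps deterministically to an object name plus an offset within that object. A rough sketch of that mapping, with the image-ID format simplified:

```python
# Rough sketch of RBD-style striping: a block image is split into
# fixed-size objects, and a byte offset maps to (object name, offset).
OBJECT_SIZE = 4 * 1024 * 1024  # RBD's default object size (4 MiB)


def locate(image_id: str, byte_offset: int) -> tuple[str, int]:
    """Map a byte offset in a block image to its backing object."""
    object_no = byte_offset // OBJECT_SIZE
    # Format-2 RBD images name data objects rbd_data.<id>.<16-hex-digit number>.
    return f"rbd_data.{image_id}.{object_no:016x}", byte_offset % OBJECT_SIZE


# A write at 5 MiB lands 1 MiB into the image's second object:
name, off = locate("abc123", 5 * 1024 * 1024)
# name == "rbd_data.abc123.0000000000000001", off == 1048576
```

Snapshots then fall out of the object layer: each object can keep per-snapshot clones, so snapshotting the image is a metadata operation rather than a copy, which is presumably the similarity kjetijor is pointing at.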

adsharma | 4 months ago

One of the problems with Ceph is that it doesn't operate at the highest possible throughput or the lowest possible latency.

DAOS seemed promising a couple of years ago, but in terms of popularity it seems to be stuck: no Ubuntu packages, no widespread deployment, and Optane got killed.

Still, the NVMe + metadata approach seemed promising.

Would love to see more databases fork it and adapt it to their needs.

Or, if folks have looked at it and decided against it, an analysis of why would be super interesting.