charleshn | 5 months ago
> Is hardware agnostic and uses TCP/IP to communicate.
So no RDMA? It's very hard to make effective use of modern NVMe drives' bandwidth over TCP/IP.
> A logical shard is further split into five physical instances, one leader and four followers, in a typical distributed consensus setup. The distributed consensus engine is provided by a purpose-built Raft-like implementation, which we call LogsDB
Raft-like, so not Raft, a custom algorithm? Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
> Read/write access to the block service is provided using a simple TCP API currently implemented by a Go process. This process is hardware agnostic and uses the Go standard library to read and write blocks to a conventional local file system. We originally planned to rewrite the Go process in C++, and possibly write to block devices directly, but the idiomatic Go implementation has proven performant enough for our needs so far.
The document mentions it's designed to reach TB/s though. Which means that for an IO intensive workload, one would end up wasting a lot of drive bandwidth, and require a huge number of nodes.
Modern parallel filesystems can reach 80-90GB/s per node, using RDMA, DPDK etc.
> This is in contrast to protocols like NFS, whereby each connection is very stateful, holding resources such as open files, locks, and so on.
This is not true for NFSv3 and older, which tend to be stateless (there is no notion of an open file).
No mention of the way this was developed and tested - does it use some formal methods, simulator, chaos engineering etc?
jleahy | 5 months ago
We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it's just a transport-layer change), but so far we haven't needed to. We've had to make some difficult prioritisation decisions here.
PRs welcome!
> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
We’re used to building things like this; trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated: right now there is no automatic failover enabled. Failures are rare and we will only enable that post-Jepsen.
If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.
> This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).
Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding.
AtlasBarfed | 5 months ago
Not to mention a Jepsen test suite, detailed CAP tradeoff explanation, etc.
There's a reason the big DFSes at the FAANGs aren't really implemented anywhere else: they NEED the original authors and a big, deeply experienced infrastructure team in house.