charleshn | 5 months ago
> Is hardware agnostic and uses TCP/IP to communicate.
So no RDMA? It's very hard to make effective use of modern NVMe drives' bandwidth over TCP/IP.
> A logical shard is further split into five physical instances, one leader and four followers, in a typical distributed consensus setup. The distributed consensus engine is provided by a purpose-built Raft-like implementation, which we call LogsDB
Raft-like, so not Raft, a custom algorithm? Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
> Read/write access to the block service is provided using a simple TCP API currently implemented by a Go process. This process is hardware agnostic and uses the Go standard library to read and write blocks to a conventional local file system. We originally planned to rewrite the Go process in C++, and possibly write to block devices directly, but the idiomatic Go implementation has proven performant enough for our needs so far.
The document mentions it's designed to reach TB/s though. Which means that for an IO intensive workload, one would end up wasting a lot of drive bandwidth, and require a huge number of nodes.
Modern parallel filesystems can reach 80-90GB/s per node, using RDMA, DPDK etc.
> This is in contrast to protocols like NFS, whereby each connection is very stateful, holding resources such as open files, locks, and so on.
This is not true for NFSv3 and older, which tend to be stateless (there is no notion of an open file).
No mention of the way this was developed and tested - does it use some formal methods, simulator, chaos engineering etc?
jleahy | 5 months ago
We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it's just a transport-layer change), but so far we haven't needed to. We've had to make some difficult prioritisation decisions here.
PRs welcome!
> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
We’re used to building things like this; trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated: right now there is no automatic failover enabled. Failures are rare and we will only enable that post-Jepsen.
If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.
> This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).
Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding.
AtlasBarfed | 5 months ago
Not to mention a Jepsen test suite, detailed CAP tradeoff explanation, etc.
There's a reason the big DFSes at the FAANGs aren't really implemented anywhere else: they NEED the original authors and a big, deeply experienced infrastructure team in house.