item 37897921

Ask HN: Why are there no open source NVMe-native key value stores in 2023?

99 points | nphase | 2 years ago

Hi HN, NVMe disks, when addressed natively in userland, offer massive performance improvements compared to other forms of persistent storage. However, in spite of the existence of projects like SPDK and SplinterDB, there don't seem to be any open source, non-embedded key value stores or DBs out in the wild yet.

Why do you think that is? Are there possibly other projects out there that I'm not familiar with?

68 comments

[+] diggan|2 years ago|reply
I don't remember exactly why I have any of them saved, but these are some experimental data stores that seem to fit what you're looking for:

- https://github.com/DataManagementLab/ScaleStore - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"

- https://github.com/unum-cloud/udisk (https://github.com/unum-cloud/ustore) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."

- https://github.com/capsuleman/ssd-nvme-database - "Columnar database on SSD NVMe"

[+] ashvardanian|2 years ago|reply
Hey, thanks for the mention! UDisk, however, hasn't been open-sourced yet. Still considering it :)
[+] geek_at|2 years ago|reply
You could also configure Redis to persist everything to disk and choose an NVMe drive as the target.
[+] formerly_proven|2 years ago|reply
There's actually an NVMe command set that lets you use the FTL directly as a K/V store. (It's limited to 16-byte keys [1], however, so it's not that useful and probably not implemented anywhere. My guess is Samsung looked at this for some hyperscaler, whipped up a prototype in their customer-specific firmware, and when the benefits were smaller than expected, it died.)

[1] These slides claim up to 32 bytes, which would be a practically useful length: https://www.snia.org/sites/default/files/ESF/Key-Value-Stora... but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Va...

[+] londons_explore|2 years ago|reply
Presumably there is some way to use the hash of the actual key as the key, and then store both key and value as data?

16 bytes is long enough that collisions will be super rare, and while you obviously need to write code to support that case, it should have no performance impact.
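The scheme described above can be sketched in a few lines. This is a hypothetical illustration, with a plain dict standing in for the device's 16-byte-key KV interface; a real design would also need a fallback slot for the (astronomically rare) collision case, which is only stubbed out here:

```python
import hashlib

# Stand-in for the device's KV command set: keys must be exactly 16 bytes.
device = {}

def kv_key(user_key):
    """Derive the fixed 16-byte device key from an arbitrary user key."""
    return hashlib.blake2b(user_key, digest_size=16).digest()

def put(user_key, value):
    # Store the full user key alongside the value so reads can
    # detect hash collisions.
    klen = len(user_key).to_bytes(4, "little")
    device[kv_key(user_key)] = klen + user_key + value

def get(user_key):
    blob = device.get(kv_key(user_key))
    if blob is None:
        return None
    klen = int.from_bytes(blob[:4], "little")
    stored_key, value = blob[4:4 + klen], blob[4 + klen:]
    if stored_key != user_key:
        # ~2^-128 probability per lookup; needs a fallback path in practice
        raise RuntimeError("hash collision, fallback not implemented")
    return value

put(b"user:42:profile", b"{...}")
```

Verifying the stored key on read is what makes the collision case correct without slowing down the common path.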

[+] londons_explore|2 years ago|reply
I think some devices built the block storage on top of the key-value store. I.e., when you write "hello..." (4k bytes) to address 123, it actually saves key 123 with value "hello...".

If so, that is probably the reason for a 16 byte key - there is just no way anybody needs a key bigger than 16 bytes for an address anytime soon.

[+] londons_explore|2 years ago|reply
I could imagine that if this mode isn't widely used, drive manufacturers haven't given much thought to performance, and it therefore might suck.
[+] jiggawatts|2 years ago|reply
Note that some cloud VM types expose entire NVMe drives as-is, directly to the guest operating system, without hypervisors or other abstractions in the way.

The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example.

Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/las...

[+] rwmj|2 years ago|reply
Is there not any danger of tenants rewriting the firmware on these drives, and surprising (or compromising) future tenants? AIUI this is the central reason why even "baremetal" cloud instances still have a minimal hypervisor between the tenant and the hardware.
[+] nerpderp82|2 years ago|reply
Eatonphil posted a link to this paper https://web.archive.org/web/20230624195551/https://www.vldb.... a couple hours after this post (zero comments [0])

> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.

[0] https://news.ycombinator.com/item?id=37899886

[+] bestouff|2 years ago|reply
Naive question: are there really meaningful gains from addressing an NVMe disk natively versus running a regular key-value database on a filesystem?
[+] chaos_emergent|2 years ago|reply
I believe NVMe uses multiple I/O queues, compared to serialized access with SATA, and I think you'd be able to sidestep unnecessary abstractions like filesystems and block-based access with an NVMe-specific datastore.

I'm also curious whether different, more performant data structures can be leveraged; if so, there may be downstream improvements for garbage collection, retrieval, and request parallelism.

[+] creshal|2 years ago|reply
Latency ought to be much better, since you're skipping several abstraction layers in the kernel.

But that's about it. And the latency is still worse than in-memory solutions.

Between that and the non-trivial effort needed to make this work in any sort of cloud setup (be it self-hosted k8s or AWS), it's a hard sell. If I really need latency above all, AWS gives me instances with 24TB RAM, and if I don't… why not just use existing kv-stores and accept the couple of ns extra latency?

[+] threeseed|2 years ago|reply
Significant gains if you want a distributed key-value database, because you can take advantage of NVMe-oF.
[+] di4na|2 years ago|reply
Yes, mostly on the durability side. NVMe has the relevant API to be sure that a write was flushed, while POSIX-like filesystem APIs usually don't handle that well.
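For comparison, the closest thing the POSIX file API offers is a sync call after the write: a coarse, whole-file barrier rather than NVMe's per-command Force Unit Access or Flush. A minimal sketch (the path is illustrative):

```python
import os

def durable_append(path, record):
    """Append a record and don't return until it is on stable media.
    os.fdatasync blocks until the data (and any metadata needed to
    find it) has reached the device -- the filesystem-level analogue
    of an NVMe write with FUA or an explicit Flush command."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)
        os.fdatasync(fd)  # the durability barrier
    finally:
        os.close(fd)

path = "/tmp/wal_demo.log"
if os.path.exists(path):
    os.remove(path)
durable_append(path, b"commit 1\n")
```

Note that fdatasync gives no acknowledgment granularity: it syncs everything outstanding on the file, which is part of why kernel-bypass designs can do better.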
[+] delfinom|2 years ago|reply
https://github.com/OpenMPDK/KVRocks

However, given that most of the world has shifted to VMs, where disks are often split across multiple users, direct KV storage isn't accessible to most deployments, so the overall demand for this would be low.

[+] londons_explore|2 years ago|reply
NVMe allows namespaces to be created, effectively letting multiple users share an NVMe device without interfering with each other.
[+] otterley|2 years ago|reply
Because you haven't written it yet!
[+] infamouscow|2 years ago|reply
I work on a database that is a KV-store if you squint enough and we're taking advantage of NVMe.

One thing they don't tell you about NVMe is you'll end up bottlenecked on CPU and memory bandwidth if you do it right. The problem is after eliminating all of the speed bumps in your IO pathway, you have a vertical performance mountain face to climb. People are just starting to run into these problems, so it's hard to say what the future holds. It's all very exciting.

[+] caeril|2 years ago|reply
> non-embedded key value stores or DBs out in the wild yet

I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a SEVEN LAYER STACK OF ABSTRACTIONS.

You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency.

You don't get to do both simultaneously.

Embedded is a feature for performance-aware software, not a bug.

[+] rubiquity|2 years ago|reply
I think it's mostly because while the internal parallelism of NVMe is fantastic our logical use of them is still largely sequential.
[+] Already__Taken|2 years ago|reply
A SeaweedFS volume store sounds like a good candidate for splitting its volumes across the NVMe queues. You're supposed to give it a whole disk to use anyway.
[+] zupa-hu|2 years ago|reply
Is there any performance gain over writing append-only data to a file?

I mean, using a merkle tree or something like that to make sense of the underlying data.
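The usual answer to this question is that an append-only log is great for writes but needs an index for reads. A minimal sketch (in-memory index, values only in the log, no recovery or compaction):

```python
import os
import tempfile

class LogKV:
    """Minimal append-only log with an in-memory offset index.
    Appends are sequential (fast on any medium); a point read is one
    seek. Without the index, every read would scan the whole log."""

    def __init__(self, path):
        self.f = open(path, "ab+")
        self.index = {}  # key -> (offset, length) of latest value

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        off = self.f.tell()
        self.f.write(value)
        self.index[key] = (off, len(value))

    def get(self, key):
        if key not in self.index:
            return None
        off, length = self.index[key]
        self.f.seek(off)
        return self.f.read(length)

path = os.path.join(tempfile.mkdtemp(), "kv.log")
db = LogKV(path)
db.put(b"a", b"hello")
db.put(b"b", b"world")
db.put(b"a", b"HELLO")  # overwrite: the index points at the newest copy
```

A Merkle tree over the log would add verifiability, but it doesn't remove the need for this kind of offset index if you want fast point reads.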

[+] dboreham|2 years ago|reply
Writing to append-only files is a terrible idea if you want to query quickly.

(yes it's fashionable, but it's still terrible for random read performance)

[+] znpy|2 years ago|reply
I once attended a presentation by a presales engineer from Aerospike, and IIRC they're doing some NVMe-in-userspace stuff.