
Samsung Announces Key-Value SSD Prototype

441 points | nikhizzle | 6 years ago | anandtech.com | reply

237 comments

[+] skissane|6 years ago|reply
IBM mainframes had key-indexed hard disks back in the 1960s – CKD (and later ECKD). Each disk sector could have a key field at the start, and rather than addressing disk sectors by physical location, you could tell the hard disk "give me the data of sector with key 1234", and it would go search for that sector and return it to you. (I think, you still had to tell it what disk track to search on, and it just did a sequential scan of the disk track...)

A lot more primitive than this, of course, but funny how old ideas eventually become new again.
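To make the old mechanism concrete, here's a toy Python sketch of a CKD-style "search key equal" over one track -- the record layout and function name are illustrative, not the real channel-program interface:

```python
# Toy simulation of CKD "search key equal": the controller scans one track
# sequentially and returns the data of the record whose key field matches.
# The host names the track; the device does the scan.

def search_key_equal(track, key):
    """Scan a track (a list of (key, data) records) for a matching key."""
    for rec_key, data in track:
        if rec_key == key:
            return data
    return None  # key not found on this track

track = [(1001, b"alpha"), (1234, b"bravo"), (2002, b"charlie")]
print(search_key_equal(track, 1234))  # b'bravo'
```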

[+] kev009|6 years ago|reply
It's also interesting that nobody implements CKD on the storage devices themselves. That complexity runs on a massive POWER AIX box in IBM's storage systems, or on a similar commodity powerhouse running UNIX or an RTOS in comparable arrays of the past couple of decades.

You have a comparatively weak CPU (think ARM Cortex-M series) on an SSD, and drive companies have been notoriously bad at firmware development.

Frankly, I don't trust Samsung to implement a file system. I prefer to not even use their enterprise flash based on past trauma, but those are usually harder fails than the subtle fuckups they can pull off in an opaque FS.

[+] bklaasen|6 years ago|reply
In "Moving targets: Elliott-Automation and the dawn of the computer age in Britain" Simon Lavington has documented[1] a computer built by Elliott Brothers (later Elliott Automation, then subsumed into ICL) named OEDIPUS that used content-addressable hardware storage. It went into production at GCHQ in 1954.

The computer also gets a mention in his 2011 book "Alan Turing and his contemporaries".

http://news.bbc.co.uk/2/hi/technology/8490464.stm

[+] vilaca|6 years ago|reply
Thank you for sharing that. Reading material for tonight ;) I googled ECKD and it seems it is still in use and supported by IBM (maybe through emulation).
[+] dfox|6 years ago|reply
This involves simply exposing how the spinning rust drive works internally to software.

On most magnetic drives there is no inherent relationship between the sector number and its physical location on the track; the drive simply waits for the sector with the correct number in its header to come under the head.

[+] orf|6 years ago|reply
> but funny how old ideas eventually become new again.

I dislike this meme - things are generational, not cyclical.

[+] ocdtrekkie|6 years ago|reply
This continues the trend we see in processors, that in a post-Moore's law environment, rather than trying to push physical limits for performance improvements, we're branching out into hardware optimized for specific purposes. Very neat to see it on the storage side, and something I don't think anyone could fathom back in the spinning disk days.
[+] umvi|6 years ago|reply
Maybe slightly off-topic, but I've always struggled to understand the appeal of key-value stores like Redis and DynamoDB. I tried using one once as a substitute for shared memory in an embedded device, but you lose a lot of information (like type), and it seems you can't represent or query complex data structures like nested structs without serializing them to some data format first and then storing that in the store (but at that point it starts getting slow/complex, you can't easily examine what's in the database, and so it's easier to keep using shared memory).

But clearly I'm missing something because everybody loves them and uses them. Is this a technology I would only use when building websites at scale or something?

[+] oh-4-fucks-sake|6 years ago|reply
I was in your camp for a long time until it clicked while attending an AWS summit presentation by the head DynamoDB evangelist/wizard Rick Houlihan.

He said (paraphrasing) "a lot of people repeat this mantra that DynamoDB (and by extension, NoSQL in general) is flexible--which couldn't be further from the truth--it's not flexible, but it's extremely efficient."

He went on to elaborate that relational DBs are not going away, and if 1) your data will never be huge and 2) you just can't predict the query patterns in the future of your app--a relational DB is still the way to go.

However, if your data is huge, and you can spend the time up front carefully planning the queries you'll need to support, and you can afford the risk of future migration efforts if your query needs outgrow your model--then DynamoDB is a slam dunk.

If you fall into that category, then you'll need to spend the time learning the "advanced patterns" of DynamoDB which include key overloading, adjacency-list pattern, materialized graph pattern, and hierarchical key pattern--and then compose those patterns into a custom "schema" (in the parlance of relational DBs).
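As a rough illustration of key overloading (the table layout and attribute names here are made up, and a plain list stands in for DynamoDB): one partition key holds an entity id while the sort key is overloaded to carry metadata, related orders, and so on, so one Query serves several access patterns.

```python
# Illustrative single-table items using key overloading. PK identifies the
# entity; SK is overloaded (profile record, order records) so that one
# partition serves multiple access patterns without a join.
items = [
    {"PK": "USER#42", "SK": "PROFILE",        "name": "Ada"},
    {"PK": "USER#42", "SK": "ORDER#2019-001", "total": 30},
    {"PK": "USER#42", "SK": "ORDER#2019-002", "total": 12},
]

def query(pk, sk_prefix=""):
    """Mimic a DynamoDB Query: exact PK match, begins_with on SK."""
    return [i for i in items
            if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

orders = query("USER#42", sk_prefix="ORDER#")
print(len(orders))  # 2 -- all of a user's orders in one request
```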

I'm building an app right now, and it took me about 30 hours to enumerate the 40-odd queries I'll need to support, across some 10 overhauls of my design. But boy, was it worth it, because the DB is going to easily be the cheapest m'fu* component of my app. Compare that to (taking an extreme alternative) paying for something like MS SQL Server, which is a colossal money sink. Even compared to open source like Postgres, my setup here will probably be < 1/10th the cost, and if, for whatever reason, I see a drop in traffic or need to shut down for a period, my DB isn't metering for compute--only storage.

In a drawn-out answer to your question of "would (I) only use (this) when building websites at scale or something?" the answer is generally "yep, probably".

[+] hinkley|6 years ago|reply
I'm not a big fan of KV stores either, but I accept that they exist because there's an origin story for these things that makes a lot of sense. External storage of shared state fixes problems for shared-nothing and shared-everything programming environments.

In Java it gives teeth to the "write once, read many" pattern. By avoiding side effects you can scale your team larger. But it's a constant drain on resources to police this convention. There's always someone who thinks their reason to violate the rules is valid. Pushing the data out of process increases the cost of modifying data that should be read-only, and violations advertise loudly. You can't achieve it by obfuscation, intentional or otherwise.

Practically though, at the time KV stores were coming into existence, you could buy hardware with so much memory that Java couldn't keep up. They hit a GC wall that was causing serious problems. And running multiprocess was just anathema. If you push all of the long-lived objects out to the KV store not only does memory drop like a rock but what's left over is 'young' objects, and GC often optimizes for short-lived objects. In a way it becomes a self-fulfilling prophecy. By making in-process caching expensive they made it unwelcome.

In this same timeframe, latency on network cards became lower than latency to your drive controller. Reading something out of memory on another machine in the same data center was faster than getting it off of your own disk.

Meanwhile, in an environment where all tasks are pretty isolated from each other, in-process caching is ungainly. Either you suffer with a low hit ratio, you do some sort of bizarre traffic shaping, or you do caching at the ingress point so you don't make the requests at all. But that informs your engineering priorities to an extent most people don't like. I see it as one of those "uncomfortable" things that Continuous Delivery advises you to face head-on. Cache invalidation is hard, yes. Wear a helmet.

So the shared-nothing people also liked these tools, which means you have two groups that probably normally wouldn't talk to each other pulling for the same team.

[+] 013a|6 years ago|reply
There's a very easy-to-understand use case in storing data where you don't really care about the type or querying capability. Imagine you've got an API request coming in; your server generates the response, and then you cache that response in a TTL'd KV store where the key is some deterministic hash of the appropriate information from the request (method, path, body, requestor, etc.). Future requests just use the cache for a bit.

Here, we don't care what gets stored there, because our API server isn't really "using" the data that gets returned; we just throw it back at the client. We also don't care about complex querying. That's where KV stores shine.

It's just a different pattern for different use cases, at least in DB land. If you need complex querying, then KV probably isn't for you. But if you need ridiculously fast lookups and perfect horizontal scalability, then you could equally say that SQL isn't for you, and KV is.
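A minimal sketch of that caching pattern, with a plain dict standing in for the KV store (the key recipe and TTL are arbitrary choices):

```python
import hashlib
import json
import time

# TTL'd response cache: key = deterministic hash of the request parts,
# value = opaque response bytes. A dict stands in for the KV store.
cache = {}   # key -> (expires_at, response_bytes)
TTL = 60.0

def cache_key(method, path, body, requestor):
    """Derive a stable key from the relevant request fields."""
    blob = json.dumps([method, path, body, requestor], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def get_cached(key):
    hit = cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]        # serve the cached bytes as-is
    return None              # miss or expired

def put_cached(key, response):
    cache[key] = (time.time() + TTL, response)

k = cache_key("GET", "/api/items", "", "user-1")
put_cached(k, b'{"items": []}')
print(get_cached(k))  # b'{"items": []}'
```

The server never interprets the cached value; it just throws it back at the client, which is exactly the case where a typeless KV store is a perfect fit.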

[+] nickdandakis|6 years ago|reply
Key-value stores satisfy a set of use-cases, as do NoSQL, and SQL.

Key-value is great when you have a key-value pair that you'd like to cache. Specifically in web workloads, maybe you'd like to cache the API response (value) to a certain API query (key) such that it's returned without processing power from your backend. Or maybe user permissions per action, where they don't really change as they're reliant on the user's role.

I'm not entirely sure what a valid use-case would be for an embedded device, but if there's any kind of SQL query you would make on an embedded device that returns a response that hardly changes, you might also spin up some kind of key-value store such that it's returned without (well, faster than) SQL query time.

It probably doesn't make sense to completely replace a SQL database if you're trying to represent relational data.

[+] blaisio|6 years ago|reply
DynamoDB is useful for services that are likely to have a very large amount of load and require minimal maintenance. For example, Mozilla uses DynamoDB to back its browser push notification sending service. It would be better if it supported SQL, like Google Spanner, but it's still useful.

Redis is useful because it can do so goddamn much. Caching, task queuing, hyperloglog, geospatial indexing. Set it up once, and it can replace a very large number of cloud services. And it has a lot of options for replication, backups, clustering, etc. It is like a Swiss army knife.

[+] JohnBooty|6 years ago|reply
You're not wrong about anything.

   Is this a technology I would only use when building websites at scale or something? 
Basically, though I'd amend that to read "websites that might need to scale at some point."

For a lot of projects, shared memory is a perfect choice. But if you might need to scale up to multiple instances at some point, shared memory obviously won't do. And since it's not like there's a huge penalty for using Redis, a lot of times it's just convenient to use it from the beginning.

[+] aledalgrande|6 years ago|reply
I find Redis is very useful when you need fast, temporary access to memory in a distributed system. I have used it several times as a distributed lock, or as storage for real-time computations in a distributed system that only use primitive values, for example aggregating page views into different clusters (hashes or sets in Redis) and then saving the result to DB.
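For illustration, here's the shape of that lock protocol with an in-memory mock of Redis's SET-with-NX-and-EX semantics -- real code would call something like redis-py's r.set(name, token, nx=True, ex=ttl) and release with a compare-and-delete:

```python
import time

# In-memory mock of the "SET key token NX EX ttl" locking pattern.
store = {}   # key -> (holder_token, expires_at)

def acquire(key, token, ttl):
    """Take the lock if it's free or its previous holder expired."""
    now = time.time()
    cur = store.get(key)
    if cur is None or cur[1] <= now:
        store[key] = (token, now + ttl)
        return True
    return False

def release(key, token):
    """Release only if we still hold it (compare-and-delete)."""
    cur = store.get(key)
    if cur and cur[0] == token:
        del store[key]
        return True
    return False

assert acquire("lock:pageviews", "worker-A", ttl=5)
assert not acquire("lock:pageviews", "worker-B", ttl=5)  # A holds it
release("lock:pageviews", "worker-A")
assert acquire("lock:pageviews", "worker-B", ttl=5)
```

Note the compare-and-delete on release: deleting unconditionally would let a worker whose lock already expired clobber the next holder's lock.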
[+] hmottestad|6 years ago|reply
The article mentions RocksDB, which is a great backend used in other databases. It's key value, but it can be used to build other databases, for instance graph databases.

Key-value can also be just part of a database. Like having a key-value index.

[+] henryfjordan|6 years ago|reply
For making redis a little more usable beyond pure key-value operations take a look at their tutorial on "Secondary Indexing": https://redis.io/topics/indexes
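A toy version of the sorted-set index pattern from that tutorial, simulated here with a sorted Python list (the key names are invented): the primary record lives under its own key, and a separate sorted structure indexes records by a score so range queries become possible.

```python
import bisect

# Primary records, keyed like "user:<id>".
users = {"user:1": {"name": "Ann", "age": 31},
         "user:2": {"name": "Bob", "age": 25},
         "user:3": {"name": "Cyd", "age": 40}}

# Secondary index: (score, member) pairs kept sorted -- the invariant a
# Redis sorted set (ZADD) maintains for you.
age_index = sorted((u["age"], k) for k, u in users.items())

def users_by_age(lo, hi):
    """Mimic ZRANGEBYSCORE age_index lo hi."""
    i = bisect.bisect_left(age_index, (lo, ""))
    j = bisect.bisect_right(age_index, (hi, "\uffff"))
    return [member for _, member in age_index[i:j]]

print(users_by_age(25, 35))  # ['user:2', 'user:1']
```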

If you care about typing and complex querying though, you might be better off using SQLite or similar.

[+] adventured|6 years ago|reply
Stack Exchange as one example aggressively uses Redis as a caching layer for practically its entire service.

Redis is very fast, dependable, and relatively easy to use. The alternative, other than a competing product, is pretty much that you have to write your own custom replacement for Redis in some manner. That might be fine at a small scale for fun or in an experimental product that isn't meant to go into production; otherwise you want something well proven, that many other engineers you can hire are likely to have experience with (it's hard to over-emphasize that last point).

https://meta.stackexchange.com/questions/69164/does-stack-ex...

[+] jarfil|6 years ago|reply
A filesystem is a KV store; if you can solve a problem by storing data in files, then you can use a faster and more scalable KV solution.

I think it makes more sense to think of KV as a minimalistic distributed filesystem, rather than as a limited kind of memory or a stripped SQL database.
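A few lines make the equivalence concrete -- a directory of files already behaves like a get/put store (this sketch assumes filename-safe keys; a real wrapper would encode them):

```python
import os
import tempfile

# A filesystem used through a KV interface: key -> file, value -> contents.
class FileKV:
    def __init__(self, root):
        self.root = root

    def _path(self, key):
        return os.path.join(self.root, key)  # assumes filename-safe keys

    def put(self, key, value: bytes):
        with open(self._path(key), "wb") as f:
            f.write(value)

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

kv = FileKV(tempfile.mkdtemp())
kv.put("greeting", b"hello")
print(kv.get("greeting"))  # b'hello'
```

A KV SSD offers the same get/put contract with the filesystem and block layers cut out of the path.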

[+] markdoubleyou|6 years ago|reply
Redis lets you store and manipulate some basic data structures like linked lists, maps, and sets -- so a value in Redis doesn't necessarily have to be a big, serialized blob (though I bet that a lot of people use it this way, as if it was a drop-in replacement for Memcached).

https://redis.io/topics/data-types-intro

[+] notyourday|6 years ago|reply
Redis is not a key:value store. That's memcached. Redis is a data structure store with native operations that can be performed on those data structures.
[+] cjblomqvist|6 years ago|reply
In addition to the many good answers, it's also used as a backend to SQL databases (like RocksDB which is mentioned in the article).
[+] _pmf_|6 years ago|reply
> but you lose a lot of information (like type)

"We have perfect support for both kinds of type: strings and JSON blobs."

[+] cryptozeus|6 years ago|reply
How about for storing a centralized cache? Redis has been very helpful for us in that.
[+] pkulak|6 years ago|reply
It's great when you need to share memory between pods, containers, machines, etc.
[+] mr_toad|6 years ago|reply
Values can be references, and you can query data structures with traversal.
[+] elevenbits|6 years ago|reply
Would be interesting if this evolves into a full filesystem implementation in hardware (they talk about Object Drive but aren't focused on that yet). Some interesting future possibilities:

- A cross-platform filesystem that you could read/write from Windows, macOS, Linux, iOS, Android etc. Imagine having a single disk that could boot any computer operating system without having to manage partitions and boot records!

- Significantly improved filesystem performance as it's implemented in hardware.

- Better guarantees of write flushing (as SSD can include RAM + tiny battery) that translate into higher level filesystem objects. You could say, writeFile(key, data, flush_full, completion) and receive a callback when the file is on disk. All independent of the OS or kernel version you're running on.

- Native async support is a huge win
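Something like the following hypothetical shape -- write_file, the flush semantics, and the completion signature are all invented for illustration, not part of any announced API:

```python
import threading

# Hypothetical async durable write: the completion callback fires only
# after the (mocked) device reports the data flushed to flash.
def write_file(store, key, data, completion):
    def worker():
        store[key] = data            # stand-in for "data is on flash"
        completion(key, len(data))   # durability acknowledged
    t = threading.Thread(target=worker)
    t.start()
    return t

done = threading.Event()
results = {}

def on_durable(key, nbytes):
    results[key] = nbytes
    done.set()

store = {}
write_file(store, "report.txt", b"hello", on_durable)
done.wait(timeout=5)
print(results)  # {'report.txt': 5}
```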

Already the performance is looking insane. Would love to get away from the OS dictating filesystem choice and performance.

[+] jasonhansel|6 years ago|reply
This is huge. It's been obvious for a long time that the standard block-device abstraction isn't a great fit for SSDs. This development finally gives us a better abstraction that will immensely improve performance of a wide variety of applications (possibly even SQL databases).
[+] BubRoss|6 years ago|reply
Why isn't a block device a good fit for an SSD?
[+] geophile|6 years ago|reply
I don’t understand the comments saying that this obsoletes databases, or that this is a good substrate for relational databases.

The interface is for random access. Quite a few database optimizations depend on sequential access, i.e. accessing records in key order following a random access. This is why B-trees are so important. Sequential access in key order does not appear to be a possibility with this technology.
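A small contrast makes the point: with only get(key), a range scan needs an ordered index that something has to maintain -- which is exactly what a B-tree gives a database (Python stand-ins below):

```python
import bisect

# Hash-style KV store: great for point lookups, no key ordering.
kv = {"user:1003": "c", "user:1001": "a", "user:1002": "b"}

# Point lookup: fine on a pure KV interface.
assert kv["user:1002"] == "b"

# Range scan "user:1001".."user:1002": with only get(key) you must
# either know every key in advance or maintain an ordered index
# yourself -- the part a B-tree maintains for the database.
ordered = sorted(kv)
lo = bisect.bisect_left(ordered, "user:1001")
hi = bisect.bisect_right(ordered, "user:1002")
print(ordered[lo:hi])  # ['user:1001', 'user:1002']
```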

[+] jandrewrogers|6 years ago|reply
These have been built many times, and I have even been involved in the design of one (that was ultimately scrapped). The value proposition of these products is not what many people seem to be assuming, so I will elaborate. Under the hood, most implementations are just mildly modified LevelDB/RocksDB/etc running on an ARM processor.

In theory, there should be no performance advantage to embedding the storage engine this way -- but in practice there is; back to that in a moment. A properly optimized storage engine will run just as fast on the host CPU, and for applications like databases there are performance advantages to keeping it there. If you are designing a state-of-the-art storage engine for complex, high-performance storage work, these devices are not for you; you can always do better with bare storage.

The key phrase is "properly optimized". The I/O scheduling and management underlying a typical popular KV storage engine is actually quite far from properly optimized for modern storage hardware, with significantly adverse consequences for performance and durability. The extent to which this is true is significantly underestimated by many developers. From the perspective of the storage manufacturers, more and more of the theoretical performance of their product is being wasted because the most popular open-source storage engines are architecturally incapable of taking advantage of it. The kind of architectural surgery required to address this is correctly seen as something you can't upstream to the main open-source code base.

The software in these devices is typically an open source storage engine where they ripped out the I/O scheduler, storage management, and whatnot, replacing it with one properly optimized to take advantage of the hardware. This could be done in software but storage companies aren't in that business. Their hope is that people will use these devices instead of LevelDB etc, with the promise of superior performance that justifies higher cost.

In practice, these devices never seem to do well in the market. People that are using the KV stores these are intended to replace are the kind of people that do not have particularly performance-sensitive applications, and therefore won't pay a premium. And it adds no value, and has some significant disadvantages, for companies with serious storage engine implementation chops or software storage-engines that are well-optimized for this kind of hardware.

tl;dr: These are like in-memory databases. A simple way to improve the performance of applications instead of investing in hardcore software design and implementation but providing no other value.

[+] tenebrisalietum|6 years ago|reply
What's the advantage of this over a simple hardware interface (which should be really simple, since I thought NVMe was basically just a PCIe node) that directly exposes the flash and lets the application/filesystem/whatever layer handle it?
[+] chadash|6 years ago|reply
Can someone tell me what the use case for this would be?
[+] ahupp|6 years ago|reply
This is a cool idea, and I'm excited to see what kind of perf wins it produces. It seems like the other solution would be to expose all the details of the SSD to the OS: wear leveling, GC etc, and make that a driver concern. Probably a lot harder to get right, but more debuggable and more opportunities to tune for your specific workload.
[+] wtallis|6 years ago|reply
That's what the Open Channel SSD concept is. It's been getting a lot of attention in the past few years, but it seems like many potential users still balk at the idea of having that thin of an abstraction layer. The open-channel stuff has been influencing the addition of other new features to the NVMe protocol that expose most of the information an open-channel SSD would, but don't break compatibility by requiring the host to use that information.

Several vendors are also supporting the Zoned Storage concept of making SSDs that have similar IO constraints to shingled magnetic recording (SMR) hard drives. Those constraints aren't a perfect match for the true characteristics of NAND flash memory, but it does handle the problem of large erase blocks.

[+] bvinc|6 years ago|reply
I understand how it can achieve better performance by bypassing the file system.

But I'd like to compare this to hypothetical key-value software that stores its data directly on a partition (instead of in files). Isn't this essentially the same thing? The only differences I can see are that the software would be much harder to update, and that you can offload some work onto the processor on the drive.

Am I looking at this correctly? I don't get why you would want this to be a hardware device.

[+] myself248|6 years ago|reply
As someone on the periphery of comp-sci my whole life, I find myself wondering, is this similar in concept to content-addressable memory?
[+] goldenkey|6 years ago|reply
No, because we are addressing it by key.
[+] int0x80|6 years ago|reply
How is this exposed to userspace? What interface?
[+] mpsys|6 years ago|reply
It is accessible directly to applications through the SNIA KVS API (and Samsung has its own API as well). There is no filesystem in the middle if you are using the KV controller.
[+] perspective1|6 years ago|reply
On the one hand I think this is very cool. On the other, I think it's funny that Redis can fit in my RAM and not on my SSD.
[+] lucas_membrane|6 years ago|reply
But who will be the first to use a Key-Value SSD to implement a file system?
[+] DonHopkins|6 years ago|reply
How flexible are the limits on key and data size? I imagine a lot of apps would need more than just 255 byte keys and 2MB values. Is there an efficient way to virtually increase the value size, at least?
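One common workaround, sketched here with a dict and an invented key scheme, is to chunk large values under derived keys plus a small manifest:

```python
# Split oversized values into chunks stored under derived keys
# ("<key>#<n>"), with the base key holding a manifest (chunk count).
# The 2 MB cap and the key scheme are illustrative assumptions.
CHUNK = 2 * 1024 * 1024

def put_large(store, key, value: bytes):
    chunks = [value[i:i + CHUNK] for i in range(0, len(value), CHUNK)] or [b""]
    for n, c in enumerate(chunks):
        store[f"{key}#{n}"] = c
    store[key] = str(len(chunks)).encode()   # manifest: how many chunks

def get_large(store, key):
    n = int(store[key])
    return b"".join(store[f"{key}#{i}"] for i in range(n))

store = {}
big = b"x" * (3 * 1024 * 1024)   # 3 MB, over the (assumed) cap
put_large(store, "blob", big)
assert get_large(store, "blob") == big
```

The cost is that a "large value" read/write is no longer a single atomic device operation, which is exactly the kind of consistency work a filesystem or database normally does for you.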
[+] Androider|6 years ago|reply
How would you do a backup of this KV store if it's not exposed as a filesystem or block device? You wouldn't (it's for ephemeral caching?), or you'd just walk the entire keyspace?
[+] zzzcpan|6 years ago|reply
There are iterators in the API I believe.
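Assuming some iterator API exists, a backup would indeed be a full keyspace walk -- sketched here with a dict standing in for the device and an invented callable in place of the real iterator interface:

```python
# Keyspace-walk backup: stream every (key, value) pair off the device.
def backup(kv_iterate, out):
    """kv_iterate() yields (key, value) pairs; out collects the dump."""
    count = 0
    for key, value in kv_iterate():
        out.append((key, value))   # real code would write to other media
        count += 1
    return count

device = {"a": b"1", "b": b"2", "c": b"3"}
dump = []
n = backup(lambda: device.items(), dump)
print(n)  # 3
```

Note there's no snapshot isolation in a sketch like this: keys written during the walk may or may not appear in the dump.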