item 3622059

HyperDex: A Searchable Distributed Key-Value Store

139 points | lpgauth | 14 years ago | hyperdex.org

88 comments

[+] ericmoritz | 14 years ago
From the FAQ <http://hyperdex.org/faq/>:

"So, the CAP Theorem says that you can only have one of C, A, and P. Which are you sacrificing?

HyperDex is designed to operate within a single datacenter. The CAP Theorem holds only for asynchronous environments, and well-administered datacenters enable us to sidestep this tradeoff entirely."

I'd like to see how they pull that off when a node goes down. I guess in "well-administered" data centers, nodes don't go down.

Sounds like they're sacrificing "A" to me because they're doing synchronous replication.

[+] snewman | 14 years ago
It's a bit of a middle ground. Yes, the replication is synchronous, which impacts availability. However, the master can remove a failed replica from the chain fairly quickly. In principle, with proper tuning, a node failure would merely cause a brief hiccup. This would feel more like a period of increased latency than a full-blown outage. So there really needn't be much sacrifice of availability.

However, there's also a sacrifice of partition tolerance. If the master is unable to communicate with any replica, the system can't serve requests. Also, the master is implemented as a collection of Paxos nodes; if these nodes are partitioned from one another, the entire system would grind to a halt.

Since this is intended for intra-datacenter use, one could argue that a full network partition might be unlikely. (Depending on what sort of data center you hang out in.) But in CAP terms, it's possible, of course.

(I base all this on the value-dependent chaining paper cited below.)

[+] rescrv | 14 years ago
HyperDex uses value-dependent chaining, which offers fault tolerance properties similar to those provided by chain replication (http://www.cs.cornell.edu/home/rvr/papers/osdi04.pdf).

A single node failure will be recovered from quickly without issue. Multiple concurrent failures are handled the same as the single-failure case, so long as our failure assumptions are not violated (a violation would be, e.g., every node in the datacenter failing simultaneously).
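
For context, value-dependent chaining builds on chain replication (the OSDI '04 paper linked above). A toy sketch of the basic chain, with hypothetical names rather than HyperDex's actual code:

```python
# Minimal sketch of chain replication (van Renesse & Schneider, OSDI '04).
# All names are illustrative; this is not HyperDex's implementation.

class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None  # successor in the chain

    def write(self, key, value):
        # Writes enter at the head and propagate toward the tail; a
        # write is durable once every node in the chain has applied it.
        self.store[key] = value
        if self.next is not None:
            self.next.write(key, value)

    def read(self, key):
        # Reads are served by the tail, which only ever holds fully
        # replicated values, so reads are strongly consistent.
        return self.store.get(key)

# Build a chain of three replicas: head -> mid -> tail.
head, mid, tail = ChainNode("head"), ChainNode("mid"), ChainNode("tail")
head.next, mid.next = mid, tail

head.write("k", "v1")
assert tail.read("k") == "v1"

# If mid fails, the master splices it out (head.next = tail) and the
# chain keeps serving requests after a brief hiccup.
```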

[+] tlb | 14 years ago
When a node dies, the master reconfigures all the servers and clients with a new topology excluding the failed node. "Operations which are interrupted by reconfiguration exhibit at-most-once semantics." So while the system is reconfiguring after a node failure, updates can be lost.

Time windows of "at-most-once semantics" mean the system has none of C, A, or P. Which doesn't mean it's not a good database for many purposes.

[+] reitblatt | 14 years ago
> I'd like to see how they pull that off when a node goes down. I guess in "well-administered" data centers, nodes don't go down.

They offer f-fault tolerance. They can have f nodes go down in a single "zone" and keep chugging as long as no more nodes in the same zone go down before the master reconfigures. Note that the f faults are per-zone, not per-system, so in fact many more than f nodes can be down in a single system without a problem.

But, more importantly, you seem to be confusing partition tolerance and fault tolerance. CAP is about partition tolerance: offering "CA" in the presence of arbitrary partitions. They offer a specific form of fault tolerance: "CA" in the presence of any failure or partition that affects less than f nodes.
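
That per-zone accounting can be stated as a one-line predicate; a toy illustration (the value of f and the failure counts below are made up):

```python
# Toy illustration of per-zone f-fault tolerance: the system keeps
# running as long as no single zone loses more than f nodes before the
# master reconfigures. f and the failure counts are invented numbers.

f = 2  # tolerated concurrent failures per zone

def survives(failures_per_zone, f):
    return all(failed <= f for failed in failures_per_zone)

# Four zones, f = 2: eight total failures are fine if spread out...
assert survives([2, 2, 2, 2], f)
# ...while three failures concentrated in one zone exceed f.
assert not survives([3, 0, 0, 0], f)
```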

[+] lobster_johnson | 14 years ago
> "So, the CAP Theorem says that you can only have one of C, A, and P. Which are you sacrificing?

Which is wrong. You can only have two of the three at the same time: CA, CP or AP.

If they get something as fundamental as this wrong, you have to wonder about the rest of the project.

[+] jandrewrogers | 14 years ago
This is a pretty nice implementation of an old design concept that has merit for modern distributed problems. It is worth studying as a model for distributed systems.

I am aware of a couple commercial distributed systems that use it (not this implementation but the underlying algorithm). Several organizations seem to have reinvented it over the last five years.

Organizing data in this way was studied in the 1980s but was poorly suited to the computing systems of the time. Relatively little was written about it because it was viewed as a dead end, and modern literature has all but forgotten about it. Most of the research was done by companies rather than academics. Designs like this have fallen into the black hole of "if it is not on the Internet then it doesn't exist". Back when I was studying these models I found more crusty old patents related to these types of models than relevant papers on the Internet. It will be valuable to have some modern literature pertaining to these designs.

[+] ypcx | 14 years ago
I'm looking for a good paper or book detailing patterns or strategies for storing complex data structures and relations, or designing complex schemas, on top of key/value stores, ideally with real-world use cases / success stories. A bonus would be the same on top of key/value/range(search) stores, like HyperDex. Any suggestions, HN?

Technically, it should even be possible to implement SQL on top of a store like HyperDex (there's the Tenzing paper from Google). But I'm not looking specifically for SQL; I'm interested in the more general problem.
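
Short of a book recommendation, the basic pattern most such designs reduce to is maintaining your own index entries alongside the primary data. A hedged sketch, with a plain dict standing in for the store:

```python
# The "manual secondary index" pattern on a plain K/V store: every
# attribute you want to query by gets its own index entries. A dict
# stands in for the store; a real system must update both atomically.

store = {}

def put_user(user_id, user):
    store[f"user:{user_id}"] = user
    # Index entry: attribute value -> set of primary keys.
    store.setdefault(f"idx:city:{user['city']}", set()).add(user_id)

def users_in_city(city):
    return [store[f"user:{uid}"] for uid in store.get(f"idx:city:{city}", set())]

put_user(1, {"name": "ada", "city": "nyc"})
put_user(2, {"name": "bob", "city": "sf"})
assert [u["name"] for u in users_in_city("nyc")] == ["ada"]
```

Stores like HyperDex move this bookkeeping server-side, which is exactly what makes their search support interesting.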

[+] jdf | 14 years ago
It looks like they trade the ability to scan ranges of keys for the ability to get single objects via multiple attributes. The value-dependent stuff is also a neat way of solving the consistency issues with multiple node updates. Interesting stuff.

If you were to swap this in for your Cassandra cluster, you'd be losing multi-datacenter replication. Although partitions within a data center are pretty rare, you'd lose some availability there as well. However, Cassandra is usually hash-partitioned, so it needs to do a broadcast for a scan (AFAICT, it even needs to do a broadcast for a lookup on a single secondary attribute), so you'd probably gain quite a bit of performance with HyperDex.
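
As I understand the paper, that performance gain comes from hyperspace hashing: each attribute hashes to its own coordinate, so a search that fixes one secondary attribute contacts only a slice of the servers rather than broadcasting. A toy sketch (the grid size and attributes are invented, not HyperDex's parameters):

```python
# Toy hyperspace hashing: an object's placement is a tuple of per-
# attribute hashes, so a search fixing some attributes narrows the set
# of regions to contact instead of broadcasting. Illustrative only.

GRID = 4  # each dimension is divided into GRID buckets

def coord(attrs, names):
    # Map each named attribute to a bucket along its own axis.
    return tuple(hash(attrs[n]) % GRID for n in names)

names = ("first", "last")
obj = {"first": "John", "last": "Smith"}
home = coord(obj, names)  # the one region that stores this object

# A search on last name alone matches every bucket on the "first" axis
# but a single bucket on the "last" axis: GRID regions, not GRID**2.
candidates = [(f, hash("Smith") % GRID) for f in range(GRID)]
assert home in candidates
```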

I can't tell if it's possible to dynamically change the set of secondary attributes being indexed without rebuilding the entire data set. Or how value-chaining works with missing attributes.

Also, apparently consistency has some... gaps... when you search via a secondary attribute:

"The searches are not strongly consistent with concurrently modified objects because there is a small window of time during which a client may observe inconsistency."

[+] jfager | 14 years ago
I'm reading through the paper, and I'm curious if I understand the implications of key subspacing and value dependent chaining correctly - do all reads and writes for a given key get forced to a single node? I understand how the replication that's described allows for failover when the point-leader fails, but does it also allow for scaling writes and key lookups?
[+] rescrv | 14 years ago
Reads and writes for a given key always go to a single node. As you add more machines with the same workload, you are handling fewer keys per machine. The flip side is, if the number of keys per machine stays constant, more machines directly translates into the ability to handle more keys.

HyperDex scales linearly as you can see in our scalability graph.

[+] NOD507 | 14 years ago
How does it compare to elasticsearch?
[+] k-a-r | 14 years ago
Elasticsearch is a full-text search solution, not a K/V database.
[+] xxqs | 14 years ago
Did you compare the performance with (local) BerkeleyDB?

Also, is there a locking mechanism?

I'm the author of Torrus (torrus.org), and BerkeleyDB's stability and non-networked nature are quite painful. But I rely on its speed, concurrent locking, and ways to acquire an exclusive lock on a table. It would be interesting to offer an alternative backend for Torrus.

[+] rescrv | 14 years ago
I'm not sure exactly what you're looking for, but Berkeley DB is embedded in your application on a single host, while HyperDex is distributed. It sounds like a distributed store is exactly what you want.

What do you use exclusive locks for?

[+] simonw | 14 years ago
Is it open source? I browsed around the site and FAQ and didn't spot a mention of the license.
[+] rescrv | 14 years ago
Yes! We've released HyperDex under the 3-clause BSD license.

You can get binaries and source from the downloads page (http://hyperdex.org/download/).

[+] DEinspanjer | 14 years ago
I think maybe only certain versions of Ubuntu are supported? It won't install because I don't have a repo that provides libgoogle-glog0.
[+] rescrv | 14 years ago
The packages are built for Ubuntu 11.10. What version of Ubuntu are you running?
[+] PanMan | 14 years ago
Since they compare it with redis, I wonder if this can handle data bigger than memory. The other properties seem nice :)
[+] rescrv | 14 years ago
We compared it to Redis because it is one of the many key-value/document-database systems with the high-throughput, low-latency properties that make it comparable to HyperDex. We are expanding the set of systems we benchmark against, aiming to cover as many as possible.

HyperDex is designed for bigger-than-memory data.

[+] jcapote | 14 years ago
Is this better than hadoop?
[+] xxqs | 14 years ago
The FAQ mentions that you target the x86_64 architecture. Does the server work on i386?

Also, I hope it does proper memory alignment and endianness independence? (Because Mongo sucks at that.)

[+] rescrv | 14 years ago
We use many lock-free data structures which rely heavily upon the x86_64 architecture. Further, the expanded virtual address space enables us to mmap everything.

All network traffic is packed, and in network byte order.
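
For the curious, "network byte order" means big-endian on the wire regardless of host endianness. In Python terms (HyperDex itself is C++; this only illustrates the property, with made-up field values):

```python
import struct

# Network byte order ("!") is big-endian regardless of the host CPU,
# so a little-endian x86_64 sender and any receiver agree on the wire
# format. The uint32/uint16 values here are arbitrary examples.
payload = struct.pack("!IH", 0xDEADBEEF, 42)

assert payload == b"\xde\xad\xbe\xef\x00\x2a"
assert struct.unpack("!IH", payload) == (0xDEADBEEF, 42)
```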

[+] nirvana | 14 years ago
If you're building a distributed KV store, you should benchmark against other distributed KV stores. Mongo and Cassandra aren't really. But Riak is.

Plus, since its "distributed" here's the benchmark I'd like to see:

1. Set up a cluster of 8 nodes. Set data replication to 3.
2. Load 3TB of data into the cluster, across 1M documents (or some data set of that order).
3. Run your tests. Optimize each of the DBs for the best way to access them (e.g., link walking vs. map reduce on Riak if that's faster, or secondary indexes if that's faster, or Riak Search if that's faster; there are many ways to search Riak).
4. Throw out the results of #3.
5. Pull the plug on 2 nodes. E.g.: shut them down completely, no longer on the network at all. Poof, gone. Pick the two nodes by rolling the dice. If you have SPFs, and a bad roll of the dice would have brought the whole cluster down, remove the word "distributed" from your marketing.
6. Let the cluster sit for 30 minutes.
7. Run your benchmarks.

The benchmarks developed in #7 are the ones I want to see.

I have no clue how HyperDex would perform in this situation. It could kick Riak's butt (but then, small-cluster performance is just one of the criteria that matter to me). I just wish people did benchmarks like this (though I know it's a PITA to do it this way).
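
For what it's worth, the procedure is scriptable. A sketch that only builds the command plan; the node names, `poweroff`, and the `ycsb` invocation are all placeholders to adapt to your cluster and workload driver:

```python
import random

# Sketch of the kill-two-nodes benchmark as a command plan. Every
# command here is a placeholder; wire it up to ssh and your workload
# driver to actually execute it.

nodes = [f"node{i}" for i in range(1, 9)]  # step 1: 8-node cluster

def plan(nodes, settle_seconds=30 * 60, seed=None):
    rng = random.Random(seed)
    victims = rng.sample(nodes, 2)  # step 5: pick victims by dice roll
    return (
        [["ssh", v, "poweroff"] for v in victims]  # step 5: poof, gone
        + [["sleep", str(settle_seconds)]]         # step 6: let it settle
        + [["ycsb", "run", "yourdb"]]              # step 7: the run that counts
    )

for cmd in plan(nodes, seed=1):
    print(" ".join(cmd))
```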

[+] foobarbazetc | 14 years ago
Why would a competitor build this Riak benchmark? How are they supposed to know which Riak configuration performs best?

That's the point of YCSB. Each vendor can submit the optimal configuration for their system, and they all run the same benchmark.

At the end of the day, each vendor is going to publish benchmarks that show their system performing better than all others.

It's your job, not theirs, to verify those benchmarks for your particular workload.

Edit: Also, while Mongo wouldn't qualify as a distributed store, Cassandra definitely would. In fact, it's more "distributed" than Riak, since you have to pay Basho for multiple-DC support (unless that's in the open source version?), whereas Cassandra is completely free.

[+] j_baker | 14 years ago
What exactly makes Cassandra "not really" a KV store, while Riak "really" is a KV store? I'm calling No True Scotsman.
[+] jsavimbi | 14 years ago
Has it been proven in a pr0n environment? If not, I'll wait and see.
[+] jbverschoor | 14 years ago
Interesting... keep it coming.

We need a unified NoSQL query language... basically what SQL is for relational databases.

[+] bickfordb | 14 years ago
What's wrong with SQL?
[+] yawniek | 14 years ago
There is SPARQL.