RocksDB – A persistent key-value store for fast storage environments

[+] snewman|12 years ago|reply

Very nice work, and the wiki is also quite nice -- I wish more projects had a page like https://github.com/facebook/rocksdb/wiki/Rocksdb-Architectur.... It's really nice to see a clear, terse summary of what makes this project interesting relative to its predecessors.

At my company (scalyr.com), we've built a more-or-less clone of LevelDB in Java, with a similar goal of extracting more performance on high-powered servers (and better integration with our Java codebase). I'll be digging through rocksdb to see what ideas we might borrow. A few things we've implemented that might be interesting for rocksdb:

* The application can force segments to be split at specified keys. This is very helpful if you write a block of data all at once and then don't touch it for a long time. The initial memtable compaction places this data in its own segment and then we can push that segment down to the deepest level without ever compacting it again. It can also eliminate the need for bloom filters for many use cases, as you often wind up with only one segment overlapping a particular key range.

* The application can specify different compression schemes for different parts of the keyspace. This is useful if you are storing different kinds of data in the same database.

* We don't use timestamps anywhere other than the memtable. This puts some constraints on snapshot management, but streamlines get/scan operations and reduces file size for small values.

Do you have benchmarks for scan performance? This is an important area for us. I don't have exact figures handy, but we get something like 2GB/second (using 8 threads) on an EC2 h1.4xlarge, uncached (reading from SSD) and decompressing on the fly. This is an area we've focused on.

I'd enjoy getting together to compare notes -- send me an e-mail if you're interested. steve @ (the domain mentioned above).

[+] hyc_symas|12 years ago|reply

SkyDB using LMDB gets 3GB/sec on a standalone PC. https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1...

[+] dhruba_b|12 years ago|reply

Thanks for your comments Steve.

1. RocksDb has a feature that allows an application to determine when to close a file (i.e. segment). You can write your compaction code via compaction_filter_factory defined in https://github.com/facebook/rocksdb/blob/master/include/rock...

2. RocksDb also has a feature that allows an application to close a block inside a segment. https://github.com/facebook/rocksdb/commit/fd075d6edd68ddbc1...

3. RocksDb has a feature to use different compression algorithms for different parts of the database. In the Level Style Compaction, you can configure a different compression algorithm for different levels. In Universal Style Compaction, you can specify that you want compression only for x% earliest data in your database.

4. We have internal benchmarks for scan performance but because of lack of developer resources, we might not be able to open source those numbers.

It will be great to catch up in person.

[+] rsynnott|12 years ago|reply

> we've built a more-or-less clone of LevelDB in Java, with a similar goal of extracting more performance on high-powered servers (and better integration with our Java codebase).

This sounds quite interesting; have you considered open-sourcing it?

[+] dhruba_b|12 years ago|reply

Hi guys, I am Dhruba and I work in the Database Engineering team at Facebook. We just released RocksDB as an open source project. If anybody has any technical questions about RocksDB, please feel free to ask. Thanks.

[+] jbapple|12 years ago|reply

Hi Dhruba, thanks for volunteering to ask questions.

What are the big algorithmic ideas behind RocksDB?

My understanding is that LevelDB is based on log structured merge trees. These can be deamortized using methods from Overmars's "The Design of Dynamic Data Structures" or Bender et al.'s "Cache-Oblivious Streaming B-trees". How did you reduce latency?

What else was slowing down databases larger than RAM? How did you fix that?

[+] hendzen|12 years ago|reply

How much do you think RocksDB/LevelDB performance is impacted by the use of relatively coarse-grained locking? Another LevelDB fork, HyperLevelDB [1] implemented a fine grained scheme with performance benefits. Disclaimer: I am working on a (unreleased) fork of LevelDB that uses hardware transactional memory for synchronization using the new TSX-NI instructions present on Haswell processors.

[1] - http://hyperdex.org/performance/leveldb/

[+] techtalsky|12 years ago|reply

Can this be used as a drop-in replacement for LevelDB on queue technologies like ApolloMQ that use LevelDB as the default?

[+] droid5|12 years ago|reply

https://github.com/facebook/rocksdb/issues/new is a 404.

Appreciated when project is on github and open for issues/PR, etc.

[+] mml|12 years ago|reply

Any replication support? (Or any sort of distribution?)

[+] Patient0|12 years ago|reply

I'm surprised that the C++ code is not using the RAII idiom in some obvious places.

For example: https://github.com/facebook/rocksdb/blob/master/db/db_impl.c

There are many places with bracketed calls to mutex_.Lock and mutex_.Unlock().

An example:

      mutex_.Unlock();
      LogFlush(options_.info_log);
      env_->SleepForMicroseconds(1000000);
      mutex_.Lock()

Why didn't the authors use the RAII idiom here? Even if there are no exceptions expected, the code would still be simpler and less error prone by using a guard object.

[+] tsewlliw|12 years ago|reply

fixed your link: https://github.com/facebook/rocksdb/blob/master/db/db_impl.c...

Take another look! There's a guard object used at the function scope to ensure the lock is released, and this block is bracketed to release and reacquire the lock, not acquire and release. There may be a case for a guard object that does the release/reacquire, but its definitely not a slam dunk like acquire/release

[+] e12e|12 years ago|reply

That link should be:

https://github.com/facebook/rocksdb/blob/master/db/db_impl.c...

(not db_impl.c but .cc) I was wondering for a while if it was possible to do RAII in idiomatic c99 -- and it appears it isn't (or doesn't make as much sense, anyway).

[+] ExpiredLink|12 years ago|reply

A lot of C++ code essentially is 'C with classes': no RAII, no exception handling.

[+] rdtsc|12 years ago|reply

Well LevelDB is already good. And if this improves on it, that's great.

I was looking at embedded key value stores and also found -- HyperLevelDB (from creators of Hyperdex database). They also improved on LevelDB in respect to compaction and locking:

http://hyperdex.org/performance/leveldb/

So now I am curios how it would compare.

Another interesting case optimized for reads is LMDB. That is a small but very fast embedded database at sits at the core of OpenLDAP. That one has impressive benchmarks.

http://symas.com/mdb/microbench/

(Note: LMDB used to be called MDB, you might know it by that name).

[+] AaronFriel|12 years ago|reply

The LMDB statistics are very strange - why is synchronous SSD performance worse on most figures than HDD performance? Something seems very wrong with these benchmarks:

    Section 5 (SSD) F (Synchronous Writes)
    
    Random Writes
    
    LevelDB              342 ops/sec	
    Kyoto TreeDB          67 ops/sec	
    SQLite3              114 ops/sec	
    MDB                  148 ops/sec	
    MDB, no MetaSync     322 ops/sec	
    BerkeleyDB           291 ops/sec	
    
    Section 8 (HDD) F (Synchronous Writes)
    
    Random Writes
    
    LevelDB             1291 ops/sec	
    Kyoto TreeDB          28 ops/sec	
    SQLite3              112 ops/sec	
    MDB                  297 ops/sec	
    BerkeleyDB           704 ops/sec

Really? LevelDB is four times faster on an HDD than an SSD with synchronous writes? BerkeleyDB is over twice as fast?

This smells.

[+] hyc_symas|12 years ago|reply

LSMs have a long long way to go to catch up to LMDB. http://symas.com/mdb/hyperdex/

[+] gfodor|12 years ago|reply

this is cool, though I'd wonder how it compares to Kyoto Cabinet. another big issue I've run into personally is the fact that both LevelDB and KC don't explicitly support multiple processes reading the db at once. (KC's API allows this but advises against it, LevelDB afaik doesn't even allow it.) I wonder if RocksDB gets past this.

[+] dhruba_b|12 years ago|reply

RocksDB allows only one process to open the DB in read-write mode but other process can access the DB read-only (with a few configuration settings)

[+] stass|12 years ago|reply

If you just need concurrent reading and update the database from a single process, you can use CDB. It always served me well.

[+] hyc_symas|12 years ago|reply

Kyoto Cabinet will self-corrupt if you use it that way. LMDB supports multi-process explicitly.

[+] maaku|12 years ago|reply

HyperDex?

[+] _kst_|12 years ago|reply

A very minor point:

The illustrative code snippet on the home page has a spurious semicolon on the first line:

    #include <assert>;

[+] jamesgpearce|12 years ago|reply

fixed! - thanks

[+] wbolster|12 years ago|reply

The benchmark at https://github.com/facebook/rocksdb/wiki/Performance-Benchma... states that for LevelDB, "in 24 hours it inserted only 2 million key-values", and that "each key is of size 10 bytes, each value is of size 800 bytes".

I might be missing something, but that took just a few minutes on my ~2 year old desktop machine. Sample code: https://gist.github.com/wbolster/7487225

[+] dhruba_b|12 years ago|reply

There was a typo, the 2 million should have been 200 million keys. I fixed the wiki page. Thanks again for pointing it out.

[+] parshap|12 years ago|reply

Node.js bindings (compatible with levelup) have already been released by rvagg: https://npmjs.org/package/rocksdb

[+] canadi|12 years ago|reply

Tnx for all the comments! Feel free to continue the discussion at https://www.facebook.com/groups/rocksdb.dev/

[+] unknown|12 years ago|reply

[deleted]

[+] arthursilva|12 years ago|reply

Looking forward to see this in Riak

72 comments