Note that this is not a database server, like Redis or Memcached. This is a database library, more along the lines of sqlite. In particular: "There is no client-server support builtin to the library. An application that needs such support will have to wrap their own server around the library" and only one process can access a database at a time.
Hi, thank you for this explanation. My first thought was: how is this different from Redis? Then, when I read that there is no client-server support, I thought this was worse than Redis. But your comparison with sqlite makes it look a little better. I can see a few use cases for it now. Thanks!
This could be great news for iOS. AFAIK, this is the first K/V DB that can be built for iOS (C++) and has a free license compatible with the App Store (Tokyo Cabinet uses the LGPL, and the author showed no interest when I asked about adding a static-linking clause; Berkeley DB requires an expensive commercial license; etc.).
Not key/value, but Mobile CouchBase (CouchDB) supports iOS and apps using it have been approved for the App Store.
An interesting point to note is that Mobile CouchBase is primarily written in Erlang, so porting an existing key/value store is not dependent on its native language being C or C++.
This is pretty neat, and different from many other "simple" KV stores, because it arranges things in key order on disk and permits forward/reverse iteration over keys (i.e., range queries are cheap).
edit: it's actually full log-structured merge, sstables, memtables etc.
The leveling here reminds me a bit of how Lucene organizes and compacts its segments files. When you write documents to Lucene they are written to a segments file, usually one for each batch of documents you write. When searching, each of the segments files are searched for terms, so the more segments files you have, the slower the search tends to be. When segments files reach a certain size they are automatically compacted into other segments files. Users can (and should) then "optimize" their index, which compacts all the segments files into a single file. The log files and sorted tables of leveldb appear to work very similarly.
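The merging behavior described above can be sketched with a toy size-tiered model (my own illustration; this is neither Lucene's nor leveldb's actual policy):

```cpp
#include <numeric>
#include <vector>

// Each batch of writes produces a new segment. Once too many segments
// accumulate, merge them all into one, so searches only ever have to
// consult a bounded number of segments.
void AddSegment(std::vector<int>& segments, int size,
                std::size_t max_segments = 4) {
  segments.push_back(size);
  if (segments.size() > max_segments) {
    int merged = std::accumulate(segments.begin(), segments.end(), 0);
    segments.clear();
    segments.push_back(merged);  // one big compacted segment
  }
}
```

Writing five size-1 segments with `max_segments = 4` leaves a single merged segment of size 5, mirroring how an "optimize" collapses everything into one file.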
The idea behind this approach is to simplify the maintenance of the copyright line and to encourage collective ownership of the code. I'm told that some open-source projects have conflicts about who's name appears in which files, but I've not experienced that first-hand.
leveldb is a persistent ordered map; bitcask is a persistent hash table (no ordered iteration).
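The distinction can be illustrated with the C++ standard library (an analogy only; this is not leveldb's or bitcask's API): an ordered map supports cheap range scans starting from `lower_bound`, while a hash table has no useful iteration order.

```cpp
#include <map>
#include <string>

// Count keys in the half-open range [lo, hi) of an ordered map.
// With a hash table (std::unordered_map), the only option would be
// a full scan of every key.
int CountKeysInRange(const std::map<std::string, std::string>& m,
                     const std::string& lo, const std::string& hi) {
  int n = 0;
  for (auto it = m.lower_bound(lo); it != m.end() && it->first < hi; ++it)
    ++n;
  return n;
}
```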
bitcask stores a fixed-size record in memory for every key. So for databases with a large number of keys, it may use too much memory for some applications.
I think bitcask can guarantee at most one disk seek per lookup. leveldb may have to do a small handful of disk seeks. To clarify, leveldb stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, leveldb will need to do one seek (for the last level, since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, leveldb will need two seeks.
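That arithmetic can be captured in a toy cost model (my own sketch, not leveldb code): with a ~10x size ratio between levels, every level that does not fit in the OS buffer cache costs roughly one seek per lookup.

```cpp
#include <algorithm>
#include <vector>

// Rough model: level sizes form a geometric series (each level ~fanout
// times the previous), the cache absorbs the smallest levels first, and
// each uncached level costs about one disk seek per lookup.
int SeeksPerRead(long db_size, long cache_size, long fanout = 10) {
  std::vector<long> levels;
  for (long remaining = db_size; remaining > 0; remaining /= fanout)
    levels.push_back(remaining - remaining / fanout);
  std::sort(levels.begin(), levels.end());  // smallest (hottest) first
  int seeks = 0;
  for (long size : levels) {
    if (cache_size >= size) cache_size -= size;
    else ++seeks;
  }
  return seeks;
}
```

With a 100-unit database, a 10-unit cache yields one seek and a 1-unit cache yields two, matching the 10% and 1% cases above.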
Anybody know what hardware those benchmarks were on? Is it actually durable? They mention problems with writeback caching, but are they at least durable from the OS perspective? How well does it scale?
It is durable from an OS perspective. I started typing in a long explanation, but I think the comment for the "sync" field in the options.h file captures things well:
struct WriteOptions {
  // If true, the write will be flushed from the operating system
  // buffer cache (by calling WritableFile::Sync()) before the write
  // is considered complete. If this flag is true, writes will be
  // slower.
  //
  // If this flag is false, and the machine crashes, some recent
  // writes may be lost. Note that if it is just the process that
  // crashes (i.e., the machine does not reboot), no writes will be
  // lost even if sync==false.
  //
  // In other words, a DB write with sync==false has similar
  // crash semantics as the "write()" system call. A DB write
  // with sync==true has similar crash semantics to a "write()"
  // system call followed by "fsync()".
  //
  // Default: false
  bool sync;
};
We don't have much experience with scaling to larger databases yet. There are known problems which will cause a (smallish) constant factor slowdown in write performance after the database becomes a few (10?) GB in size, but I don't recall the details all that well, and the implementation has changed somewhat since that experiment. I would like to characterize this better and fix things so we can support somewhere between 100GB-1TB databases well. It just hasn't become a priority yet.
The benchmark numbers on the linked page were from a small, million-entry database that easily fits in the OS buffer cache.
Was really psyched until I saw --std=c++0x in the Makefile. What 0x features does leveldb depend on, and can those dependencies be realistically removed?
The use of c++0x should be limited to the small AtomicPointer class in port_posix.h. The class uses <cstdatomic> for its implementation. If you can provide a different implementation of "acquire store" and "release load" for your architecture, you should be all set without the need for c++0x.
I suspect that you may find suitable code in the chromium source code. See leveldb's port_chromium.h.
We have no such plans. The library is designed so that concurrency control is outside its scope. If I needed it for an application, I might consider using some application level code that looks like:
bool CAS(leveldb::DB* db, const std::string& key,
         const std::string& oldvalue, const std::string& newvalue) {
  std::lock_guard<std::mutex> l(some_mutex);  // mutex shared by all callers
  std::string value;
  leveldb::Status s = db->Get(leveldb::ReadOptions(), key, &value);
  bool result = s.ok() && value == oldvalue;
  if (result) {
    result = db->Put(leveldb::WriteOptions(), key, newvalue).ok();
  }
  return result;
}
This should be not much slower than any hard-wired CAS we could provide from inside leveldb.
Based on their numbers it looks slower than TokyoCabinet but still much faster than BerkeleyDB. Would need to design a head-to-head benchmark.
I don't think Redis is a good comparison, as that's an in-memory database better suited for the 10% of hot data you'd need to cache, whereas disk stores like TokyoCabinet and LevelDB would be great for storing the other 90-100%. That is, if your use case involves a large dataset and you don't have terabytes of RAM lying around.
Chromium compiles leveldb using the standard Xcode toolchain, albeit with a dependency on Chromium's "base" library, so it's certainly possible. I can't tell you the exact steps, but I suspect it wouldn't be too hard to get the posix port compiling with Xcode as well.
The only paper on this data structure seems to be: http://goo.gl/CVF1l
This paper is poorly written and, quite honestly, not useful for implementing an LSM tree. Does anyone know of a better paper than this one?
#include <boost/atomic.hpp>

class AtomicPointer {
 private:
  boost::atomic<void*> rep_;
 public:
  AtomicPointer() { }
  explicit AtomicPointer(void* v) : rep_(v) { }
  inline void* Acquire_Load() const {
    return rep_.load(boost::memory_order_acquire);
  }
  inline void Release_Store(void* v) {
    rep_.store(v, boost::memory_order_release);
  }
  inline void* NoBarrier_Load() const {
    return rep_.load(boost::memory_order_relaxed);
  }
  inline void NoBarrier_Store(void* v) {
    rep_.store(v, boost::memory_order_relaxed);
  }
};
Safe to assume the data fit into memory, whatever amount there was.
I had to roll my own b-tree library specifically to get this feature, since nothing out there had it.
Then look up the n-th largest key and pull the values?
e.g. SELECT value FROM key_value_pairs ORDER BY size(key) LIMIT N
Hackers should be able to figure out which best fits their particular use case.
http://code.google.com/p/leveldb/issues/detail?id=2
A performance comparison with memcached would be interesting.
get/put/delete vs http://redis.io/commands