Note that this is not a database server, like Redis or Memcached. This is a database library, more along the lines of sqlite. In particular: "There is no client-server support builtin to the library. An application that needs such support will have to wrap their own server around the library" and only one process can access a database at a time.
Hi, thank you for this explanation. My first thought was: how is this different from Redis? Then, when I read that there is no client-server support, I thought this was worse than Redis. But your comparison with sqlite makes it look a little better. I can see a few use cases for it now. Thanks!
This could be great news for iOS. AFAIK, this is the first K/V DB that can be built for iOS (C++) and has a free license compatible with the App Store (Tokyo Cabinet uses the LGPL, and the author showed no interest when I asked about adding a static-linking clause; Berkeley DB requires an expensive commercial license; etc.).
Not key/value, but Mobile CouchBase (CouchDB) supports iOS and apps using it have been approved for the App Store.
An interesting point to note is that Mobile CouchBase is primarily written in Erlang, so porting an existing key/value store is not dependent on its native language being C or C++.
This is pretty neat, and different from many other "simple" KV stores, because it arranges things in key order on disk and permits forward/reverse iteration over keys (i.e., range queries are cheap).
edit: it's actually full log-structured merge, sstables, memtables etc.
The leveling here reminds me a bit of how Lucene organizes and compacts its segments files. When you write documents to Lucene they are written to a segments file, usually one for each batch of documents you write. When searching, each of the segments files are searched for terms, so the more segments files you have, the slower the search tends to be. When segments files reach a certain size they are automatically compacted into other segments files. Users can (and should) then "optimize" their index, which compacts all the segments files into a single file. The log files and sorted tables of leveldb appear to work very similarly.
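The merging behavior described above can be sketched with a toy size-tiered model (my own illustration; this is neither Lucene's nor leveldb's actual policy):

```cpp
#include <numeric>
#include <vector>

// Each batch of writes produces a new segment. Once too many segments
// accumulate, merge them all into one, so searches only ever have to
// consult a bounded number of segments.
void AddSegment(std::vector<int>& segments, int size,
                std::size_t max_segments = 4) {
  segments.push_back(size);
  if (segments.size() > max_segments) {
    int merged = std::accumulate(segments.begin(), segments.end(), 0);
    segments.clear();
    segments.push_back(merged);  // one big compacted segment
  }
}
```

Writing five size-1 segments with `max_segments = 4` leaves a single merged segment of size 5, mirroring how an "optimize" collapses everything into one file.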
The idea behind this approach is to simplify the maintenance of the copyright line and to encourage collective ownership of the code. I'm told that some open-source projects have conflicts about who's name appears in which files, but I've not experienced that first-hand.
leveldb is a persistent ordered map; bitcask is a persistent hash table (no ordered iteration).
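The distinction can be illustrated with the C++ standard library (an analogy only; this is not leveldb's or bitcask's API): an ordered map supports cheap range scans starting from `lower_bound`, while a hash table has no useful iteration order.

```cpp
#include <map>
#include <string>

// Count keys in the half-open range [lo, hi) of an ordered map.
// With a hash table (std::unordered_map), the only option would be
// a full scan of every key.
int CountKeysInRange(const std::map<std::string, std::string>& m,
                     const std::string& lo, const std::string& hi) {
  int n = 0;
  for (auto it = m.lower_bound(lo); it != m.end() && it->first < hi; ++it)
    ++n;
  return n;
}
```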
bitcask stores a fixed-size record in memory for every key. So for databases with a large number of keys, it may use too much memory for some applications.
I think bitcask can guarantee at most one disk seek per lookup. leveldb may have to do a small handful of disk seeks. To clarify, leveldb stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, leveldb will need to do one seek (for the last level, since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, leveldb will need two seeks.
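That arithmetic can be captured in a toy cost model (my own sketch, not leveldb code): with a ~10x size ratio between levels, every level that does not fit in the OS buffer cache costs roughly one seek per lookup.

```cpp
#include <algorithm>
#include <vector>

// Rough model: level sizes form a geometric series (each level ~fanout
// times the previous), the cache absorbs the smallest levels first, and
// each uncached level costs about one disk seek per lookup.
int SeeksPerRead(long db_size, long cache_size, long fanout = 10) {
  std::vector<long> levels;
  for (long remaining = db_size; remaining > 0; remaining /= fanout)
    levels.push_back(remaining - remaining / fanout);
  std::sort(levels.begin(), levels.end());  // smallest (hottest) first
  int seeks = 0;
  for (long size : levels) {
    if (cache_size >= size) cache_size -= size;
    else ++seeks;
  }
  return seeks;
}
```

With a 100-unit database, a 10-unit cache yields one seek and a 1-unit cache yields two, matching the 10% and 1% cases above.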
Anybody know what hardware those benchmarks were on? Is it actually durable? They mention problems with writeback caching, but are they at least durable from the OS perspective? How well does it scale?
It is durable from an OS perspective. I started typing in a long explanation, but I think the comment for the "sync" field in the options.h file captures things well:
struct WriteOptions {
  // If true, the write will be flushed from the operating system
  // buffer cache (by calling WritableFile::Sync()) before the write
  // is considered complete. If this flag is true, writes will be
  // slower.
  //
  // If this flag is false, and the machine crashes, some recent
  // writes may be lost. Note that if it is just the process that
  // crashes (i.e., the machine does not reboot), no writes will be
  // lost even if sync==false.
  //
  // In other words, a DB write with sync==false has similar
  // crash semantics as the "write()" system call. A DB write
  // with sync==true has similar crash semantics to a "write()"
  // system call followed by "fsync()".
  //
  // Default: false
  bool sync;
};
We don't have much experience with scaling to larger databases yet. There are known problems which will cause a (smallish) constant factor slowdown in write performance after the database becomes a few (10?) GB in size, but I don't recall the details all that well, and the implementation has changed somewhat since that experiment. I would like to characterize this better and fix things so we can support somewhere between 100GB-1TB databases well. It just hasn't become a priority yet.
The benchmark numbers on the linked page were from a small, million-entry database that easily fits in the OS buffer cache.
Was really psyched until I saw --std=c++0x in the Makefile. What 0x features does leveldb depend on, and can those dependencies be realistically removed?
The use of c++0x should be limited to the small AtomicPointer class in port_posix.h. The class uses <cstdatomic> for its implementation. If you can provide a different implementation of "acquire store" and "release load" for your architecture, you should be all set without the need for c++0x.
I suspect that you may find suitable code in the chromium source code. See leveldb's port_chromium.h.
We have no such plans. The library is designed so that concurrency control is outside its scope. If I needed it for an application, I might consider using some application level code that looks like:
bool CAS(leveldb::DB* db, const std::string& key,
         const std::string& oldvalue, const std::string& newvalue) {
  std::lock_guard<std::mutex> l(some_mutex);  // mutex shared by all callers
  std::string value;
  leveldb::Status s = db->Get(leveldb::ReadOptions(), key, &value);
  bool result = s.ok() && value == oldvalue;
  if (result) {
    result = db->Put(leveldb::WriteOptions(), key, newvalue).ok();
  }
  return result;
}
This should be not much slower than any hard-wired CAS we could provide from inside leveldb.
Based on their numbers it looks slower than TokyoCabinet but still much faster than BerkeleyDB. Would need to design a head-to-head benchmark.
I don't think Redis is a good comparison, as that's an in-memory database better suited for the 10% of hot data you'd need to cache, whereas disk stores like TokyoCabinet and LevelDB would be great for storing the other 90-100%. That is, if your use case involves a large dataset and you don't have terabytes of RAM lying around.
Chromium compiles leveldb using the standard Xcode toolchain, albeit with a dependency on Chromium's "base" library, so it's certainly possible. I can't tell you the exact steps, but I suspect it wouldn't be too hard to get the posix port compiling with Xcode as well.
The only paper on this data structure seems to be: http://goo.gl/CVF1l
This paper is poorly written and, quite honestly, not useful for implementing an LSM tree. Does anyone know of a better paper than this one?
#include <boost/atomic.hpp>

class AtomicPointer {
 private:
  boost::atomic<void*> rep_;
 public:
  AtomicPointer() { }
  explicit AtomicPointer(void* v) : rep_(v) { }
  inline void* Acquire_Load() const {
    return rep_.load(boost::memory_order_acquire);
  }
  inline void Release_Store(void* v) {
    rep_.store(v, boost::memory_order_release);
  }
  inline void* NoBarrier_Load() const {
    return rep_.load(boost::memory_order_relaxed);
  }
  inline void NoBarrier_Store(void* v) {
    rep_.store(v, boost::memory_order_relaxed);
  }
};
Safe to assume the data fit into memory, whatever amount there was.
I had to roll my own b-tree library specifically to get this feature, since nothing out there had it.
Then look up the n-th largest key and pull the values?
e.g. SELECT value FROM key_value_pairs ORDER BY size(key) LIMIT N
Hackers should be able to figure out which best fits their particular use case.
http://code.google.com/p/leveldb/issues/detail?id=2
A performance comparison with memcached would be interesting.
get/put/delete vs http://redis.io/commands