
On moving from CouchDB to Riak

152 points | franckcuny | 15 years ago | labs.linkfluence.net | reply

39 comments

[+] smanek|15 years ago|reply
I went through a lengthy evaluation process of Riak recently, and came away with a generally positive impression.

First of all, it's beautifully engineered, as long as you just need a KV store or a graph DB (I wasn't in love with the MapReduce stuff, but that's another story). None of the hassle that Hadoop/HBase have with some nodes being special (HBase Master, HDFS Namenode, etc.). Also, no running multiple daemons per server (e.g., no distinction between regionserver and datanode daemons, like HBase). Easy config files, a simple HTTP API (so you can just throw an off-the-shelf load balancer like HAProxy in front of it), and lots of little things that just make life easier.
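Because every node serves the same HTTP API, "throw HAProxy in front of it" really is the whole story: a client can round-robin across nodes itself. A toy sketch of that idea (hostnames and the `/riak/<bucket>/<key>` path are from the 0.x-era HTTP interface; the node names are made up):

```python
import itertools

# Every Riak node is equal, so a client (or an off-the-shelf balancer
# like HAProxy) can simply rotate requests across the cluster.
nodes = ["riak1:8098", "riak2:8098", "riak3:8098"]  # hypothetical hosts
next_node = itertools.cycle(nodes)

def url_for(bucket, key):
    # /riak/<bucket>/<key> was the object path in early Riak releases
    return f"http://{next(next_node)}/riak/{bucket}/{key}"
```

Any plain HTTP client can then GET/PUT those URLs; no special client library is required.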

I also really like how it's very explicit about the CAP tradeoffs it makes - with powerful conflict resolution available for when consistency has been sacrificed (instead of trying to sweep the issue under the rug, like many other distributed dbs do).
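The "powerful conflict resolution" above is the Dynamo-style sibling mechanism: when writes race, the store keeps both versions with their vector clocks and lets the application merge them. A rough, simplified sketch (real vector clocks also carry timestamps and get pruned):

```python
def merge_vclocks(a, b):
    """Pointwise max of two vector clocks (dicts of node -> counter)."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(n, 0) >= c for n, c in b.items())

# Two concurrent updates to a shopping cart: neither clock descends the
# other, so the app resolves by unioning values and merging clocks.
s1 = ({"nodeA": 2, "nodeB": 1}, {"milk"})
s2 = ({"nodeA": 1, "nodeB": 2}, {"eggs"})
if not descends(s1[0], s2[0]) and not descends(s2[0], s1[0]):
    resolved = (merge_vclocks(s1[0], s2[0]), s1[1] | s2[1])
```

The point is that the conflict is surfaced to the application, not silently last-write-wins'd away.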

However, there are a few downsides.

First, as mentioned in the article, with the default backend (a kv-store called 'bitcask') all the keys per node have to fit in memory (and each key has, on the order of, 20 bits of overhead, IIRC). Annoyingly, this fact isn't noted anywhere in the Riak documentation that I saw (although there is a work-around by using the InnoDB backend). This won't matter too much for many use cases, but it can be pretty painful if you're not expecting it.
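Taking the recollection above at face value but reading it as roughly 20 bytes of fixed overhead per key (a reply further down asks the same bits-vs-bytes question), the per-node ceiling is simple arithmetic; both numbers here are assumptions, not documented figures:

```python
def max_keys_per_node(ram_bytes, avg_key_len, overhead=20):
    """Every key's in-memory entry costs its own bytes plus a fixed
    overhead (assumed ~20 bytes), and all entries must fit in RAM."""
    return ram_bytes // (avg_key_len + overhead)

# e.g. a node with 8 GB of RAM and 36-byte keys (a UUID string):
keys = max_keys_per_node(8 * 2**30, 36)
```

At ~153 million keys for 8 GB this is plenty for many workloads, which is exactly why it bites people who never checked.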

Second, you can guard against data loss by specifying that N copies of each piece of data be stored on your cluster. However, under some conditions (https://issues.basho.com/show_bug.cgi?id=228), the data may be replicated on fewer than N distinct nodes, so the failure of fewer than N nodes could result in data loss.
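The failure mode in that bug report is easy to see with a toy ring: Riak places the N replicas on N consecutive vnodes, but if adjacent vnodes happen to be owned by the same physical machine, fewer than N machines actually hold the data. A minimal illustration (the ring layout is contrived for the example):

```python
# vnode index -> owning physical node, on a tiny 4-vnode ring
ring = ["node1", "node2", "node1", "node3"]

def preference_nodes(start_vnode, n=3):
    """Owners of the n consecutive vnodes starting at start_vnode."""
    return [ring[(start_vnode + i) % len(ring)] for i in range(n)]

owners = preference_nodes(2)   # vnodes 2, 3, 0
distinct = len(set(owners))    # only 2 distinct machines, not 3
```

With only 2 distinct owners, losing those 2 machines loses the key even though N=3.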

Finally, to the best of my knowledge, the largest Riak clusters in production are around 60 nodes, while Hadoop has 2000-node clusters (e.g., at Facebook) running in production. Perfectly acceptable for most users, but just one more thing to worry about if you're planning to roll out a large cluster (more potential for 'unknown unknowns', so to speak).

[+] siculars|15 years ago|reply
I've given a few talks about NOSQL in general and Riak in particular and blogged[0] about both. I agree with all of your comments and will try to add a bit.

One of the major driving design considerations of Riak is predictable scalability: for each unit of hardware that you add to the system, you get back a predictable level of performance. You see this throughout the design of the system. Homogeneity amongst the nodes - meaning that all nodes are the same and there are no special nodes - is a big win. So is tunable CAP, as you already mentioned.

Another major win from my perspective is the modular nature of the code base. I think it is one of the best layouts of a large system that I have seen yet. If you look at the main Riak repo you will see that it is composed of a few submodules; most of the system is compartmentalized in this fashion. The major advantage is that components can be developed at their own pace and reused/mixed more easily.

Re. Key limitations when using the default bitcask backend:

There is a spreadsheet[1] that outlines the number of keys you can have in your system based on a few variables. Definitely worth checking out. Two things to note here: there is the hard limit of keys per machine - due to each machine's max memory - and there is the larger limit of keys per cluster, based on the max memory of the cluster. Note that all of this applies specifically when using the bitcask backend. There has been lots of talk about how to change this going forward, and I know the Basho team is looking into it. Riak is quite interesting in that it can support a number of different backends - at the same time, even. So you could have a bitcask backend for some data and a memory backend for other data. Since Riak is distributed, the implication is that beyond a single machine's resources you should also think of the cluster's total resources: total memory, total CPU cores, and total disk spindles - that last one quite important but under-considered.
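The cluster-wide arithmetic in that spreadsheet amounts to something like the following sketch. The 20-byte overhead figure is an assumption carried over from the parent comment, and `n_val` stands for Riak's replication factor; each key is stored `n_val` times, so it eats `n_val` keydir entries across the cluster:

```python
def cluster_key_capacity(nodes, ram_per_node, avg_key_len,
                         n_val=3, overhead=20):
    """Rough upper bound on distinct keys a cluster can index in RAM."""
    per_entry = avg_key_len + overhead
    total_entries = nodes * ram_per_node // per_entry
    return total_entries // n_val  # each key occupies n_val entries

# e.g. the article's 5 nodes x 8 GB, with 36-byte keys:
cap = cluster_key_capacity(5, 8 * 2**30, 36)
```

Real capacity planning also has to budget RAM for the OS, Erlang VM, and page cache, so treat this as an upper bound.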

Re. max nodes in production:

One of the major considerations when dealing with a Dynamo-derived system like Riak is cross-node chatter. Riak does a lot of its magic by way of a gossip channel that is sending around all kinds of data. As the number of nodes in the system increases, the level of chatter increases. I think there needs to be more work in optimizing that chatter for larger cluster sizes, and I think that has been happening if you follow the change logs.

If you follow the mailing list or IRC you will notice that the primary concern amongst people new to Riak is querying. Riak has no secondary indexes, outside of Riak Search, which is a separate download built on top of Riak (see code modularization). Riak has no native ordering. All of those things need to happen in the m/r (map/reduce) phase. As it stands, I think this is one of the major friction points to further adoption, and why I have consistently recommended pairing Riak with Redis whenever possible.
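The Riak-plus-Redis pairing usually means maintaining your own secondary index by hand: write the object to Riak, and add its key to a Redis set per indexed field so lookups avoid a full map/reduce scan. A sketch with plain dicts standing in for the two stores (all names here are illustrative, not any library's API):

```python
riak = {}          # stands in for the Riak KV store
redis_index = {}   # stands in for Redis sets, e.g. SADD city:Paris <key>

def put_user(key, user):
    """Write the object, then update the hand-rolled secondary index."""
    riak[key] = user
    redis_index.setdefault("city:" + user["city"], set()).add(key)

def users_in_city(city):
    """Index lookup: set members are Riak keys, then fetch each object."""
    return [riak[k] for k in redis_index.get("city:" + city, ())]

put_user("u1", {"name": "Ana", "city": "Paris"})
put_user("u2", {"name": "Bo", "city": "Oslo"})
```

The obvious cost is that the application now owns index consistency (deletes and updates must touch both stores).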

These limitations aside, Riak represents, IMHO, the best mix of distribution, ease of use, raw power and growth potential of the currently in production NOSQL persisted datastore offerings. Philosophically I am very much attuned to the Dynamo world view and as a strict adherent think that Riak is the best representative from that perspective. Take that for what you will.

Disclaimer - Riak and Redis are my nosql databases of choice.

[0]http://siculars.posterous.com

[1]https://spreadsheets.google.com/ccc?key=0Ak4OBkABJPsxdEowYXc...

[+] ot|15 years ago|reply
> (and each key has, on the order of, 20 bits of overhead, IIRC)

20 bits? Really? Less than an integer? Or did you mean bytes? (Not nitpicking here, I'm just curious)

[+] rnewson|15 years ago|reply
Describing Bigcouch as a bit of a hack while admitting you've not used it seems a little unfair. That said, it's very valuable to hear how other people see our product.

Without doing a full-on sales pitch (I am not a salesman and do not portray one on TV), I should say that Bigcouch adds a lot of desired features to CouchDB, notably scaling beyond a single machine and sharding (which supports very large databases). We run several clusters of Bigcouch nodes in production, and have for some time; it's a thrill for me personally to see a distributed database take a real pounding and keep on trucking.

I've been meaning to try Riak myself, so you've inspired me to finally pull down the code and give it a proper examination.

[+] biot|15 years ago|reply
My reading was that they found CouchDB to have problems and adding complexity on top of a problematic base wasn't an appealing proposition for them.
[+] siculars|15 years ago|reply
Re. File size growth in Couch:

Couch files are written in an append-only fashion, so all operations, such as updates, are considered appends. The main upside is durability, meaning the file is always readable and never left in an odd state. As noted, however, this has the downside of requiring compaction to reclaim disk space. You should note that if you are using bitcask, the default backend for Riak, you will have the same problem that you had with Couch, as bitcask is also an append-only log. Bitcask will also need compaction, but as of the latest release there are various knobs to tweak regarding when that is queued up and when it is executed.

Re. database being one file:

I'm not exactly sure why Couch uses one file, but I suspect it is due to the use of b-trees internally. Would love to hear more from someone more experienced with Couch. Riak splits its data amongst many files: one file (or 'cask', when using the default bitcask backend) per 'vnode' (virtual node; 64 by default). The drawback here is that you need to ensure that your environment has enough open file descriptors to service requests.

[+] rnewson|15 years ago|reply
There are several reasons for CouchDB's use of a single, strictly append-only file, but the b-tree is not one of them.

In fact, it's one of CouchDB's most clever tricks to store a b-tree in an append-only fashion (it's far more common to update in place).

Two reasons CouchDB is strictly append only.

1) Safety: By never overwriting data, it avoids a whole class of data corruption issues that plague update-in-place schemes.

2) Performance: Writing at the end of the file allocates new blocks, leading to lower fragmentation and less seeking than an update-in-place scheme.

"durability" does not mean "the file is...never left in an odd state". The right word there is "consistent". CouchDB provides both (In fact it provides all four ACID properties, but the scope of a transaction is constrained to a single document update).

Finally, I should point out that using a single file, as opposed to a series of append-only files (like Berkeley JE, for example) is just a pragmatic choice. Nothing in principle would prevent a JE style approach, it's just a lot of (quite hard) work to do well.
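The append-only b-tree trick mentioned above can be sketched very crudely: updating a leaf never overwrites it; instead the changed leaf and every node on the path up to the root are appended, and the root at the file's tail becomes current. Old roots stay readable, which is both the safety property and the reason compaction is needed. A toy two-level version (nothing here is CouchDB's real layout):

```python
pages = []  # append-only "file" of tree nodes

def append(node):
    pages.append(node)
    return len(pages) - 1  # "offset" of the appended node

# Initial tree: two leaves and a root pointing at them.
leaf_a = append({"keys": {"a": 1}})
leaf_b = append({"keys": {"b": 2}})
root = append({"children": [leaf_a, leaf_b]})

# Update "a": append a NEW leaf and a NEW root; nothing is overwritten.
new_leaf = append({"keys": {"a": 99}})
root = append({"children": [new_leaf, leaf_b]})
```

A crash mid-write at worst leaves a truncated tail; the previous root is still intact, so the database is never in an odd state.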

[+] davisp|15 years ago|reply
The main two reasons that CouchDB doesn't use multiple files per database are system limits and increased complexity.

CouchDB has a bit of an alternative design in that it accepts that people might be running a large number of databases on a single node. I don't remember the exact numbers but I think we've heard of deploys using 10-100K (small) db's on a single node.

As to complexity: with a single file, there's no magical fsync dance to coordinate when committing data to disk. It's not impossible, it's just more complex.

It's not out of the question that CouchDB will move to using multiple files per database, but as it's open source, the biggest roadblock so far is that no one's needed it badly enough to implement it.

[+] philwise|15 years ago|reply
"We store a lot of data... 2TB"

Given that a pair of 2TB drives is less than $250 on ebuyer right now, 2TB of data is not 'big data'. You could comfortably stuff that in any decent database (SQL Server for example, I'm sure PostgreSQL would work too).

Just because a tiny machine on slicehost isn't big enough doesn't mean that your data won't fit in a normal database.

[+] defroost|15 years ago|reply
> You could comfortably stuff that in any decent database (SQL Server for example, I'm sure PostgreSQL would work too).

You may have noticed that the OP's requirements were:

* a REST interface
* sharding

The costs of SQL Server are unbelievably high, and while you can do horizontal partitioning and sharding with it, Riak is designed to be used in this manner. According to this article, SQL Server's "horizontal partitioning feature requires Enterprise or Datacenter licenses which have a retail price of $27,495 and $54,990 per processor." I think the OP made a wise choice. http://www.infoq.com/news/2011/02/SQL-Sharding

[+] erikpukinskis|15 years ago|reply
Just because your data will fit in a normal database doesn't mean it will be fast enough.
[+] rch|15 years ago|reply
Anyone planning to store significant quantities of data across multiple distinct nodes should have a look at the Disco Distributed File System:

http://discoproject.org/doc/start/ddfs.html

It handles 'data distribution, replication, persistence, addressing and access' quite well.

[+] electrum|15 years ago|reply
> We rolled out a cluster of 5 nodes and 1024 partitions. Each node has 8gig of memories, 3x1Tb disk in RAID5.

This hardware configuration doesn't make sense. 8 gigabytes of memory is pathetically small today; 32 gigs is basically standard, with 64 gigs costing little extra. RAID 5 with only three disks will have horrible write performance. A single machine with 32/64 gigs and 12-16 disks in a better RAID configuration should perform better at a lower operational cost.

[+] davidhollander|15 years ago|reply
What I disliked about Riak is that although at first glance it appears you can namespace key/values into multiple buckets, you can't. The whole database is really only a single bucket, and if you want to run a map/reduce it's always over every key in the database!

Now you can simply increase complexity and create multiple Riak clusters for different data schemas and treat it as a replication tool. However, their in-memory bitcask map/reduce was actually slower than a hard-drive map/reduce over text JSON files in a folder, where each file was individually opened, read from disk, decoded, and closed in a loop (with nothing held in memory). Which was rather scary!

Their replication scheme does appear to be a pretty faithful copy of Amazon's Dynamo, though (consistent hashing ring, quorums, Merkle trees), which is nice.
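The quorum part of that Dynamo heritage reduces to one inequality: with N replicas, a read of R replicas and a write of W replicas are guaranteed to overlap on at least one copy whenever R + W > N, so reads see the latest successful write. A one-liner check (standard Dynamo reasoning, not Riak-specific code):

```python
def quorum_overlaps(n, r, w):
    """True if any R-replica read must intersect any W-replica write."""
    return r + w > n

assert quorum_overlaps(3, 2, 2)       # the classic N=3, R=W=2 setting
assert not quorum_overlaps(3, 1, 1)   # fast, but reads may be stale
```

Riak exposes exactly these knobs per request, which is what "tunable CAP" means in practice.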

[+] aphyr|15 years ago|reply
To my knowledge, there is no such thing as an in-memory bitcask backend for Riak. Could you have been using regular Bitcask (keeps values on disk in log-structured files) or the ETS memory backend?
[+] rb2k_|15 years ago|reply
It's great to see that I'm not the only person having problems with CouchDB's compaction failing when dealing with larger datasets.

I had the exact same experience during my master's thesis (hn: http://news.ycombinator.com/item?id=2022192 ), and this was one of the reasons why CouchDB didn't seem like a good solution for a dataset with a high number of updates.

[+] tillk|15 years ago|reply
I'm a little let down by the detail of this blog post.

For anyone who cares: we've had 6x as many documents (as linkfluence) in a single CouchDB instance before we moved to Cloudant. We now have over 360 million documents on Cloudant.

Our database is very write-intensive: lots of new documents, few updates to existing documents, and a lot of reads of the documents in between. We also have a couple of larger views to aggregate the data for our users.

The ride with CouchDB has been bumpy at times, but not at the 'scale' where linkfluence is at.

[+] karmi|15 years ago|reply
> Each modification generates a new version of the document. (...) we don’t need it, and there’s no way to disable it (...)

Note that there's a `_revs_limit` setting available: <http://wiki.apache.org/couchdb/HTTP_database_API>. It's very beneficial for a use case like yours.

(Though I've seen CouchDB performing rather poorly when taking really heavy read/write load or compacting big datasets, on occasion.)

[+] davidw|15 years ago|reply
Hrm... with N "nosql" things, there's great potential for writing. Redis vs Riak. Cassandra to CouchDB. Why I gave up on Mongo and moved to Mnesia. And so on and so forth ;-)
[+] pjscott|15 years ago|reply
Your comment applies to any job for which there are multiple competing tools.