Why Cassandra Doesn't Need Vector Clocks

[+] seancribbs|12 years ago|reply

Since they're not likely to approve my comment on their blog, here's what I said:

"Way to misrepresent[1] vector clock usage in Riak! LWW deliberately ignores the vector clock. No one would use that in production without a strong assurance that they will never have concurrent writes. Also note that later in the post[2] Kyle shows how using them properly leads to zero data-loss.

[1] https://yourlogicalfallacyis.com/strawman [2] http://aphyr.com/posts/285-call-me-maybe-riak"

To add on,

> Cassandra addresses the problem that vector clocks were designed to solve by breaking up documents/objects/rows into units of data that can be updated and merged independently. This allows Cassandra to offer improved performance and simpler application design.

This is not solving the problem vector clocks solve, it is punting on the resolution issue. Perhaps LWW partial updates result in greater performance, but they only solve performance.

Listen to or watch http://thinkdistributed.io/blog/2012/08/28/causality.html

To be fair, both designs are valid choices, but jbellis should be honest about his tradeoffs and not simply cast aspersions on other valid designs because they aren't the one that C* chose.

[+] aphyr|12 years ago|reply

I concur; this is punting on the resolution problem.

As far as I can determine in testing with Jepsen, there are no cases where one can safely (e.g. in a way which guarantees some causal connection of your write to a future state of the system) update a cell in Cassandra without a strong timestamp coordinator: either an external system like Zookeeper, or Cassandra 2.0 paxos transactions.

Most of the production users and datastax employees I've spoken with recommend structuring your data as a write-once immutable log of operations, and to merge conflicts on read. This is exactly the same problem that Riak siblings have--except that you don't get the benefit of automatic pruning of old versions, because there's no causality tracking. GC is up to the user.

[+] jbellis|12 years ago|reply

I explained my reasoning in more detail in the paragraphs starting with "Vector clocks are good at helping clients with simple merges like the above user object, but it’s important to understand that vector clocks only tell you that a conflict occurred, and not how to resolve it" and "What Cassandra gives up here is the ability to create custom behavior for updates-based-on-existing-values. However, as counters illustrate, vector clocks are often inadequate for this as well. Usually you end up having to store the entire version history, which Cassandra supports by clustering with uuids."

I think what you and aphyr are getting hung up on is this last part.

In any case, I wrote this because I got tired of people who read the Dynamo paper but haven't thought it through reacting to Cassandra with ZOMG YOU'LL SURELY LOSE DATA WITHOUT VECTOR CLOCKS. When in fact, this decision represents one that we made deliberately, for the reasons I've tried to describe. Maybe some not casting aspersions on other valid designs might be in order all around.

Thanks for reading!

[+] astral303|12 years ago|reply

Cassandra counters once were implemented as a blazing-fast vector-clock patch [1,2] that did not involve read-before-write on update. The biggest (IMO) downside is that it did not support decrements. So Cassandra tech leadership decided [2] to implement counters as "read, increment, write." However, that came with two major downsides: 1) counter updates were now no longer replayable--so if you experience an update timeout on the Cassandra side under heavy load or due to a lengthy compaction or some network fluke, you just can't replay it. You need to hope that it has made it over. Counters run on hope. But you can decrement them now. 2) read-before-write can severely limit performance in the case of a large number of counter writes (as the workload becomes much more read-intensive, unlike regular blazing-fast Cassandra writes).

1 - https://issues.apache.org/jira/browse/CASSANDRA-580 2 - https://issues.apache.org/jira/browse/CASSANDRA-1072

However, in many many a use case for counters, you just don't need to decrement. And if you need to decrement, you can often get away with two counters, that you subtract from each other. So that's a shame, as restricting the use case for counters to be monotonically increasing would've been just the right trade-off for many users.

[+] jbellis|12 years ago|reply

I think you've misread the 580 patch. By definition, you need to read the existing vector clock value before you can issue an update against it.

/Member of the presumedly incompetent Cassandra tech leadership

[+] ddorian43|12 years ago|reply

Hypertable has crazy fast increments-decrements(without reads). If the column is in memory it is updated on the spot. Else multiple values are inserted and merged on read or compaction. But it is a little different from cassandra, more similar to hbase.

[+] AaronFriel|12 years ago|reply

That's great, but what if you want vector clocks in Cassandra? For example, storing opaque blobs encrypted by a remote client. You might want to store both versions. Does Cassandra offer the ability to store lists of primitive data types and atomically insert/remove elements? This would allow clients to atomically append and then (with less need for guaranteed ordering) asynchronously attempt to remove the ancestor record. Even if the removal fails, future gets would pull the list and again attempt to synchronize to a single record.

I'd suppose that under very high contention you could have some problems, but at least you'd be able to implement vector clocks as a library without touching the Cassandra core.

Apologies if my question is common knowledge. And for anyone who is well versed in other key value stores, does it support vector clocks or something like the above (list elements with atomic append/prepend?)

[+] jbellis|12 years ago|reply

If you want to manage opaque blobs then you should probably use Riak or Voldemort. Cassandra gives you a set of one-size-fits-most tools that by and large work very well together, but if you want more of a build-your-own-database toolkit, it is a bad fit.

[+] ghc|12 years ago|reply

I don't see what's so scary about vector clocks. I have no trouble with them in Riak. Sure, there's a little extra logic to deal with certain types of siblings, but honestly I haven't encountered any difficult edge cases.

But then again, maybe it's highly dependent on what you use Riak for. Anyone out there have any type of application they've built with Riak that did encounter issues with using vector clocks?

[+] gigq|12 years ago|reply

It's worth noting that HBase has the same method of using LWW on column level updates. While this usually is what you want and like Cassandra it gives you the ability to do fast blind writes there are sometimes that you need to make sure you aren't having conflicting writes.

The solution that HBase employs is to have checkAndPut functionality. Basically what this lets you do is write a value and only successfully save the update if a given column is the value you expect.

So for example you could have a "records" table that has a column called "check" whenever you update a record you pull the old one, do whatever processing you want to do on it, set the "check" column to a new UUID and then save it with a checkAndPut where you specify the "check" column in hbase has to be the old check value you read. If any other process wrote to this row then it would have updated the check value with a new UUID and so this checkAndPut will fail thereby detecting the conflict. Now you can repull the row and handle any conflict resolution without blindly overwriting the changes.

[+] jbellis|12 years ago|reply

Cassandra exposes functionality similar to checkAndPut as Lightweight Transactions: www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0

[+] banachtarski|12 years ago|reply

This isn't anything dramatically new. Essentially, there is a tension between data normalization and requirements for strong consistency.

If my data is more segmented, I can afford to be less stringent about it. I think that foregoing vector clocks in an distributed database entirely though, is a recipe for disaster, no matter how denormalized my data is.

[+] marshray|12 years ago|reply

Riak/C* noob here. Couldn't I just store data as column cells in Riak?

    bucket: 'users'
    'jbellis/email', '[email protected]'
    'jbellis/phone', '555-5555'

What would I lose by doing this?

[+] iamaleksey|12 years ago|reply

Performance, mostly. To read both the email and the phone you'll have to make two requests (or a single multiget). The keys have a good chance of belonging to different replicas, too, and even if they don't, you are still going to have 2x disk reads.

[+] cbsmith|12 years ago|reply

I believe that is precisely what Riak does in the "Riak now does this" link provided, and that's effectively what is going on under the hood with Cassandra. This means more overhead per column and the loss of the ability to just encode the data fields in an application native serialization structure.

In effect, you throw out a lot of the less talked about advantages of NoSQL and end up with something more like traditional SQL databases (not necessarily a bad thing, just different).

[+] jtuple|12 years ago|reply

Avinash Lakshman make a decision early in Cassandra's life cycle not to use vector clocks for performance reasons. You gain performance, you lose the ability to safely update an existing key concurrently.

This is a reasonable tradeoff.

Cassandra has since evolved around this limitation: use write-once or immutable data, transform updates into appending new columns rather than modifications, handle all of this automagically through CQL datatypes, etc.

Really cool stuff. A nice way to evolve a product around a limitation and still be immensely useful.

But, seriously, it's a tradeoff. Defending that choice against those who question Cassandra is fine. But, jbellis has a history of claiming Cassandra is a "one-size-fits-most tool", proclaiming that vector clocks aren't needed in 99% of cases, etc. That's opinion presented as fact. Without omniscience no one really knows what all users/the market actually needs, what 100% of all use cases look like, etc. Let's cut the rhetoric and realize engineering is about tradeoffs.

I still wonder if Cassandra won't someday add vector clocks, just like they eventually added vnodes.

In Riak land, we have vector clocks. Do we pay a performance hit? Yes. Must we? Maybe.

In theory, vector clocks should only be expensive when updating existing data. If you are writing data once (the standard Cassandra approach), vector clocks should be free. They're not currently, but that's an implementation detail that Riak can fix. And we're considering fixing in the near future.

What about when you have multiple updates to the same key? You must pay a penalty there, right? Maybe. We're actively looking at approaches to reduce that penalty in the future as well. Summary: have multiple versions of an object, just append new version on write, read all versions on read and rollup siblings/LWW resolve. Basically, identical to "just append a column" that Cassandra uses.

It's easy to extend Riak to support the same operations Cassandra does, with the same performance characteristics, while still supporting conflict-detection for concurrent modifications. Best of both worlds. We'll like do it at some point too, as that's what engineers do -- we make products better over time.

As an side, a benefit of per-key conflict detection and siblings is that it doesn't require a sorted backend. The multi-version approach Cassandra uses requires sorted data to be efficient on reads.

While most users of Riak use the LevelDB backend today which is a sorted backend similar to Cassandra's backend (both log-structured SST systems), Riak also supports the non-sorted Bitcask backend for folks that want a high performance purely K/V store. Bitcask is about 10x faster than LevelDB in many workloads, because it doesn't waste time sorting/rewriting data. Write amplification really hurts for certain workloads.

Yet another tradeoff. Engineering is always about those pesky tradeoffs.

[+] rdtsc|12 years ago|reply

I don't know much about CAP but it doesn't sound right to me.

Let's see, you store your data tuple for example for a position say {X,Y} as 2 coordinates. X in one column and Y in another one.

Now when it gets updated concurrently and it needs to merge two conflicting {X1,Y1} and {X2,Y2} positions you could end up instead with {X1,Y2} non-existing/impossible/broken position.

Is that really that easily broken? Or am I not thinking about it straight.

[+] gigq|12 years ago|reply

It wasn't really mentioned in the article but if you do both {X1,Y1} and {X2,Y2} as full row updates then no you can't end up with a mixed row.

http://www.datastax.com/dev/blog/row-level-isolation

[+] brown9-2|12 years ago|reply

In Cassandra the second update agreed upon by the cluster would "win", so either (X1,Y1) or (X2,Y2). Since there is no clock mechanism there is no way for the cluster to see that the data has changed since the requester decided it wanted to update the value.

68 comments