This is a beautiful site with tons of information, graphs, lists, examples, etc., and yet it omits the one thing that underpins it all: how it actually works.
The claims are: serializable ACID cross-machine transactions "without performance penalty," and MVCC with optimistic concurrency control.
Optimistic concurrency control means that the server has to check the version on all modified data before the transaction commits. Cross-machine transactions mean that this version check has to happen on multiple machines. ACID means that both machines have to either commit (if all the version checks succeed) or roll back. How are you going to reconcile all of these requirements without resorting to two-phase commit, which most certainly has a performance penalty? And how are you going to get serializable transactions across machines without processing these cross-machine commits one at a time (waiting for two-phase commit each time)?
I'm just not seeing it.
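To make the objection concrete, here is a minimal sketch (entirely hypothetical, not any vendor's actual code) of OCC validation combined with two-phase commit across machines. Every `prepare` and `commit` call stands in for a network round trip, which is where the presumed performance penalty lives:

```python
# Hypothetical sketch: OCC validation plus two-phase commit.
# Each participant checks that the versions the transaction read are
# still current; the coordinator commits only if every vote is yes.

class Participant:
    def __init__(self):
        self.versions = {}  # key -> version number
        self.data = {}      # key -> value

    def prepare(self, read_set):
        # OCC validation: every key the transaction read must be
        # unmodified since it was read.
        return all(self.versions.get(k, 0) == v for k, v in read_set.items())

    def commit(self, writes):
        for k, v in writes.items():
            self.data[k] = v
            self.versions[k] = self.versions.get(k, 0) + 1

def two_phase_commit(parts):
    # parts: list of (participant, read_set, writes).
    # Phase 1: a round trip per machine to validate ("prepare").
    if all(p.prepare(read_set) for p, read_set, _ in parts):
        # Phase 2: another round trip per machine to commit.
        for p, _, writes in parts:
            p.commit(writes)
        return True
    return False  # any failed version check aborts everywhere
```

Even this toy version shows the cost structure: two sequential rounds of messages per cross-machine commit.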
I'll try to address your questions. This isn't a complete explanation of how our system works!
Most people have approached this problem by taking multiple machines, each capable of processing transactions, and doing distributed transactions between them. This leads to systems that do local transactions fast but global transactions much more slowly, and to a tradeoff between the quality of fault tolerance (Paxos = good, two-phase commit = bad) and commit performance.
Our approach is instead to decouple the first job you mention (checking if transactions conflict to provide isolation) from the second job (making transactions durable). These can then be scaled independently.
The first job, conflict resolution, is a small fraction of the total work and needs access only to the key ranges used by each transaction, not the existing contents of the database or the actual values being written. This allows it to be very fast.
The rest of the system is a consistent, durable distributed store that has to preserve a predetermined ordering of writes, and provide a limited window of MVCC snapshots for reads.
Does that help at all?
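As an illustration of how a resolver can work on key ranges alone, here is a hedged sketch (my own invention, including the names `Resolver` and `try_commit`, not the actual design): a transaction conflicts if someone committed a write into a range it read after its read version, and the check never touches stored values.

```python
# Hypothetical sketch of a conflict resolver that works purely on key
# ranges and commit versions, never touching the values being written.

def overlaps(r1, r2):
    # Half-open ranges (begin, end) overlap iff each starts before the
    # other ends.
    return r1[0] < r2[1] and r2[0] < r1[1]

class Resolver:
    def __init__(self):
        self.recent_writes = []  # (commit_version, written_key_range)

    def try_commit(self, read_version, read_ranges, write_ranges):
        # Conflict: someone committed a write into a range this
        # transaction read, after the version at which it read.
        for version, wrange in self.recent_writes:
            if version > read_version and any(
                    overlaps(wrange, r) for r in read_ranges):
                return None  # abort
        commit_version = (self.recent_writes[-1][0] + 1
                          if self.recent_writes else 1)
        for w in write_ranges:
            self.recent_writes.append((commit_version, w))
        return commit_version
```

Because the resolver only sees (version, range) pairs, its working set is tiny compared to the database itself, which is the claimed reason it can be fast.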
Serialized transactions over a broad distribution of keys isn't a huge problem, but you're right: this bodes poorly for hot data. I'm more concerned about their CAP semantics: I'm sorry, but claiming multi-DC availability and ACID transactions is not gonna work.
Hmm. Okay. I open a transaction for three days (let's say indefinitely), maybe reading a few tuples here and there and updating a few, but never committing. How does this system not fall over? I guess it could abort my transaction...
Still, I think their performance numbers can be legit, because they are much, much slower than what a handful of machines (much less than 24) could do if fully partitioned and without coordination. 500K reads/second on such small values is not really fantastic performance over 24 machines. I don't understand the initial burst capacity on the read side either, although I can guess -- on writes it can make sense because some work is deferred, but I'm trying to understand how that can happen on reads as well. My guess is a transaction ID allocation on read that hits a wall somewhere.
Another common use case is taking a consistent backup/copy. For large databases, that is a very, very long snapshot to maintain under MVCC, or a lot of locks to acquire under 2PL. How does the system behave then?
All in all, I think it's pretty neat, and I like that someone is dealing with OLTP database problems with a mind towards easing one's burden via transactions.
As other posters have already guessed, we don't support long-running transactions (we only keep multi-version information for a short time). So if you keep a transaction open for a long time it will not commit (after a while reads will start to fail, too).
It's not architecturally impossible for us to support long-running read snapshots, but it is expensive for our storage servers to keep snapshots alive for a long time. So our backup solution instead works by backing up transaction logs while doing a non-isolated read of the database. At restore time we will replay the logs to get back a consistent point-in-time snapshot.
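A toy sketch of that restore idea (with hypothetical simplifications: a flat key-value state, and a log that starts before the fuzzy copy began): replaying logged writes up to a target version over the non-isolated copy repairs anything the copy saw mid-flight.

```python
# Hypothetical sketch of point-in-time restore: a non-isolated ("fuzzy")
# base copy plus a replay of the transaction log up to the target
# version yields a consistent snapshot, because later log entries
# overwrite whatever inconsistent values the fuzzy read captured.

def restore(fuzzy_copy, log, target_version):
    state = dict(fuzzy_copy)
    for version, key, value in log:
        if version <= target_version:
            state[key] = value  # log entries win over fuzzy reads
    return state
```

The trick is that the backup never needs a long-lived snapshot on the storage servers; consistency is reconstructed at restore time instead.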
The reason that we see higher burst than steady-state performance has nothing to do with transactions. We have to do reads from SSD as the application requests them, and we have to make writes durable by logging them right away, but as you surmise we can defer the hard work of doing random writes to our B-trees for a while. Even a workload that is 90% reads benefits a lot from deferring writes, because writes have to be 3x replicated and because consumer SSDs are comparatively slow at mixed read/write workloads. Read-only workloads will not see any initial burst (and might see a ramp-up time from cold cache effects).
> Okay. I open a transaction for three days (let's say indefinitely), maybe reading a few tuples here and there and updating a few, but never committing. How does this system not fall over?
They say they use optimistic concurrency control, so your writers aren't taking any locks; instead, the transaction will just fail at commit time if any of your read-modify-writes were modified in the meantime.
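Clients typically wrap such optimistic transactions in a retry loop, so a conflict abort just re-runs the transaction function. A minimal sketch with an invented `ConflictError`/`run_transaction` API (not any real client library):

```python
# Hypothetical client-side retry loop for optimistic concurrency:
# on a conflict, the whole transaction function is simply re-run
# against fresh data.

class ConflictError(Exception):
    pass

def run_transaction(db, txn_fn, max_retries=10):
    for _ in range(max_retries):
        try:
            return txn_fn(db)
        except ConflictError:
            continue  # someone else won the race; try again
    raise RuntimeError("transaction retry limit exceeded")
```

This is why OCC systems ask that transaction functions be free of side effects other than their reads and writes: the function may run more than once.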
Yes, we provide the strongest level of ACID semantics. Although proving things about large computer programs is pretty hard, we have spent much of the past three years building testing and validation systems to ensure this is true. We run tens of thousands of nightly simulations to check system correctness and properties in the face of machine failures, partitions, etc. We also run long-running tests on real-world clusters using programmable network and power switches to simulate these same cases in the real world.
So, we've convinced ourselves. What would you like to see on the site to help provide the kind of incredible evidence you're looking for?
Honestly... if you have figured out how to scale with full ACID... I have a hard time believing that having SQL support in there is going to "kill" your ability to scale.
I'm guessing that NoSQL documents are written at a coarser granularity than SQL table rows. One document could well represent 3, 4, or 5 rows in a normalized SQL database.
OTOH, if you put all your "documents" in 3rd normal form, you might not see much gain.
ACID represents one design philosophy at the consistency end of the consistency-availability spectrum. According to CAP, to have the maximum Consistency of ACID, you'd need to trade off against a lower Availability and/or Partitionability.
"Trading off against partitionability" is kind of a strange concept -- not only can you not prevent network partitions, you usually can't tell when they've happened until it's too late.
The real upshot of the CAP theorem is that you have to choose what goes in the face of a partition. An ACID system says you lose availability (writes, and possibly reads, fail if you can't reach a majority of the participants); an Eventually Consistent system may pick either.
You can use Paxos. There will be some partitions that can cause an outage, but this is only the case when there is no majority of nodes that can communicate with each other, which in practice means you are exposing yourself to a vanishingly small risk.
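The majority rule is easy to state in code. An illustrative sketch (not tied to any particular Paxos implementation): at most one side of a partition can hold a strict majority, so at most one side keeps accepting writes.

```python
# Illustrative sketch: a Paxos-style cluster makes progress exactly
# when a side of the partition contains a strict majority of nodes.

def can_make_progress(reachable_nodes, cluster_size):
    return reachable_nodes > cluster_size // 2

def partition_outcome(partition_sizes, cluster_size):
    # Since majorities are strict, at most one entry can be True;
    # every other side of the partition loses availability.
    return [can_make_progress(s, cluster_size) for s in partition_sizes]
```

The "vanishingly small risk" is the case where no side has a majority at all, e.g. a five-node cluster splitting 2/2/1.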
Presumably they sacrifice availability? If your network is only partitioned 1/1000 of the time and you only need 99.9% uptime, that would be a viable approach.
Looking at what they are providing, it's a shared-nothing architecture, with data stored in ordered form. Technically, you can make this work in a very scalable fashion with an SSTable-style store.
Think of it as LevelDB with a distributed B+ tree (or even just a few extra levels) handling the partitioning between nodes. That can scale quite well, and wrapping updates and reads in snapshots at very low overhead provides all the key bits needed to handle ACID in a distributed database.
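A sketch of that "extra level" (hypothetical names, standard library only): a sorted partition map routes each key to the node owning its range, much like the top level of a B+ tree sitting above per-node ordered stores such as LevelDB instances.

```python
# Hypothetical sketch of range partitioning: a sorted list of split
# keys routes each key to the node owning its range, in O(log n),
# like a top-level B+ tree node above per-machine ordered stores.

import bisect

class PartitionMap:
    def __init__(self, split_keys, nodes):
        # split_keys: sorted boundaries; len(nodes) == len(split_keys) + 1
        self.split_keys = split_keys
        self.nodes = nodes

    def node_for(self, key):
        # bisect_right finds which half-open range the key falls into.
        return self.nodes[bisect.bisect_right(self.split_keys, key)]
```

Because keys stay ordered globally, range scans touch only the nodes whose ranges intersect the scan, which is what makes this layout scale.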
What's problematic is when you have a transaction that spans several nodes and you need to reconcile several transactions (the commit phase).
The usual way to do it is to refuse the transaction if a conflict is detected; however, this is a real performance and usability issue for a NoSQL database.
What I'm implying is that if you have ACID transactions but they can fail very easily, you don't offer much...
This is an unsolved problem to me; I submit there is fine print regarding ACID or scalability...
Perhaps I'm being a designer-snob, but Twitter's Bootstrap has quickly become too popular. I honestly have a hard time taking a site seriously when it uses the default Bootstrap look.
Having a unique visual identity is extremely important.
Honestly, if someone wants to sell me a database, I'd prefer they spend money on database engineers rather than on designers. To each his own, I guess.
But launching an entire new database does deserve its own visual identity. That said, their logo is rather nice!
I don't think it's a designer-snob thing. Taking the Bootstrap-provided website design together with a lot of overreaching claims of db goodness gives me an immediate sense of there being no there there.
Bigcouch and Couchbase have been around for a while. There is also CouchDB (without the scaling).
I was nervous writing that sentence, because it's hard to be sure of the truth of any claim to be 'first'! But none of the examples you mention, to my knowledge, provide multi-key ACID transactions. A compare-and-set facility for an individual key/document/row/entity group/etc. is a very useful feature, but cannot be used to provide atomicity or isolation for transactions that read or write more than one key.
We plan to write something longer talking about the different levels of support that various products provide for A,C,I, and D.