This is a beautiful site with tons of information, graphs, lists, examples, etc., and yet it omits the one thing that underpins it all: how it actually works.
The claims are: serializable ACID cross-machine transactions "without performance penalty," and MVCC with optimistic concurrency control.
Optimistic concurrency control means that the server has to check the version on all modified data before the transaction commits. Cross-machine transactions mean that this version check has to happen on multiple machines. ACID means that both machines have to either commit (if all the version checks succeed) or roll back. How are you going to reconcile all of these requirements without resorting to two-phase commit, which most certainly has a performance penalty? And how are you going to get serializable transactions across machines without processing these cross-machine commits one at a time (waiting for two-phase commit each time)?
I'm just not seeing it.
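To make the objection concrete, here is a minimal sketch (entirely hypothetical, not any vendor's actual code) of OCC validation combined with two-phase commit across machines. Every `prepare` and `commit` call stands in for a network round trip, which is where the presumed performance penalty lives:

```python
# Hypothetical sketch: OCC validation plus two-phase commit.
# Each participant checks that the versions the transaction read are
# still current; the coordinator commits only if every vote is yes.

class Participant:
    def __init__(self):
        self.versions = {}  # key -> version number
        self.data = {}      # key -> value

    def prepare(self, read_set):
        # OCC validation: every key the transaction read must be
        # unmodified since it was read.
        return all(self.versions.get(k, 0) == v for k, v in read_set.items())

    def commit(self, writes):
        for k, v in writes.items():
            self.data[k] = v
            self.versions[k] = self.versions.get(k, 0) + 1

def two_phase_commit(parts):
    # parts: list of (participant, read_set, writes).
    # Phase 1: a round trip per machine to validate ("prepare").
    if all(p.prepare(read_set) for p, read_set, _ in parts):
        # Phase 2: another round trip per machine to commit.
        for p, _, writes in parts:
            p.commit(writes)
        return True
    return False  # any failed version check aborts everywhere
```

Even this toy version shows the cost structure: two sequential rounds of messages per cross-machine commit.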
I'll try to address your questions. This isn't a complete explanation of how our system works!
Most people have approached this problem by taking multiple machines, each capable of processing transactions, and doing distributed transactions between them. This leads to systems that do local transactions fast but global transactions much more slowly, and to a tradeoff between the quality of fault tolerance (Paxos = good, two-phase commit = bad) and commit performance.
Our approach is instead to decouple the first job you mention (checking if transactions conflict to provide isolation) from the second job (making transactions durable). These can then be scaled independently.
The first job, conflict resolution, is a small fraction of the total work and needs access only to the key ranges used by each transaction, not the existing contents of the database or the actual values being written. This allows it to be very fast.
The rest of the system is a consistent, durable distributed store that has to preserve a predetermined ordering of writes, and provide a limited window of MVCC snapshots for reads.
Does that help at all?
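As an illustration of how a resolver can work on key ranges alone, here is a hedged sketch (my own invention, including the names `Resolver` and `try_commit`, not the actual design): a transaction conflicts if someone committed a write into a range it read after its read version, and the check never touches stored values.

```python
# Hypothetical sketch of a conflict resolver that works purely on key
# ranges and commit versions, never touching the values being written.

def overlaps(r1, r2):
    # Half-open ranges (begin, end) overlap iff each starts before the
    # other ends.
    return r1[0] < r2[1] and r2[0] < r1[1]

class Resolver:
    def __init__(self):
        self.recent_writes = []  # (commit_version, written_key_range)

    def try_commit(self, read_version, read_ranges, write_ranges):
        # Conflict: someone committed a write into a range this
        # transaction read, after the version at which it read.
        for version, wrange in self.recent_writes:
            if version > read_version and any(
                    overlaps(wrange, r) for r in read_ranges):
                return None  # abort
        commit_version = (self.recent_writes[-1][0] + 1
                          if self.recent_writes else 1)
        for w in write_ranges:
            self.recent_writes.append((commit_version, w))
        return commit_version
```

Because the resolver only sees (version, range) pairs, its working set is tiny compared to the database itself, which is the claimed reason it can be fast.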
Serialized transactions over a broad distribution of keys isn't a huge problem, but you're right: this bodes poorly for hot data. I'm more concerned about their CAP semantics: I'm sorry, but claiming multi-DC availability and ACID transactions is not gonna work.
Hmm. Okay. I open a transaction for three days (let's say indefinitely), maybe reading a few tuples here and there and updating a few, but never committing. How does this system not fall over? I guess it could abort my transaction...
Still, I think their performance numbers can be legit, because they are much, much slower than what a handful of machines (much less than 24) could do if fully partitioned and without coordination. 500K reads/second on such small values is not really fantastic performance over 24 machines. I don't understand the initial burst capacity on the read side either, although I can guess -- on writes it can make sense because some work is deferred, but I'm trying to understand how that can happen on reads as well. My guess is a transaction ID allocation on read that hits a wall somewhere.
Another common use case is taking a consistent backup/copy. For large databases, that is a very, very long snapshot to maintain under MVCC, or a lot of locks to acquire under 2PL. How does the system behave then?
All in all, I think it's pretty neat, and I like that someone is dealing with OLTP database problems with a mind towards easing one's burden via transactions.
As other posters have already guessed, we don't support long-running transactions (we only keep multi-version information for a short time). So if you keep a transaction open for a long time it will not commit (after a while reads will start to fail, too).
It's not architecturally impossible for us to support long-running read snapshots, but it is expensive for our storage servers to keep snapshots alive for a long time. So our backup solution instead works by backing up transaction logs while doing a non-isolated read of the database. At restore time we will replay the logs to get back a consistent point-in-time snapshot.
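A toy sketch of that restore idea (with hypothetical simplifications: a flat key-value state, and a log that starts before the fuzzy copy began): replaying logged writes up to a target version over the non-isolated copy repairs anything the copy saw mid-flight.

```python
# Hypothetical sketch of point-in-time restore: a non-isolated ("fuzzy")
# base copy plus a replay of the transaction log up to the target
# version yields a consistent snapshot, because later log entries
# overwrite whatever inconsistent values the fuzzy read captured.

def restore(fuzzy_copy, log, target_version):
    state = dict(fuzzy_copy)
    for version, key, value in log:
        if version <= target_version:
            state[key] = value  # log entries win over fuzzy reads
    return state
```

The trick is that the backup never needs a long-lived snapshot on the storage servers; consistency is reconstructed at restore time instead.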
The reason that we see higher burst than steady-state performance has nothing to do with transactions. We have to do reads from SSD as the application requests them, and we have to make writes durable by logging them right away, but as you surmise we can defer the hard work of doing random writes to our B-trees for a while. Even a workload that is 90% reads benefits a lot from deferring writes, because writes have to be 3x replicated and because consumer SSDs are comparatively slow at mixed read/write workloads. Read-only workloads will not see any initial burst (and might see a ramp-up time from cold cache effects).
> Okay. I open a transaction for three days (let's say indefinitely), maybe reading a few tuples here and there and updating a few, but never committing. How does this system not fall over?
They say they use optimistic concurrency control, so your writers aren't taking any locks; instead, the transaction will just fail at commit time if any of your read-modify-writes were modified in the meantime.
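Clients typically wrap such optimistic transactions in a retry loop, so a conflict abort just re-runs the transaction function. A minimal sketch with an invented `ConflictError`/`run_transaction` API (not any real client library):

```python
# Hypothetical client-side retry loop for optimistic concurrency:
# on a conflict, the whole transaction function is simply re-run
# against fresh data.

class ConflictError(Exception):
    pass

def run_transaction(db, txn_fn, max_retries=10):
    for _ in range(max_retries):
        try:
            return txn_fn(db)
        except ConflictError:
            continue  # someone else won the race; try again
    raise RuntimeError("transaction retry limit exceeded")
```

This is why OCC systems ask that transaction functions be free of side effects other than their reads and writes: the function may run more than once.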
Yes, we provide the strongest level of ACID semantics. Although proving things about large computer programs is pretty hard, we have spent much of the past three years building testing and validation systems to ensure this is true. We run tens of thousands of nightly simulations to check system correctness and properties in the face of machine failures, partitions, etc. We also run long-running tests on real-world clusters using programmable network and power switches to simulate these same cases in the real world.
So, we've convinced ourselves. What would you like to see on the site to help provide the kind of incredible evidence you're looking for?
Honestly... if you have figured out how to scale with full ACID... I have a hard time believing that having SQL support in there is going to "kill" your ability to scale.
I'm guessing that NoSQL documents are written at a coarser granularity than SQL table rows. One document could well represent 3, 4, or 5 rows in a normalized SQL database.
OTOH, if you put all your "documents" in 3rd normal form, you might not see much gain.
ACID represents one design philosophy at the consistency end of the consistency-availability spectrum. According to CAP, to have the maximum Consistency of ACID, you'd need to trade off against a lower Availability and/or Partitionability.
"Trading off against partitionability" is kind of a strange concept -- not only can you not prevent network partitions, you usually can't tell when they've happened until it's too late.
The real upshot of the CAP theorem is that you have to choose what goes in the face of a partition. An ACID system says you lose availability (writes, and possibly reads, fail if you can't reach a majority of the participants); an Eventually Consistent system may pick either.
You can use Paxos. There will be some partitions that can cause an outage, but this is only the case when there is no majority of nodes that can communicate with each other, which in practice means you are exposing yourself to a vanishingly small risk.
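The majority rule is easy to state in code. An illustrative sketch (not tied to any particular Paxos implementation): at most one side of a partition can hold a strict majority, so at most one side keeps accepting writes.

```python
# Illustrative sketch: a Paxos-style cluster makes progress exactly
# when a side of the partition contains a strict majority of nodes.

def can_make_progress(reachable_nodes, cluster_size):
    return reachable_nodes > cluster_size // 2

def partition_outcome(partition_sizes, cluster_size):
    # Since majorities are strict, at most one entry can be True;
    # every other side of the partition loses availability.
    return [can_make_progress(s, cluster_size) for s in partition_sizes]
```

The "vanishingly small risk" is the case where no side has a majority at all, e.g. a five-node cluster splitting 2/2/1.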
Presumably they sacrifice availability? If your network is only partitioned 1/1000 of the time and you only need 99.9% uptime, that would be a viable approach.
Looking at what they are providing, it's a shared-nothing architecture, with data stored in ordered form. Technically, you can make this work in a very scalable fashion with an SSTable-style store.
Think of it as LevelDB with a distributed B+ tree (or even just a few extra levels) handling the partitioning between nodes. That can scale quite well, and wrapping updates and reads in snapshots at very low overhead provides all the key bits needed to handle ACID in a distributed database.
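A sketch of that "extra level" (hypothetical names, standard library only): a sorted partition map routes each key to the node owning its range, much like the top level of a B+ tree sitting above per-node ordered stores such as LevelDB instances.

```python
# Hypothetical sketch of range partitioning: a sorted list of split
# keys routes each key to the node owning its range, in O(log n),
# like a top-level B+ tree node above per-machine ordered stores.

import bisect

class PartitionMap:
    def __init__(self, split_keys, nodes):
        # split_keys: sorted boundaries; len(nodes) == len(split_keys) + 1
        self.split_keys = split_keys
        self.nodes = nodes

    def node_for(self, key):
        # bisect_right finds which half-open range the key falls into.
        return self.nodes[bisect.bisect_right(self.split_keys, key)]
```

Because keys stay ordered globally, range scans touch only the nodes whose ranges intersect the scan, which is what makes this layout scale.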
What's problematic is when you have a transaction that spans several nodes and you need to reconcile several transactions (the commit phase).
The usual way to do it is to refuse the transaction if a conflict is detected; however, this is a real performance and usability issue for a NoSQL database.
What I'm implying is that if you have ACID transactions but they can fail very easily, you don't offer much...
This is an unsolved problem to me; I submit there is fine print regarding ACID or scalability...
Perhaps I'm being a designer-snob, but Twitter's Bootstrap has quickly become too popular. I honestly have a hard time taking a site seriously when it uses the default Bootstrap look.
Having a unique visual identity is extremely important.
Honestly, if someone wants to sell me a database, I'd prefer they spend money on database engineers rather than on designers. To each his own, I guess.
But launching an entire new database does deserve its own visual identity. That said, their logo is rather nice!
I don't think it's a designer-snob thing. Taking the Bootstrap-provided website design together with a lot of overreaching claims of db goodness gives me an immediate sense of there being no there there.
Bigcouch and Couchbase have been around for a while. There is also CouchDB (without the scaling).
I was nervous writing that sentence, because it's hard to be sure of the truth of any claim to be 'first'! But none of the examples you mention, to my knowledge, provide multi-key ACID transactions. A compare-and-set facility for an individual key/document/row/entity group/etc. is a very useful feature, but cannot be used to provide atomicity or isolation for transactions that read or write more than one key.
We plan to write something longer talking about the different levels of support that various products provide for A,C,I, and D.