I think Fauna is not very good at docs and communication yet, judging by the confusion in some of these comments and by reading the docs themselves. But launching will probably make them a lot better at it. Here are my notes, which may add clarity for some:
Similar to RethinkDB/MongoDB:
* Designed to be great for storing application data. Fields can be added dynamically (schemaless) and their values can be arrays, so it is easy to maintain data locality according to your application's access patterns.
* Uses a non-SQL query language
* Probably not great for ad-hoc reporting (arguably SQL is a requirement for that)
Unlike MongoDB: supports joins
Unlike RethinkDB: great support for transactions, just not SQL transactions with an open session (which are unnecessary for an application)
Unlike most databases:
* cloud-hosted and pay-for-use (on-premise is on their roadmap)
* claims support for graph data by storing arrays of references
* QoS built-in so you could run a slow analytics query without disrupting your application
Cons:
* Unfortunately, just like MongoDB/RethinkDB, there is no real database-level enforcement of schema or foreign-key integrity, but at least foreign keys are on their roadmap.
I am a huge fan of the cloud-hosted, pay-for-use aspect: I wonder why anyone would design a DB today without this in mind. You can transfer your data from a pay-for-use application DB (FaunaDB or Google Cloud Datastore) to a data warehouse (Snowflake or Google BigQuery) that is also pay-for-use and gives you SQL reporting abilities.
You might consider removing the badge on the home page that says "Global Latency 2.8 ms". Unless you really can give me latency across the globe of 2.8 ms, in which case your solution to the speed-of-light problem is quite impressive :)
It seems like one of the big issues with the marketing copy here is some of the word tricks being played:
Any first read of "The first serverless database" implies that the database itself is serverless. Comments from FaunaDB folks on this page clearly indicate that what they mean is that it's the first database for serverless, which is a pretty bold claim, given that Google, AWS, and any number of other providers offer databases that are accessible from serverless things. So it essentially boils down to "the first database that's marketed specifically to serverless use cases", which is maybe true but also kind of not a useful trophy to put on the mantle.
This is further muddled by the blog post linked to from the launch announcement (https://fauna.com/blog/escape-the-cloud-database-trap-with-s...), which includes "FaunaDB Serverless Cloud is an adaptive, serverless database". Nobody is reading that and thinking "ah, an adaptive database for serverless apps".
To describe it as "The first active-active multi-cloud database" is possibly true if you mean "the first time a single company has sold a publicly available database-as-a-service running on multiple cloud providers". But the text says "database" where "public database-as-a-service" would be the accurate term, leaving the reader with the impression that no existing database can be set up on multiple cloud providers in an active-active HA configuration, which is absurd. Fixing the copy here should be pretty easy, and they're already headed in the right direction with the next bullet point, although it too refers to "database" where it means "database-as-a-service".
It feels like somebody on marketing really wanted to have a list of firsts, so they toyed with definitions of words until they thought they could flex these into being technically accurate. I get the same feel from the closing argument in the linked blog post: "The query language, data model (including graphs and change feeds), security features, strong consistency, scalability and performance are best in class. There is no downside.". I don't think I want to trust a database if the folks designing it couldn't think of any downsides.
Understood. We can be careful to be more accurate, even in less technical contexts, in the future.
Serverless is supposed to mean: a database with serverless pricing, for serverless applications.
There's always a tension, too, between "does a feature exist somewhere" and "is it actually usable?". For example, you could perhaps run MySQL Cluster across multiple public clouds...but would you want to? It's surprisingly hard to design for true cross-continent global replication, and the additional latency of crossing public clouds makes it even worse.
We've tried to design the best database. We can put in some bugs for you to give it some downsides.
Hey everybody, today we launched FaunaDB Serverless Cloud, 4 years in the making. FaunaDB is a strongly consistent, globally distributed operational database. It’s relational, but not SQL.
We're excited to open our doors and explain more of our design decisions. Our team is from Twitter, and that experience has deeply informed our interface and architecture. Try it out and let us know what you think.
By "Serverless", do you just mean DBaaS? "Serverless" in the context of a database is kinda weird because the data does have to be stored somewhere. This branding doesn't make much sense to me.
> It’s relational, but not SQL.
Why not SQL? Is there something your query language supports that SQL doesn't?
Why not? If you already support relational algebra, it seems like a no-brainer to just add SQL. Even if it's only SQL-92, you would be able to support some existing tools/ORMs almost for free.
This looks super neat, and I can't wait to learn more about it, but just for the record: I'm pretty sure this isn't the first serverless cloud database. Both Firebase's Realtime Database and Cloud Datastore (which powers Snapchat and Pokemon Go) are serverless; you pay only for your ops and storage. They've been publicly available for several years.
Fair enough; I think it depends where you draw the line between key/value store and database.
Both of those depend on other distributed storage systems under the hood, as far as I am aware? Or is Datastore an end to end system? I know Firebase was backed by MongoDB.
In technology evolution there are technologies that enable a new ecosystem, and then there are technologies that are built natively for that ecosystem. The previous generation of datastores enabled Lambda style applications, the next generation of databases assumes they are the new normal.
The reasons FaunaDB fits serverless like a glove can be boiled down to a few points: pay-as-you-go pricing, database-level security awareness with object-level access control, and hierarchical multi-tenancy with quality-of-service management. Running on multiple clouds makes the serverless model more acceptable for risk-averse enterprises, and complements multi-cloud serverless FaaS execution environments nicely.
I feel like I need to note the pricing: $0.01 per 1,000 queries. That doesn't sound like much, but it adds up. Let's say you make 1,000 queries/sec; that's $0.01 per second, and $0.01 * 60 seconds * 60 minutes * 24 hours * 30 days = $25,920 per month.
Is that a lot? I think it is. Google Cloud Spanner costs $0.90/hour per node or around $650/mo. Each Cloud Spanner node can do around 10,000 queries per second[1]. So, $650 to Google gets you 10x the queries that $25,920 to Fauna gets you. I mean, for $25,920, you could get a Spanner cluster with 40 servers. Each of those servers would only have to handle 25 queries per second to get you 1,000 queries per second.
I'm sure that people are going to question whether FaunaDB can actually do what it claims. At this pricing, I can't imagine someone actually seeing if they can live up to their claims. They have a graph showing linear scaling to 2M reads per second. Based on their pricing, that would be $630M per year. For comparison, Snapchat committed to spending $400M per year on Google Cloud and another $100M on AWS (and people thought the spend was outrageous even for a company valued at tens of billions of dollars). This is more money for the database alone.
Heck, it looks like one can get 5-20k queries per second out of Google's Cloud SQL MySQL on a highmem-16 costing $1k/mo[2]. That would cost $130k-$500k on FaunaDB. It seems like the pricing of FaunaDB is off by a couple orders of magnitude.
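The arithmetic behind these comparisons is easy to sanity-check. Here is a minimal sketch, using the prices quoted in this thread (which may since have changed, and which round Spanner's per-node throughput to the 10,000 qps figure above):

```python
import math

# Back-of-the-envelope check of the numbers above. Prices are the ones
# quoted in this thread (FaunaDB: $0.01 per 1,000 queries; Spanner: $0.90
# per node-hour at roughly 10,000 queries/sec per node) and may be dated.

SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 2,592,000

def fauna_monthly_cost(qps):
    """Pay-per-query: $0.01 per 1,000 queries."""
    queries_per_month = qps * SECONDS_PER_MONTH
    return queries_per_month / 1_000 * 0.01

def spanner_monthly_cost(qps, qps_per_node=10_000):
    """Provisioned: $0.90 per node-hour, rounded up to whole nodes."""
    nodes = max(1, math.ceil(qps / qps_per_node))
    return nodes * 0.90 * 24 * 30

print(round(fauna_monthly_cost(1_000)))    # 25920  (~$26K/month at 1,000 qps)
print(round(spanner_monthly_cost(1_000)))  # 648    (a single node covers it)
```

The same two functions reproduce the 60,000 qps comparison later in this comment: roughly $1.56M/month for pay-per-query versus about $3.9K/month for six provisioned nodes.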
Ultimately, Spanner is something built by people that published a notable research paper and used by Google. Reading the paper, you can understand how Spanner works and be saddened that you don't have TrueTime servers powered by GPS and atomic clocks. FaunaDB has some marketing speak about how I'll never have to worry about things ever again - without telling me how it will achieve that.
It's also implemented in Scala. This isn't a dig at Scala or the JVM, but I use three datastores on the JVM, and the only one that isn't sad for it is Kafka. And Kafka does very little in the JVM: it basically just leans on sendfile to move bytes, which means you don't get bad GC cycles or lots of allocations and copying.
FaunaDB is a datastore without much information other than "it's great for everything and scales perfectly". Well, at their pricing, they might be able to make it happen. I mean, most customers would simply move to something cheaper as they got beyond small amounts of traffic due to the pricing. 60,000 queries per second? That'll be $18M per year from FaunaDB or $50k per year from Google. It's not even in the same ballpark. If you really need to scale to 2M reads per second, $630M seems like a lot more than $1.6M for Spanner.
Maybe it's an easy way to get some money off people that "need a web scale database", but are actually going to be serving like 10 queries per second and are willing to spend $260/mo to serve that. If they hit it big, it shouldn't be insane to scale it to 10,000 queries per second and milk $260k out of them each month for a workload that can be handled by a single machine. That money also pays for decent ops people to run a big box and consult with the customer if they're going towards 100k queries per second with a $2.6M monthly payment.
EDIT: Looking over Fauna's blog and some of their comments here, they seem to understand more than their marketing lets on. Daniel Abadi is one of those people whose name carries weight in the database world (having been involved with C-Store/Vertica, H-Store/VoltDB, and others). While I haven't read the Calvin paper, it looks like a good read. I can see that they are using logical clocks, and (I can't find it right now, but) I thought I saw that they don't let you keep transaction sessions checked out: all the operations must be specified up front. So it seems like there's some decent stuff in there that's currently being obscured by marketing speak. Still, the pricing seems really curious.
To add extra color: for about $3M/month at list prices of Cloud Datastore [1], you can, in a multi-region, active-active, synchronously replicated configuration, run a workload with the following profile: reads >1.1M entities/second, writes >380K entities/second, deletes >190K entities/second, and 100TB of storage.
And that's if you don't use any of the nearly free optimizations like Projection queries & keys-only queries, which any large scale customer does.
That's not pre-provisioned usage, it's actual pay-as-you-go usage - so if you have no traffic, you have no costs (except for what's already stored). It's been that way for 8 years too.
Huge scale is what FaunaDB On-Premises is for; the pricing model is different. That's what NVIDIA uses for example. Nevertheless, we will have volume discounts and reserved capacity in Cloud too.
I see where you're coming from. People make the same argument against using cloud services at all when you can buy hardware yourself and operate it. The lack of flexibility is the hidden cost.
Our cloud pricing is competitive with other vendors, most of which require you to massively over-provision in order to get high availability (especially global availability) as well as predictable performance. In traditional cloud databases, you have to provision for peak load, which is usually an order of magnitude above average load. An order-of-magnitude difference happens to match your Spanner example exactly; however, with Spanner you still have to manage your capacity by hand.
I understand that we are talking about a globally distributed, serverless and yet consistent relational database.
My question is about latency. How long does it take for an atomically committed write to become a consistent read on a globally distributed database? (1) And what measures are taken between entry nodes to prevent clients from receiving inconsistent data? (2)
As I ponder this, I am struck not by the consistency problem, as that is solvable, but by the latency problem of assuring that all global queries are consistent for some (any) time quantum. What sort of latency should be expected?
Both questions (1) and (2) are interesting, but (1) is critical while (2) is academic.
FaunaDB has a per-database replicated transaction log. Once a transaction has been globally committed to the log, it is applied to each local partition that covers part of the transaction. By this point, the transaction's order with respect to others in a database and results are determined. While writes require global coordination to commit, reads across partitions are coordinated via a snapshot time per query, which guarantees correctness.
In short, writes require a global round-trip through the transaction pipeline; reads are local and low latency.
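The commit/read split described here can be illustrated with a toy model: writes pass through a single globally ordered log, while reads pin a snapshot position and are served locally. This is a sketch of the general Calvin-style idea only, not FaunaDB's actual implementation:

```python
# Toy model: writes get a position in a globally ordered transaction log
# (here, just a list); reads pick a snapshot position and only see versions
# at or below it. Illustrative only; not FaunaDB's real code.

class ToyPartition:
    """One local data partition holding multi-versioned key/value pairs."""

    def __init__(self):
        self.versions = {}  # key -> list of (log_position, value)

    def apply(self, position, writes):
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((position, value))

    def read(self, key, snapshot):
        """Latest value at or before the snapshot position (local, uncoordinated)."""
        visible = [(p, v) for p, v in self.versions.get(key, []) if p <= snapshot]
        return max(visible)[1] if visible else None

log = []            # the replicated, globally ordered transaction log
partition = ToyPartition()

def commit(writes):
    """Globally order the transaction, then apply it locally; return its position."""
    log.append(writes)          # global coordination (consensus) happens here
    position = len(log) - 1
    partition.apply(position, writes)
    return position

commit({"a": 1})
snapshot = commit({"a": 2, "b": 3})
commit({"a": 4})

# A query pinned to `snapshot` sees one consistent cut, ignoring later writes:
print(partition.read("a", snapshot))  # 2
print(partition.read("b", snapshot))  # 3
```

The point of the model is the asymmetry: `commit` needs global agreement on log order, while `read` only compares against an already-known snapshot position.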
Looks like FaunaDB uses Raft[1], so I'd expect that data is sharded into multiple consensus groups, like Spanner or Megastore. That would mean consistency on a single shard/consensus group is basically just dependent on reading from and writing to the Raft leader.
$0.01 per simple operation sounds very expensive to me. This would add up very quickly.
Edit: I misread it. Perhaps, instead of inventing your own point system that you have to explain (and hope silly people like me don't mix up), you could take a lesson from Google Cloud and just lay out the pricing in a table. If you ever add another service, you'll have to integrate it into your made-up points system too.
That pricing model and serverless model are why I've always chosen CouchDB/Cloudant. If I'm doing the MB/hour to GB/month conversion correctly, Fauna's cloud is significantly cheaper.
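For readers checking that comparison themselves, the MB/hour to GB/month conversion is mechanical. A tiny sketch, with a made-up rate (neither vendor's actual price):

```python
# Converting a storage price quoted per MB-hour into the equivalent price
# per GB-month. Assumptions: 1 GB = 1,000 MB and a 720-hour (30-day) month.
# The example rate below is invented for illustration.

HOURS_PER_MONTH = 24 * 30  # 720

def mb_hour_to_gb_month(price_per_mb_hour):
    """Equivalent price for keeping one GB stored for a whole month."""
    return price_per_mb_hour * 1_000 * HOURS_PER_MONTH

# e.g. a hypothetical $0.000002 per MB-hour works out to $1.44 per GB-month:
print(round(mb_hour_to_gb_month(0.000002), 2))  # 1.44
```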
I see Fauna has temporal queries, but receiving events is strictly pull, there is no push or single feed?
Event push/feeds are on the roadmap. Currently we have everything implemented at the data-model level to do live query feeds; you just have to poll until we ship the feature.
I'm working on a follow-up to this CRUD example that implements a multi-user TodoMVC and will use event queries to keep the UI updated between tabs and users. You can see the basic Serverless CRUD starter example here: https://fauna.com/blog/serverless-cloud-database
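Until the push feature ships, the polling approach mentioned above might look something like this. The event-fetching function is a hypothetical stand-in for whatever event query your driver exposes; the cursor-advancing loop is the point of the sketch:

```python
import time

# Approximating a live feed by polling temporal/event queries.
# `fetch_events_since` is hypothetical (not a real Fauna driver call);
# it should return events newer than the given timestamp cursor.

def poll_once(fetch_events_since, handle, cursor):
    """One polling step: deliver events newer than `cursor`, return the new cursor."""
    for event in fetch_events_since(cursor):
        handle(event)
        cursor = max(cursor, event["ts"])
    return cursor

def poll_feed(fetch_events_since, handle, interval_seconds=2.0):
    """Run forever, turning repeated polls into a push-like stream of callbacks."""
    cursor = 0
    while True:
        cursor = poll_once(fetch_events_since, handle, cursor)
        time.sleep(interval_seconds)
```

Tracking the last-seen event timestamp keeps each poll incremental, so repeated polls deliver every event exactly once even if the interval is long.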
You should explain how it works. It's not like I'm going to steal your ideas and spend five years implementing them ... or maybe I will if it's good ;)
I'm curious about the relational-ness of FaunaDB. e.g. How do you efficiently maintain integrity of foreign key constraints across the entire system? How fast and consistent are secondary indexes?
So... where does the data go? Maybe a simpleton question but I couldn't easily find an answer in the about section. If it's all function-based, where does the data actually get persisted?
This is a database FOR serverless style applications. It runs on servers like most databases, it's not made out of lambdas. But it's built so you don't have to worry about the details. When a traffic spike hits your app we'll keep up. And when your app is quiet, you don't pay for unused capacity.
We're definitely aware that the lack of schema definition is a problem for certain use cases, and solving it is on our roadmap.
Edit: Thanks to all the commenters who corrected me :)
An on-premises release is coming later this year.
There's more to say; check out these posts on the blog: https://fauna.com/blog/serverless-cloud-database and https://fauna.com/blog/escape-the-cloud-database-trap-with-s...
[1] https://cloud.google.com/spanner/docs/instance-configuration
[2] https://www.pythian.com/blog/benchmarking-google-cloud-sql-i...
[1] https://cloud.google.com/products/calculator/#id=e21b61d5-4a...
(PM for Cloud Datastore; if you're looking at 1M+ QPS workloads, feel free to message me.)
Architecture docs are on the way.
Thanks, and very interesting work, guys.
EL
[1] https://news.ycombinator.com/item?id=13645876
One way to handle it when something strikes you as not correct is to politely ask for clarification.