Where is taligent when we need him? MongoDB is pretty impressive considering its age in the market, but I'd still go with the relational side of PostgreSQL any day. Not because of the benchmarks, but because, in general, most use cases are covered well enough by relational DBs like MySQL and PostgreSQL. The most expensive mistake you can make is using a NoSQL DB where an SQL DB fits perfectly. I admittedly made this mistake and learnt a LOT. Now these benchmarks are just some extra spice for sticking with my lovely PostgreSQL :)
I think it's easy to forget that Mongo did hit an under-served sweet spot when it turned up around 2008. But SSDs and the explosion in per-core RAM density have steadily rendered those advantages less compelling.
I imagine that as time goes on the PostgreSQL team will do their usual tortoise impression, steadily closing the distance on the NoSQL pioneers (and they deserve credit for reopening and re-exploring this chapter in database history).
Why do you need me? I am just that crazy troll who believes that no database is perfect for all use cases and that making blind generalisations like "all NoSQL is bad" is stupid. Especially since MongoDB is as different from Cassandra or Riak as it is from Spanner, and as that is from the 120-odd other NoSQL databases.
And if benchmarks impress you, I suggest you take a look at Cassandra sometime: http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...
So, the problem that I have with PostgreSQL isn't that it doesn't cover every datastore use case under the sun, but rather the lack of automated distribution and fault tolerance.
This is not a hard thing to build, though (continue reading, please). I don't mean that it's an easy task, but the semantics by which you could build a fault-tolerant, distributed database on top of PostgreSQL are pretty straightforward.
1. Partitioning:
So, PostgreSQL has multiple notions of partitions. Ideally this would be done at the database level in Postgres, but the idea is that instead of your entire database failing at once, you split it across independent instances, and a transparent layer in which you SET PARTITION_KEY=X before querying would make this pretty straightforward. It would be nice if there was a semi-transparent proxy layer that did this.
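To make the proxy idea concrete, here's a minimal routing sketch in Python. The DSNs and the plain modulo-hash scheme are assumptions for illustration only; a real layer would want consistent hashing so partitions can be moved without rehashing everything.

```python
# Hypothetical sketch of the "semi-transparent proxy": route each query
# to one of N independent PostgreSQL instances by partition key, so a
# single node failing only takes out 1/N of the data.
import hashlib

# Assumed topology: one DSN per partition (these names are made up).
PARTITION_DSNS = [
    "host=pg0.internal dbname=app",
    "host=pg1.internal dbname=app",
    "host=pg2.internal dbname=app",
]

def dsn_for_key(partition_key: str) -> str:
    """Map a partition key to the DSN that owns it. A stable hash keeps
    the mapping deterministic across processes and restarts."""
    digest = hashlib.sha1(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(PARTITION_DSNS)
    return PARTITION_DSNS[index]
```

The proxy would then open a connection to `dsn_for_key(key)` and run the query there, instead of making every client issue SET PARTITION_KEY itself.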
2. Replication / Fault-tolerance:
PostgreSQL today has built-in synchronous master/slave replication. It would be nice if someone built an automated failover system on top of this. Alternatively, it would be really interesting if someone ran multiple PostgreSQL instances, kept a WAL log per partition, and performed three-phase commit (3PC) between N replicas; then you could have a fully peer-to-peer Postgres. I imagine these transactions would either need to be PL/pgSQL, or fully written-out, staged serializable SQL queries.
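The failover half can be sketched too. Assuming each standby reports how far it has replayed the WAL (Postgres exposes replay position as an LSN), the promotion rule is just "pick the furthest-ahead standby". The standby names and LSN values below are invented for illustration.

```python
# Hedged sketch of an automated-failover decision: when the primary
# dies, promote the standby with the most WAL replayed, i.e. the one
# that loses the least data.

def promote_candidate(replica_lsns: dict) -> str:
    """Pick the standby whose replayed WAL position (here simplified
    to an integer) is furthest ahead."""
    return max(replica_lsns, key=replica_lsns.get)

standbys = {"standby-a": 120_473, "standby-b": 120_991, "standby-c": 119_858}
print(promote_candidate(standbys))  # standby-b is furthest ahead
```

A real system also needs fencing (making sure the old primary stays down) before promotion, which is where most of the actual difficulty lives.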
3. Co-processors:
One of the biggest benefits a distributed, fault-tolerant PostgreSQL setup would give you is the ability to co-locate your application logic with the data. Instead of having to do read-modify-write from the client, you can write complex business logic that runs on whichever node is a partition's master, and the data doesn't have to go over the network.
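A toy illustration of why this matters (pure Python mock, no real server): client-side read-modify-write costs two network round trips per update, while a co-located procedure (think PL/pgSQL or PL/V8) costs one.

```python
# Mock of a Postgres partition master, counting round trips only.
class FakePartition:
    def __init__(self):
        self.rows = {"acct:1": 100}
        self.round_trips = 0

    # Client-side style: each statement is a network round trip.
    def read(self, key):
        self.round_trips += 1
        return self.rows[key]

    def write(self, key, value):
        self.round_trips += 1
        self.rows[key] = value

    # Co-processor style: one round trip runs the whole routine
    # next to the data.
    def call(self, proc, key):
        self.round_trips += 1
        self.rows[key] = proc(self.rows[key])

p1, p2 = FakePartition(), FakePartition()

# Read-modify-write from the client: two round trips.
p1.write("acct:1", p1.read("acct:1") + 50)

# The same logic shipped to the data: one round trip.
p2.call(lambda balance: balance + 50, "acct:1")

print(p1.round_trips, p2.round_trips)  # 2 1
```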
I feel that by introducing some basic wrappers around Postgres to support partitioned master/slave topologies, and perhaps a slightly different query interface to support this transparently, PostgreSQL could effectively replace most modern NoSQL databases.
Hot Standby in Postgres is pretty damn good for replication (like you said, making failover work properly takes a little bit of effort, but it's not too hard).
Also pgpool2 helps a lot with partitioning and parallel queries.
For a relational db, postgres does a pretty good job at distribution and fault tolerance.
PostgreSQL today has synchronous and asynchronous master/slave replication built in. The dev team is working on master/master replication (or peer-to-peer, as you called it).
It is a transparent, consistent, load-balancing and fault-tolerant cluster, though some PostgreSQL features are not supported yet, e.g. triggers.
This is pretty much the reason we're using Postgres for our document storage. Admittedly, our doc store isn't really extensive and it's mostly for internal use (with limited client access), but we just didn't see why it was necessary to go with a new database when the existing stack worked well.
YMMV, but we've yet to come across an instance where a pure document DB was necessary or desired, even to store documents (which from my perspective seems nothing more than a flattened EAV model).
Do you have any thoughts on expression indexes? I'd assumed I would have to pull fields out of the document to index reasonably on pgsql, but this presentation at least shows that simple expression indexes perform well. I wonder about compound, nested, etc.
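For what it's worth, a sketch of the kind of expression index being discussed; the table and field names ("docs", "body", "email") are made up, and the `->` here is hstore's field-lookup operator. Postgres lets you index any immutable expression, so you can index a field inside the document without promoting it to a column.

```python
# Hypothetical helper that emits the DDL for an expression index on a
# field stored inside an hstore column. Note the doubled parentheses
# PostgreSQL requires around an expression in CREATE INDEX.

def expression_index_ddl(table: str, doc_col: str, field: str) -> str:
    return (
        f"CREATE INDEX {table}_{field}_idx "
        f"ON {table} (({doc_col} -> '{field}'));"
    )

print(expression_index_ddl("docs", "body", "email"))
```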
I've tried both MongoDB+Mongoid and PostgreSQL+hstore as a way to do better inheritance than STI in ActiveRecord.
Both have deficiencies: Mongo is not mature enough for an old lag like me to trust (and is incompatible with some Rails extensions), and hstore had problems with key deletion and cannot do object references or indeed types other than strings.
I want to point out that the conclusion in the presentation can be read the other way around: MongoDB is almost as fast as PostgreSQL. Also, notice that the comparison was done with Mongo 2.2; in 2.4 the JS engine was changed to V8, so there are a lot of speed improvements. And compare PG's 18 years of development vs MongoDB's 5. So what do I want to say? The difference is not so big after all.
PostgreSQL hasn't spent 18 years pitching itself as a pure performance play, though. It started as a research project into the object-relational model and was adapted to be a safe and featuresome database first and performant second.
Mongo's performance will converge downwards as people demand more features and safety guarantees.
If however you want the original premise -- memory-only, schemaless, auto-sharded document database -- Mongo is still the easier pick.
It's a cliche, but it's easy to:
1. overestimate the size of your data (especially since relational models by design squeeze out all duplicated statements of data)
2. overestimate how much traffic you need to serve
3. underestimate what can be achieved by stock RDBMSes with modern hardware
4. underestimate the deferred development costs of schemaless data and
5. underestimate the costs of avoiding durable storage.
It's a really hard field to make generalisations about right at the moment. The algo-economics of hardware (especially SSDs) and software have been changing so fast that any generalisation formed even 2 years ago is unreliable.
Speaking only for myself, I need a lot of talking down from my data-safety/data-integrity tree. I don't trust myself. That's why I use tools that slow me down.
I suppose the takeaway is meant to be: why bother having two separate data storage systems (relational + schemaless) if only one will do? I mean, if Mongo isn't offering much (if any) performance gain or schemaless flexibility... what's the advantage? With Postgres, you can have your cake and eat it: your transactional data gets the benefits of a strict schema and referential integrity, but you also get an "escape hatch" out of the relational model for when that is needed. (And 18 years of proven code, deployed at massive scale, that definitely isn't going to lose your data.)
Side note: what's with Mongo using so much storage? Perhaps an append-only log without garbage collection?
>I want to point out that the conclusion in the presentation can be read the other way around: MongoDB is almost as fast as PostgreSQL.
Which, given how little MongoDB does, and how it can lose data unless you cripple the performance even more (i.e. by waiting for a successful write status), is impressive in itself.
>Also, notice that the comparison was done with Mongo 2.2; in 2.4 the JS engine was changed to V8, so there are a lot of speed improvements.
Not really. The JS engine was never the bottleneck in MongoDB. The demanding stuff is done in the C/C++ core.
>And compare PG's 18 years of development vs MongoDB's 5.
Not really relevant. PG's 18 years of development went into implementing hundreds of things MongoDB doesn't do, not merely doing the same schemaless stuff and tuning it for speed.
It's only one of several interesting comparisons in the slides, but I think we need more data points on disk space to make use of that figure. Does it grow linearly? Probably, but is there a constant offset in there? Mongo does some big upfront disk allocations. Were those accounted for somehow?
When the benchmark shows Postgres can do the same as Mongo, only faster, why aren't we seeing more hybrid solutions? Online retailers would be better off using relational storage for monetary transactions and time series, and JSON storage for document-like data.
With hstore (http://www.postgresql.org/docs/9.1/static/hstore.html) you can put schemaless documents into the same database, even into the same table row, as your usual relational data. You could start with a schema that's only a primary key and an hstore for all other data (like a fully schemaless database), but over time extract important fields from the hstore into columns to add indexes, constraints or joins.
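That "schemaless first, schema later" path might look like the following SQL, collected here as a Python list just so it's easy to inspect; the table and key names are invented. The `delete(hstore, text)` function is hstore's key-removal helper.

```python
# Sketch of the migration described above: begin fully schemaless,
# then promote a field ('user_id') to a real typed, indexable column.
MIGRATION = [
    # Day 1: just a key and an hstore bag of everything else.
    "CREATE TABLE events (id bigserial PRIMARY KEY, data hstore);",
    # Later: 'user_id' turns out to matter, so give it a real column
    # where it can get a type, an index, and constraints.
    "ALTER TABLE events ADD COLUMN user_id bigint;",
    "UPDATE events SET user_id = (data -> 'user_id')::bigint;",
    "UPDATE events SET data = delete(data, 'user_id');",
    "CREATE INDEX events_user_id_idx ON events (user_id);",
]
```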
Depends what you mean by "NoSQL". MongoDB is pretty terrible overall, so don't even bother.
Systems like Dynamo/Riak/Cassandra sacrifice consistency (most of the time) for performance, largely because a distributed setup (sharding in particular) is easier when you don't guarantee consistency. Note that none of them sacrifice durability.
It is possible to have a distributed and consistent database, but you either have a bottleneck (master nodes in BigTable/HBase), require significant infrastructure (Google Spanner) or make schema changes very hard (Hyperdex). It's a hard problem.
There is still some way to go to support better indexing of JSON fields, but I'm pretty excited about these developments. In the future you might see schemaless and schema-driven data living side by side in Postgres.
The original slides are at https://wiki.postgresql.org/images/b/b4/Pg-as-nosql-pgday-fo... There is also another presentation on the use of PL/V8 (on Heroku) at the PostgreSQL wiki, from last week's NYC talks: http://plv8-talk.herokuapp.com/#1
Sadly it's not a fair fight: of course Postgres will beat Mongo (even though Postgres is ACID).
I would be more interested in comparing the biggest machine you can get on AWS running Postgres with a cluster of smaller machines running Cassandra. I expect Cassandra to win, but of course it's not consistent.
Would it be possible to build an extension that exposes a Mongo-compatible interface in Postgres, essentially letting you use all the ORM/code built around Mongo with Postgres, seamlessly?
For me, MongoDB still feels like a semi-niche solution. I'd rather go with a standard SQL database unless there is a specific reason you need NoSQL.
"It's time for us to switch to something fresher. I personally would have preferred XSLT, but QUEL is almost as good."
If the data is well structured and no joins are needed, can I assume that a relational database should have better performance than NoSQL?
Anyone seeing those numbers for any database that synchronously writes the log to disk?