
PostgreSQL as Schemaless Database [pdf]

247 points | xenator | 13 years ago | wiki.postgresql.org

89 comments

[+] neya|13 years ago|reply
Where is taligent when we need him? MongoDB is pretty impressive considering its age to the market, but I'd still go with the relational version of PostgreSQL anyday. Not because of the benchmarks, but in general, most use-cases are covered well enough by our relational db's like MySQL and PostgreSQL. The most expensive mistake one could make is using a NoSQL db where an SQL db fits perfectly. I admittedly made this mistake and learnt a LOT. Now, these benchmarks are just some extra spice for sticking with my lovely PostgresSQL :)
[+] jacques_chester|13 years ago|reply
I think it's easy to forget that Mongo did hit an under-served sweet spot when it turned up in 2008ish. But SSDs and the explosion in per-core RAM density have steadily rendered those advantages less compelling.

I imagine that as time goes on the PostgreSQL team will do their usual tortoise impression, steadily closing the distance on the NoSQL pioneers (and they deserve credit for reopening and re-exploring this chapter in database history).

[+] lucian1900|13 years ago|reply
You managed to choose the worst new database out there to call impressive :) Cassandra is impressive. Mongo? Not so much.
[+] taligent|13 years ago|reply
Why do you need me? I am just that crazy troll who believes that no database is perfect for all use cases and that making blind generalisations like "all NoSQL is bad" is stupid. Especially since MongoDB is as different from Cassandra or Riak as it is from Spanner, and as that is from the 120-odd other NoSQL databases.

And if benchmarks impress you I suggest you take a look at Cassandra sometime.

http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...

[+] sargun|13 years ago|reply
So, the problem that I have with PostgreSQL isn't that it doesn't have every datastore under the sun, but rather the lack of automated distribution, and fault-tolerance.

This is not an intractable thing to build, though (continue reading, please). I don't mean that it's an easy task, but the semantics by which you could build a fault-tolerant, distributed database on top of PostgreSQL are pretty straightforward.

1. Partitioning: PostgreSQL has multiple notions of partitions. Ideally this would be done at the database level in Postgres, but the idea is that instead of having your entire database fail at once, a transparent layer in which you SET PARTITION_KEY=X before querying would make this pretty straightforward. It would be nice if there were a semi-transparent proxy layer that did this.

2. Replication / Fault-tolerance: PostgreSQL today has built-in synchronous master/slave replication. It would be nice if someone built an automated failover system on top of this. Alternatively, it would be really interesting if someone ran multiple PostgreSQL instances, kept a WAL log per partition, and performed 3PC between N replicas; then you could have a fully peer-to-peer Postgres. I imagine these transactions would either need to be PL/SQL, or fully written-out, staged serializable SQL queries.

3. Co-processors: One of the biggest benefits a distributed, fault-tolerant PostgreSQL setup would give you is the ability to colocate your application with the data. Instead of having to do read, modify, write, you can write complex business logic that runs on whichever node is the master for a given Postgres partition, so the data doesn't have to go over the network.

I feel that by introducing some basic wrappers around Postgres to support partitioned, master-slave topologies, and perhaps a slightly different query interface to support this transparently, PostgreSQL could effectively replace most modern NoSQL databases.
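Point 1 above could be sketched roughly as follows. This is purely illustrative: PARTITION_KEY is not a real Postgres setting, so the sketch uses a custom configuration variable (`app.partition_key`), and the table and key names are invented.

```sql
-- Hypothetical sketch of the proposed routing layer.
-- "app.partition_key" is a custom GUC the proxy would read;
-- it is not a built-in PostgreSQL variable.
SET app.partition_key = '12345';

-- The proxy would inspect that setting and forward subsequent
-- queries to whichever shard owns that key:
SELECT * FROM orders WHERE customer_id = 12345;
```

The appeal of this scheme is that the application keeps speaking plain SQL; only the thin proxy layer needs to understand the partition topology.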

[+] bbromhead|13 years ago|reply
Hot Standby in Postgres is pretty damn good for replication (like you said, making failover occur properly takes a little bit of effort, but it's not too hard).

Also pgpool2 helps a lot with partitioning and parallel queries.

For a relational db, postgres does a pretty good job at distribution and fault tolerance.

[+] atsaloli|13 years ago|reply
PostgreSQL today has synchronous and asynchronous master/slave replication built in. The dev team is working on master/master replication (or peer-to-peer, as you called it).
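For reference, the built-in synchronous replication mentioned here is driven by a handful of settings on the master. A minimal sketch (the standby name is illustrative):

```sql
# postgresql.conf on the master (PostgreSQL 9.1+)
wal_level = hot_standby
max_wal_senders = 3

# Naming a standby here makes commits wait for its acknowledgement;
# leaving it empty falls back to asynchronous replication.
synchronous_standby_names = 'standby1'
```

Automated failover is the part Postgres leaves to external tooling, which is what the parent comment is asking for.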
[+] karavelov|13 years ago|reply
Then you may be interested in Postgres-XC: http://postgres-xc.sourceforge.net/

It is a transparent, consistent, load-balancing and fault-tolerant cluster, though some PostgreSQL features are not supported yet, e.g. triggers.

[+] eksith|13 years ago|reply
This is pretty much the reason we're using Postgres for our document storage. Admittedly, our doc store isn't really extensive and it's mostly for internal use (with limited client access), but we just didn't see why it was necessary to go with a new database when the existing stack worked well.

YMMV, but we've yet to come across an instance where a pure document DB was necessary or desired, even to store documents (which from my perspective seems nothing more than a flattened EAV model).

[+] aaronyo|13 years ago|reply
Do you have any thoughts on expression indexes? I assumed I would have to pull fields out of the document to index reasonably on pgsql, but this presentation at least shows that simple expression indexes perform well. I wonder about compound, nested, etc.
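For anyone unfamiliar with the technique being discussed: an expression index lets you index a value computed from a column, such as a single key inside an hstore document, without materialising it as a real column. A minimal sketch, with invented table and key names:

```sql
-- Illustrative schema: an hstore document column.
CREATE TABLE docs (
    id   serial PRIMARY KEY,
    data hstore
);

-- Expression index on one hstore key (note the double parentheses
-- required around the expression):
CREATE INDEX docs_author_idx ON docs ((data -> 'author'));

-- The planner can use the index for equality lookups on that key:
SELECT id FROM docs WHERE data -> 'author' = 'jsmith';
```

Compound and nested cases would need one index per expression (or a composite expression index), which is presumably where the open question above comes in.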
[+] jtchang|13 years ago|reply
Really great presentation/slide deck.

For me, MongoDB still feels like a semi-niche solution. I'd rather go with a standard SQL database unless there is a specific reason you need NoSQL.

[+] inopinatus|13 years ago|reply
I've tried both MongoDB+Mongoid and PostgreSQL+hstore as a way to do better inheritance than STI in ActiveRecord.

Both have deficiencies: Mongo is not mature enough for an old lag like me to trust (and is incompatible with some Rails extensions), and hstore had problems with key deletion and couldn't do object references, or indeed types other than strings.

[+] adrianhoward|13 years ago|reply
Damn. The "XML. It seemed a good idea at the time" slide caused me to spray coffee on my laptop.
[+] Evbn|13 years ago|reply
See also "XML literals as a native syntax in Scala."
[+] xenator|13 years ago|reply
I want to point out that the conclusion in the presentation can be interpreted vice versa: MongoDB is almost as fast as PostgreSQL. Also, notice that the comparison was done with Mongo 2.2; in 2.4 the JS engine was changed to V8, so there are a lot of speed improvements. And compare: PG has had 18 years of development vs 5 for MongoDB. So what do I want to say? The difference is not so big after all.
[+] jacques_chester|13 years ago|reply
PostgreSQL hasn't spent 18 years pitching itself as a pure performance play, though. It started as a research project into the object-relational model and was adapted to be a safe and featuresome database first and performant second.

Mongo's performance will converge downwards as people demand more features and safety guarantees.

If, however, you want the original premise -- memory-only, schemaless, auto-sharded document database -- Mongo is still the easier pick.

It's a cliche, but it's easy to:

1. overestimate the size of your data (especially since relational models by design squeeze out all duplicated statements of data),

2. overestimate how much traffic you need to serve,

3. underestimate what can be achieved by stock RDBMSes with modern hardware

4. underestimate the deferred development costs of schemaless data and

5. underestimate the costs of avoiding durable storage.

It's a really hard field to make generalisations about right at the moment. The algo-economics of hardware (especially SSDs) and software have been changing so fast that any generalisation formed even 2 years ago is unreliable.

Speaking only for myself, I need a lot of talking down from my data-safety/data-integrity tree. I don't trust myself. That's why I use tools that slow me down.

[+] kiwidrew|13 years ago|reply
I suppose the takeaway is meant to be: why bother having two separate data storage systems (relational + schemaless) if only one will do? I mean, if Mongo isn't offering much (if any) performance gain or schemaless flexibility... what's the advantage? With Postgres, you can have your cake and eat it: your transactional data gets the benefits of a strict schema and referential integrity, but you also get an "escape hatch" out of the relational model for when that is needed. (And 18 years of proven code, deployed at massive scale, that definitely isn't going to lose your data.)

Side note: what's with Mongo using so much storage? Perhaps an append-only log without garbage collection?

[+] coldtea|13 years ago|reply
> I want to point out that the conclusion in the presentation can be interpreted vice versa: MongoDB is almost as fast as PostgreSQL.

Which, given how little MongoDB does, and how it can lose data unless you cripple the performance even more (i.e. by waiting for successful write status), is impressive in itself.

> Also, notice that the comparison was done with Mongo 2.2; in 2.4 the JS engine was changed to V8, so there are a lot of speed improvements.

Not really. The JS engine was never the bottleneck in MongoDB. The demanding stuff is done in the C/C++ core.

> And compare: PG has had 18 years of development vs 5 for MongoDB.

Not really relevant. PG had 18 years of development implementing hundreds of things MongoDB doesn't do, not merely doing the same schema-less stuff and tuning it for speed.

[+] aaronyo|13 years ago|reply
It's only one of several interesting comparisons in the slides, but I think we need more data points on disk space to make use of that one. Does it grow linearly? Probably, but is there a +x in there? Mongo does some big upfront disk allocations. Were those accounted for somehow?
[+] thomasfl|13 years ago|reply
If the benchmark shows Postgres can do the same as Mongo, only faster, why aren't we seeing more hybrid solutions? Online retailers would be better off using relational storage for monetary transactions and time series, and JSON storage for document-like data.
[+] ProblemFactory|13 years ago|reply
The great thing about PostgreSQL is that it is hybrid.

With http://www.postgresql.org/docs/9.1/static/hstore.html you can put schemaless documents into the same database, even into the same table row as your usual relational data.

You could start with a schema that's only a primary key and hstore for all other data (like a fully schemaless database), but over time extract important fields from the hstore into columns to add indexes, constraints or joins.
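The evolution described above can be sketched concretely. Table, column and key names here are invented for illustration:

```sql
-- Start fully schemaless: a primary key plus an hstore bag.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE events (
    id   bigserial PRIMARY KEY,
    data hstore NOT NULL
);

-- Later, promote a hot field to a real column so it can carry
-- indexes, constraints and joins:
ALTER TABLE events ADD COLUMN user_id bigint;

UPDATE events
   SET user_id = (data -> 'user_id')::bigint,
       data    = data - 'user_id';   -- hstore's "-" operator drops the key

CREATE INDEX events_user_id_idx ON events (user_id);
```

The nice property is that the migration is an ordinary transactional UPDATE, rather than a rewrite of the whole data model.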

[+] guilloche|13 years ago|reply
I always wonder why NoSQL has better performance than SQL given the same simple use case (no joins, etc.).

If the data is well structured and no join is needed, can I assume that a relational database should have better performance than NoSQL?

[+] lucian1900|13 years ago|reply
Depends what you mean by "NoSQL". MongoDB is pretty terrible overall, so don't even bother.

Systems like Dynamo/Riak/Cassandra sacrifice consistency (most of the time) for performance, largely because a distributed setup (sharding in particular) is easier when you don't guarantee consistency. Note that none of them sacrifice durability.

It is possible to have a distributed and consistent database, but you either have a bottleneck (master nodes in BigTable/HBase), require significant infrastructure (Google Spanner) or make schema changes very hard (Hyperdex). It's a hard problem.

[+] cinbun8|13 years ago|reply
There is still some way to go on better indexing of JSON fields, but I'm pretty excited about these developments. In the future you might see schemaless and schema-driven data living side by side on Postgres.
[+] lucian1900|13 years ago|reply
Sadly it's not a fair fight, of course it'll beat Mongo (even though Postgres is ACID).

I would be more interested in comparing the biggest machine you can get on AWS running Postgres with a cluster of smaller machines running Cassandra. I expect Cassandra to win, but of course it's not consistent.

[+] antimora|13 years ago|reply
The problem with this scheme is that Postgres does not support JSON indexing or partial updates, which are important for high-transaction databases.
[+] apunic|13 years ago|reply
When I checked hstore's functionality the last time (~6 months ago) it was rather limited compared to Mongo's full-blown API. Any new observations?
[+] ogdr|13 years ago|reply
MongoDB's native storage format is not JSON but BSON. One should know this before wading into a deep river to argue against it.
[+] sandGorgon|13 years ago|reply
Would it be possible to build an extension to expose a Mongo-compatible interface in Postgres? Essentially being able to use all the ORM/code built around Mongo with Postgres, seamlessly.
[+] lucian1900|13 years ago|reply
Yes, but why bother? There are already so many good SQL clients out there, like SQLAlchemy.
[+] grncdr|13 years ago|reply
Yes, I don't think it would even be that difficult if you used plv8js, but it would (probably?) be relatively slow.