
SQL Databases Are An Overapplied Solution (And What To Use Instead)

56 points | conorgil145 | 16 years ago | adamblog.heroku.com | reply

62 comments

[+] wvenable|16 years ago|reply
There's definitely a lack of imagination among proponents of NoSQL solutions, and this article is no exception. While we may want to store, for example, an entire e-commerce order in a single operation, we don't necessarily want to retrieve it that way. That type of storage makes otherwise simple operations considerably more difficult. Do you want to know how much you made in sales today? How many of widget #453 are still in stock? The most popular items for sale? The least popular? Start writing code.

The other example, a user profile, seems a perfect fit for storage in a single table, so I don't know why it's there. Now, do you want to know how many users logged in this week? How about deleting every account that hasn't been accessed in a year? Much slower code.

Every article about NoSQL goes on and on about the supposed advantages, but rarely talks about the considerable disadvantages. And honestly, for most people, the advantages are simply not worth the trade-off. I fear that we'll have decades of e-commerce stores written with document stores and mountains of code slowly chugging away to calculate the basic stats that any business needs.

[+] viraptor|16 years ago|reply
Have you used any NoSQL solution seriously? Let's see some examples (imaginary Python-like interface, so I don't have to be language/backend-specific). Like you say: start writing code.

Do you want to know how much you made in sales today?

    sum(amount in db.filter(type='order', date=xxx))
    -vs-
    SELECT SUM(amount) FROM order WHERE date=?
How many of widget #453 are still in stock? (IRL it's never that simple, but...)

    db.filter(type='stock_widget', part_id=453)['amount']
    -vs-
    SELECT amount FROM storage WHERE id = 'widget 453'
Most popular:

    for r in db.filter(type='item', date=xxx): histogram[r['part_id']] += 1
    histogram.sort_value()[0].key()
What I wanted to show is that you're writing about the same amount of code in both cases. In some databases (like Tokyo Tyrant) you can also run the script server-side and just report the result if you prefer. Also, depending on the database, you don't need to read the whole record every time; most of them let you request just a list of fields.
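To make the comparison concrete, here is a runnable sketch of the "sales today" example from above, with a list of dicts standing in for a document store (the field names and the sample data are invented for illustration) and SQLite standing in for the relational side:

```python
import sqlite3

# Hypothetical "document store": a list of dicts standing in for
# something like Tokyo Tyrant. Field names and values are made up.
docs = [
    {"type": "order", "date": "2009-10-19", "amount": 10.0},
    {"type": "order", "date": "2009-10-19", "amount": 25.0},
    {"type": "order", "date": "2009-10-18", "amount": 99.0},
]

# Document-store style: filter and sum in the host language.
today_total = sum(d["amount"] for d in docs
                  if d["type"] == "order" and d["date"] == "2009-10-19")

# SQL style: the database does the aggregation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(d["date"], d["amount"]) for d in docs])
(sql_total,) = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE date = ?",
    ("2009-10-19",)).fetchone()

assert today_total == sql_total == 35.0
```

Either way it is a few lines; the real difference is where the work runs (client-side iteration versus server-side aggregation).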

Have some fun with a NoSQL database before rejecting it for reasons like the ones you mentioned. It's also not always about processing speed: I could use either solution, but coding for Tokyo Tyrant is just simpler than SQL for most of what I do. Notice how my db.filter examples give you the result directly in the host language, whereas SQL queries are strings you have to run and then retrieve results from (I'm ignoring LINQ-to-SQL here).

[+] wvenable|16 years ago|reply
From the article, on where relational databases are appropriate:

> Small records with complex, well-defined, highly normalized relationships.

Why do the records need to be small? And honestly, in software development, a large amount of your data is going to be well-defined and easily normalized. The author provided two examples that would fit perfectly in a relational database.

> The type of queries you will be running on the data is largely unknown up front.

Or the types of queries you are running are more than just retrieving a single record or simple list of records. I'm afraid a very large number of queries fall into this category.

> The data is long-lived: it will be written once, updated infrequently relative to the number of reads/queries, and deleted either never, or many years in the future.

Yes, relational databases are for storing long-lived data. For temporary data, you could use an in-memory table or some other solution entirely. There's no need to mix your permanent data with your temporary data. Databases handle writes and deletes extremely well (even in bulk), so I'm not sure what the author was getting at here.

> The database does not need round-the-clock availability (middle-of-the-night maintenance windows are no problem).

What kind of middle-of-the-night maintenance does a relational database need? I've been running at least one database for several years straight without any downtime or maintenance.

> Queries do not need to be particularly fast, they just need to return correct results eventually.

Relational database queries aren't particularly slow -- in fact, RDBMSs are heavily optimized to return data very quickly. In the vast majority of cases, this is going to be more than fast enough for nearly every application.

> Data integrity is 100% paramount, trumping all other concerns, such as performance and scalability.

Damn straight. I want the data coming from my data store to be 100% correct always. If I need to trade performance for correctness then I can easily add some caching. But I'm not sure how document stores would solve this any differently.

[+] Retric|16 years ago|reply
I can only assume the reason this was at negative 1 is that most people interested in this topic are already on the (not just SQL) bandwagon. IMO, scaling is a non-issue for most well-designed websites, and as computers get faster this only becomes more apparent. There is a significant advantage to separating complex sites into independent modular components, and only a tiny fraction of sites need to scale beyond this point. When you actually need to expand, fine, go down that rabbit hole; but for most people it's a complete waste of time.

PS: I suspect the main problem developers actually have with SQL databases is that their ORM is significantly less powerful than SQL. All too often developers focus on row-as-object and forget the power of more abstract data structures.

[+] jmm|16 years ago|reply
I think more specific context on "SQL databases don't scale" might be appropriate here, especially for the noob hackers that are figuring out just which technology to use in building their first or second apps. Is it worth ignoring defaults and conventions on a simple Rails app and not use an SQL based db? Probably not? And I reckon that Heroku itself hosts a good deal of apps where "will this db scale?" isn't really relevant. I could be wrong.

Another question worth knowing the answer to is how much of a hassle it is to switch horses midstream (from a well-normalized SQL db to something else) after there's some data in the tank.

Maybe this is Adam's recommendation specifically to the dreaming-big community, which I can certainly appreciate. And maybe everyone should be dreaming big.

[+] gridspy|16 years ago|reply
Gridspy monitors power usage in real time. My original plan was to effectively stream the live data through the database as it passed from collection (sensors) to display (on the website). The frequent writing to and polling of the database to keep data "live" was absolutely killing performance.

Now I have moved to a Twisted application that aggregates the data and does occasional writes into the DB. It can answer webserver queries for the latest data out of its internal data structures, and streams live data to the user's browser via Orbited.

See http://blog.gridspy.co.nz/2009/09/database-meet... (the database side)

and http://blog.gridspy.co.nz/2009/10/realtime-data... (the whole application structure)

[I posted this on the original site too]
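The aggregate-then-flush pattern gridspy describes can be sketched without Twisted or Orbited. Readings accumulate in memory and only a periodic summary row touches the database; all names here are illustrative, not from the actual Gridspy code:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE power_summary
                (sensor_id INTEGER, window_end REAL,
                 samples INTEGER, avg_watts REAL)""")

pending = {}  # sensor_id -> readings not yet flushed

def record_reading(sensor_id, watts):
    # Called at high frequency; touches only memory, never the DB.
    pending.setdefault(sensor_id, []).append(watts)

def flush():
    # Called occasionally (say, once a minute); one write per sensor.
    now = time.time()
    for sensor_id, readings in pending.items():
        conn.execute("INSERT INTO power_summary VALUES (?, ?, ?, ?)",
                     (sensor_id, now, len(readings),
                      sum(readings) / len(readings)))
    pending.clear()
    conn.commit()

for w in (100.0, 110.0, 120.0):
    record_reading(1, w)
flush()

row = conn.execute("SELECT samples, avg_watts FROM power_summary").fetchone()
# row is (3, 110.0): three raw readings became one summary row
```

Live queries then read `pending` directly, while the database only ever sees the slow, batched summaries.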

[+] simonw|16 years ago|reply
I liked the definition of "transient data" - I've been promoting Redis for that kind of thing but I didn't have the vocabulary to explain what I meant (I've been saying "stuff like stats and counters"). Defining transient data as being frequently written and infrequently read is useful too.
[+] jazzychad|16 years ago|reply
I have been describing this type of data as "high velocity" to my friends, and they seem to get it.
[+] pvg|16 years ago|reply
Hang on, SQL databases are poor at highly structured things like an order and that's the sort of thing suitable for a document database?
[+] CaptainZapp|16 years ago|reply
It doesn't feel like the author has real-world experience with relational databases. Let's see:

"Small records with complex, well-defined, highly normalized relationships."

Maybe, but not necessarily. Example? Call Data Records for a phone billing application. While there may be relationships in the inserted rows, even to other databases (e.g. customer information), the data stands pretty much on its own. The challenge is to get floods of data into a single table, so that it can later be sliced and diced for billing purposes.

"The type of queries you will be running on the data is largely unknown up front."

Excuse me? This is so very wrong I don't even know where to start. EVERY good RDB application is designed in a way where (ideally) all queries are known up front. If you have to guarantee response times (think of a cash withdrawal at an ATM), you absolutely MUST control the queries that run on the db. Ad-hoc analysis and reporting MUST be relegated to dedicated, possibly replicated databases.

"The data is long-lived: it will be written once, updated infrequently relative to the number of reads/queries, and deleted either never, or many years in the future."

Mostly so, but absolutely not necessarily the case. I work on an application where the entire data set is toast after a couple of weeks; in fact, today and yesterday would suffice.

"The database does not need round-the-clock availability (middle-of-the-night maintenance windows are no problem). Queries do not need to be particularly fast, they just need to return correct results eventually."

This is so full of crap, I won't even get into it

"Data integrity is 100% paramount, trumping all other concerns, such as performance and scalability"

Yes, 100% integrity is paramount, but most certainly not at the cost of scalability, let alone performance.

Recently, methinks, there are a lot of proponents of new and improved data management capabilities who see only their little walled environment, but seem to have no experience whatsoever running databases in a real business. A normal business (even a big, huge, or multinational one) does not have those "cloud data management requirements" that very, very few companies really have.

[+] blasdel|16 years ago|reply
Consider the source: Heroku needs an alternative to Postgres to be a true competitor to AppEngine.

There's something that never gets brought up in these NoSQL discussions: SQL databases don't scale down. They aren't very good in multitenant situations where you have a lot of random small-fry users -- you end up just sharding the users across a bunch of different master-slave pairs, and hoping that they don't step on each other's toes. Because they take up real resources even when not being used, it's difficult to pull off a freemium model.

[+] silentbicycle|16 years ago|reply
SQLite scales down fine, FWIW. It's just not appropriate past a certain amount of concurrent writes, but it's great for prototyping and smaller services.
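The reason SQLite scales down so well is that there is no server process at all: a database is just a file (or, as in this sketch, memory), so a prototype pays essentially zero operational cost.

```python
import sqlite3

# No daemon, no credentials, no network: the whole "deployment"
# is this one connect() call. Swap ":memory:" for a filename to
# get a durable single-file database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()

body = conn.execute("SELECT body FROM notes WHERE id = 1").fetchone()[0]
assert body == "hello"
```

The concurrent-write ceiling silentbicycle mentions comes from SQLite locking the whole database file for writes, which is exactly the trade-off that makes it unsuitable past a certain load but ideal below it.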
[+] asolove|16 years ago|reply
Although this topic is frequently discussed on HN, and the portion about document stores is well-hashed, I found the tidbit on binary assets/file uploads very interesting, as I had only used disk before.
[+] gte910h|16 years ago|reply
Yeah, I was happy to find out about Tokyo Tyrant as well, had never seen it mentioned.
[+] peterwwillis|16 years ago|reply
There is no explanation of why a filesystem is a bad place to store "binary blobs".

If you're collapsing the metrics that you're storing IN SQL, there is something really wrong going on.

Logs are OK to store in SQL, assuming you're scraping your logs properly and are logging the proper things. Logging every clickthrough in a relational database is somewhat insane. Logging 10 minutes worth of aggregate clickthroughs is perfectly fine. If you think logs should be a ring buffer I challenge you to tell that to any admin of a system that is subject to laws governing the length of time you must store logs (which is pretty much all e-commerce?).
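The "log 10 minutes of aggregates, not every click" idea above amounts to this kind of sketch (the table and field names are made up for illustration):

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE click_counts
                (url TEXT, interval_start TEXT, clicks INTEGER)""")

clicks = Counter()

def on_click(url):
    clicks[url] += 1  # per-click work is only an in-memory increment

def flush(interval_start):
    # Run every ~10 minutes: a handful of rows instead of thousands.
    conn.executemany(
        "INSERT INTO click_counts VALUES (?, ?, ?)",
        [(url, interval_start, n) for url, n in clicks.items()])
    clicks.clear()
    conn.commit()

for _ in range(1000):
    on_click("/product/453")
on_click("/about")
flush("2009-10-19T12:00")

rows = dict(conn.execute(
    "SELECT url, clicks FROM click_counts").fetchall())
# 1001 clicks became 2 rows
```

The raw per-click log, if regulations require keeping it, can still go to flat files; only the aggregates need to live in the relational store.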

[+] dangrossman|16 years ago|reply
Meh, I've been recording a couple million page views a day to a MySQL database (a new row for every page view) for 6 years. Works fine. I don't expect a million new users to show up any time soon.
[+] jcapote|16 years ago|reply
I agree wholeheartedly with this article except for the part about S3/CloudFront being the only tool for storing binary assets. MongoDB's GridFS extension does a fine job of handling files in the database, since they shard and replicate just like database records. As for serving them out, you can run it on a cheap unmetered server using nginx + http://github.com/mdirolf/nginx-gridfs
[+] cookiecaper|16 years ago|reply
I can feel the sincerity of the author, but ultimately I think he's wrong that the solution to this is NoSQL.

It really would be nice to send an e-commerce order as JSON and have my database know what to do with it. I think we still need the flexibility and power of a relational database behind it, but if someone extended PostgreSQL to accept records the way CouchDB and others do, and taught it how to store them into rigid, joinable, relational tables, that would help a lot. All of the advanced relational functionality would still be available when needed, but by default you could write and retrieve data in a document format that the database had previously, transparently, been taught to map onto tables. That would be awesome.
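What cookiecaper wishes for can be approximated today in application code. This is a hypothetical sketch: the schema and the `store_order()` helper are invented, and nothing like this is built into PostgreSQL or CouchDB.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE line_items (order_id INTEGER REFERENCES orders(id),
                             sku TEXT, qty INTEGER);
""")

def store_order(doc_json):
    # Accept a whole order as one JSON document...
    doc = json.loads(doc_json)
    # ...but decompose it into normalized, joinable tables.
    cur = conn.execute("INSERT INTO orders (customer) VALUES (?)",
                       (doc["customer"],))
    order_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO line_items VALUES (?, ?, ?)",
        [(order_id, li["sku"], li["qty"]) for li in doc["items"]])
    conn.commit()
    return order_id

oid = store_order('{"customer": "alice", '
                  '"items": [{"sku": "widget-453", "qty": 2}]}')
qty = conn.execute(
    "SELECT qty FROM line_items WHERE order_id = ?", (oid,)).fetchone()[0]
assert qty == 2
```

The writer sees one document per order; the analyst still gets normalized tables to aggregate over.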

[+] oomkiller|16 years ago|reply
I think you just described inserting into a SQL view. Never done it myself, but it sounds helpful.
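For the curious, here is oomkiller's suggestion sketched in SQLite: a view over two normalized tables, made writable with an INSTEAD OF INSERT trigger. Most major RDBMSs support some form of updatable view, though the syntax varies; the table and view names here are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emails (person_id INTEGER, email TEXT);

    -- A denormalized "document-ish" face over the normalized tables.
    CREATE VIEW person_email AS
        SELECT p.name, e.email
        FROM people p JOIN emails e ON e.person_id = p.id;

    -- The trigger decomposes an insert on the view into the base tables.
    CREATE TRIGGER person_email_insert
    INSTEAD OF INSERT ON person_email
    BEGIN
        INSERT INTO people (name) VALUES (NEW.name);
        INSERT INTO emails (person_id, email)
            VALUES (last_insert_rowid(), NEW.email);
    END;
""")

conn.execute("INSERT INTO person_email VALUES ('bob', 'bob@example.com')")
row = conn.execute("SELECT name, email FROM person_email").fetchone()
assert row == ("bob", "bob@example.com")
```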
[+] sant0sk1|16 years ago|reply
Does Heroku offer any alternate persistence engines? I thought they were PostgreSQL only.
[+] davidnelson|16 years ago|reply
I love the document record pattern. I use it on Google App Engine, which is a tad tricky to use as a relational db. It actually ends up being way simpler and faster. For instance, you can model an app with just one "table" (Kind). You can have, say, a user id as the primary key, then blob fields that store your serialized document records. The reason it's so fast is that all you're doing is a simple index scan to look up all the data you want. It also makes it really simple to store items in memcache.
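The pattern davidnelson describes boils down to a two-column key/blob table. This sketch uses SQLite and JSON serialization purely for illustration; App Engine's actual Kind/Entity API looks different.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One "table": primary key plus a blob of serialized document data.
conn.execute("CREATE TABLE documents (user_id INTEGER PRIMARY KEY, doc BLOB)")

def put(user_id, record):
    conn.execute("INSERT OR REPLACE INTO documents VALUES (?, ?)",
                 (user_id, json.dumps(record)))

def get(user_id):
    # A single primary-key lookup, which is why this pattern is fast.
    row = conn.execute("SELECT doc FROM documents WHERE user_id = ?",
                       (user_id,)).fetchone()
    return json.loads(row[0]) if row else None

put(7, {"name": "alice", "prefs": {"theme": "dark"}})
assert get(7)["prefs"]["theme"] == "dark"
```

Because the value is already one serialized blob, dropping it into memcache under the same key is trivial; the flip side, as others in this thread note, is that any query beyond key lookup means deserializing and scanning in application code.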
[+] thewileyone|16 years ago|reply
I disagree with the comment "SQL databases don't scale". If you've got a decent database architect, scalability (not capacity) stops being an issue.
[+] marshallp|16 years ago|reply
These NoSQL people are missing the point of relational modeling: that you can easily and incrementally evolve your data model. It's why object databases never caught on.

SQL databases are absolutely beautiful and elegant when you think of them in terms of the Codd relational model -- in my opinion, the best thing computer science has produced so far.

The only limitation of relational databases currently is their lack of automatic infinite horizontal scaling on commodity servers, but hopefully someone will solve that soon.

[+] cduan|16 years ago|reply
Well, is this a complaint about relational SQL databases? The only issues the author has that actually are tied to the SQL language are #2 (queries are largely unknown up front) and possibly #1 (small records with normalized relationships). Most of the other complaints have to do with ACID data consistency and the overhead it imposes in simple implementations (e.g. no row-level locking).
[+] gnaritas|16 years ago|reply
And you can easily and incrementally evolve your data model in an object database too with the added benefit that the application is updated to match the new data model at the same time.

> SQL databases are absolutely beautiful and elegant when you think of them in terms of the codd relational model

And horrible and brittle in terms of the application model, which in 99% of cases is not relational.

[+] peterwwillis|16 years ago|reply
If by "solve" you mean "completely redesign modern RDBMSs", you may be waiting a while. I'm no expert, but my limited understanding is that most of these services provide a singular interface to a database. To achieve "automatic infinite horizontal scaling" they'd need to support an infinite number of interfaces on any of these commodity servers. I could be wrong, because again I don't fully understand them, but I think that means basically implementing the equivalent of an 'intelligent' distributed, parallel, fault-tolerant file system. At that point I figure you could add "lots of glue" (heh, understatement?) to MemcacheDB, MemcacheD, MemcacheQ and some other crap and get something similar. But I could be crazy.
[+] w3matter|16 years ago|reply
Yeah, Heroku should put up or shut up until they start offering alternate NoSQL databases (and no, MongoHQ is not an alternative because of network latency).

Interesting article though.

[+] cracell|16 years ago|reply
As they are hosted on EC2 couldn't you use Amazon Cloudfront with them without too much trouble and with no latency issues? http://aws.amazon.com/cloudfront/

The main issue would be integrating Cloudfront with Rails not getting it to run on Heroku. (A quick search doesn't show much support for using Cloudfront with Rails in general)

[+] benwyrosdick|16 years ago|reply
If Heroku offers their own NoSQL alternative, they will put it on a different EC2 instance and have the same network latency so MongoHQ seems like an equal solution. They provide private IP addresses to connect to so it is the same as running it on your own separate EC2 instance. How does anyone separate their app and db server without creating latency in your opinion?