NoSQL vs. RDBMS: Let the flames begin

[+] JunkDNA|16 years ago|reply

I completely do not understand this sort of thinking. It's like saying, "All you people driving motor vehicles: trains are better". It's a meaningless comparison. NoSQL systems are really great for certain things. RDBMS systems are really great for certain things. What I would really like to see is if all this effort writing these kinds of articles went into well-thought out pieces that discusses very specific cases NoSQL backends provide an advantage and why. Use cases that are not a blog, digg, or twitter clone would be most helpful. Some of us have to work outside the bay area for companies like banks, insurance companies, hospitals, etc... Can I use Cassandra as a data warehouse for electronic medical records? Hell if I know, without actually having to learn it and implement it to see if it works.

[+] stan_rogers|16 years ago|reply

If you look carefully, you'll find that a whole heck of a lot of those banks, insurance companies, and so forth are already using something that falls into the NoSQL realm -- Lotus Notes/Domino. The use case for a document-based, schemaless database is already well understood, if not always by the developers using it (or by the developers who grew up on a strict diet of relational databases).

It boils down to this: do you need addressing that would not be well-served by structured, tabular data? Relational databases are (or should be) extremely efficient at accessing data that can be arranged neatly into tables and read using the equivalent of pointer math. They suck, though, when the data is stored outside of the table (as large text fields, blobs, and so forth would be). The more heterogeneous and variable the data, the worse performance gets. A schemaless database gives up efficiency with tabular data, often precluding efficient runtime joins and, consequently, multiple orthogonal data access paths (data normalization) for a more efficient generalized access.

As a thought experiment, imagine a data quagmire where, in order to make the data fit a SQL table, every cell in every row in the database would need to be varchar(17767) and may contain multiple character-delimited values (or not --each record can be unique) and every row can have an arbitrary number of cells. That's what schemaless data can, and often does, look like -- and something like Notes or CouchDB can work with that comfortably and with an efficiency that cannot be matched by a relational representation of the same data.

[+] suhail|16 years ago|reply

Here's the issue with this entire topic:

1. It's not just which can do reads/writes better--it's not apples to apples, it's apples to oranges. You're right.

2. If you look the type of reads Cassandra can do and the type of reads MySQL can do. MySQL blows Casandra out of the water in terms of flexibility. (at a cost obviously)

MySQL simply has the best flexibility. Cassandra has a ways to go.

3. In Cassandra you essentially have to hack your data model to fit the data structure to make things work and if you decide one day you read things differently it's not always easy. You have to massively denormalize. (but hey disks are cheap as they say)

In a nutshell, use the best tool for the job. Cassandra happens to just fit lots of use cases so it's worth looking at. I don't think it would be best for a company to port everything over, MySQL is still very good at many things and the flexibility in reads is worth the cost.

You shouldn't wait for someone to have to tell you it's going to work for y use case. It really matters what queries you end up asking your data store and Cassandra really shines for simplistic ones and be hacked for complex ones. You need to just dive into the docs and http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-mode... is a great resource.

FYI: I love what Casandra does and I think it's the best of the NoSQL options out there.

[+] Semiapies|16 years ago|reply

Except of course, TFA is not that, but instead, "Train riders, we're not complete idiots for driving cars."

[+] siculars|16 years ago|reply

So my thinking on this is that the way these NoSQL systems will make their way into Healthcare is through the great backdoor of analytics. Virtually all available EHR/EMR systems available today allow various mechanisms for data retrieval based on primary indices. Unfortunately that avenue does not lend itself to secondary data reuse which will become more and more valuable as Institutions realize their data is valuable not only for academic research but for operational efficiencies re. money in the kitty.

Even more unfortunate still is that virtually all these installed systems do not have the performance capacity or advanced search-ability to adequately mine this growing horde of data. Administrators, under capex constraints, do not allocate resources for secondary systems which would duplicate data for mining purposes while alleviating strain on the principle production system. The no money budget problem will lead in house programmers to build these research systems on top of open source NoSQL solutions. There, the technology will prove itself.

Additionally, "NoSQL" comes in different flavors. Generally, all of them forgo the Consistency in CAP for Availability and Partition tolerance, which is fine for many use cases - just not primary medical data acquisition use cases. As the field matures programmers and system designers will learn how to make this work better to the point where one day NoSQL systems may be used as the primary data repository for medical data. However, that day has not come. For instance, Riak allows you to tweak knobs in order to favor of certain aspects of CAP theorem at different times while in production (specifically the w and dw parameters, http://blog.basho.com/2010/03/19/schema-design-in-riak---int...). But having just started working with Riak in the last month or two I would still only use it as an analytics tool exposing my medical record data to m/r jobs at this point. And before jbellis smacks me, I think Cassandra is awesome and I'm looking forward to spending some time with it but I'm still not putting my med app data in Casssandra just yet as a primary data store.

/Disclaimer. I work for a major University Medical Center and write business web applications in this area./

[+] nathanh|16 years ago|reply

I'd really like to see a set of use cases for each type of system too. I recently compared NoSQL systems in terms of the CAP theorem and the underlying data model (discussion here http://news.ycombinator.com/item?id=1190772). I haven't seen enough of these systems in production, but it would be great to see examples of each combination of CA, AP, or CP systems with relational, key-value, column-oriented, document-oriented, and graph (not featured in my post) data models.

[+] japherwocky|16 years ago|reply

yeah no but. if you need to move a lot of shit, trains are better.

flame on? maybe you don't understand because your startup doesn't have a lot of traffic?

you even admit that you don't really understand how the nosql dbs even work!

[+] dasil003|16 years ago|reply

Can I use Cassandra as a data warehouse for electronic medical records?

I don't think people building those systems are supposed to talk about them...

[+] unknown|16 years ago|reply

[deleted]

[+] antirez|16 years ago|reply

I don't want to enter into the details of the post, but it is impossible for me avoiding a reaction to this: "Do you honestly think that the PhDs at Google, Amazon, Twitter, Digg, and Facebook created Cassandra, BigTable, Dynamo, etc. when they could have just used a RDBMS instead?"

It is really impossible to argument something based on the fact that people that are supposed to be very smart are doing it. The only way to support arguments is by showing facts...

shameless plug as I don't want to post a message just to say this, but isn't HN too slow lately? I'm at the point that I visit the site less often than I was used to do as I don't want to experience the delay.

[+] aphyr|16 years ago|reply

I think in some cases this kind of appeal to authority can be valid.

Facebook has absolutely insane sparse matrices to handle. They handle enourmous volumes of traffic querying very specific (read: not cachable between users) datasets. Moreover, they've already invested mind-boggling amounts of capital into their stack. Same goes for Amazon with Dynamo. These people operate on scales that startups like us can't even comprehend; and they've found it worthwhile to write their own datastores for those scenarios. Moreover, their use of those databases has apparently contributed to their success. That, to me, is strongly suggestive evidence.

That and HA/fault-tolerance is a no-brainer; Cassandra's scaling characteristics rock the socks off of any SQL DB I've used. The consistency tradeoff is well worth it for some use cases.

[+] ubernostrum|16 years ago|reply

It is really impossible to argument something based on the fact that people that are supposed to be very smart are doing it.

There's been a strong undercurrent of posts which basically consist of ad hominem "non-SQL databases are only for idiots who can't figure out how to manage an RDBMS properly". Pointing out that there are people who very definitely are not idiots and who can manage an RDBMS quite effectively, but who feel non-SQL databases are still appropriate for their use cases is, so far as I'm concerned, an acceptable rebuttal to that.

[+] blhack|16 years ago|reply

>shameless plug as I don't want to post a message just to say this, but isn't HN too slow lately? I'm at the point that I visit the site less often than I was used to do as I don't want to experience the delay.

I've noticed that loading my comment and submission history is really slow, but loading everything else is as snappy as it's ever been.

I have a feeling that the front-page and comments get heavily cached, while our comment and submission history does not.

[+] fogus|16 years ago|reply

It is really impossible to argument something based on the fact that people that are supposed to be very smart are doing it. The only way to support arguments is by showing facts...

I read that as the obligatory appeal to authority that seems to impress some people. The rest of the post however was extremely interesting and likely as fact-filled as it gets when it comes to these SQL vs. NoSQL arguments.

[+] unknown|16 years ago|reply

[deleted]

[+] meroliph|16 years ago|reply

"Let’s say you have 10 monster DB servers and 1 DBA; you’re looking at about $500,000 in database costs."

I wonder what he thinks is a "monster db server", and considering he included the DBA in the price, is this the price per year or what?

Having recently set up a dual E5620 with 48GB of RAM and 8 SSD drives(160GB each) with a 3ware controller as well for just shy of 10K USD, I guess my understanding of "monster" is quite different. For 13K USD the same server would have 96GB of RAM.

[+] wvenable|16 years ago|reply

The numbers in the article are strangely inflated. In addition, as if the need for DBA disappears when you've simply changed your data storage software. Somebody still has to know how it works and manage it.

If you don't need a 50 node cluster because your RDBMS is pulling down big numbers, then you don't multiply the cost of the RDBMS solution by 50 either.

The numbers posted here are pretty reasonable. 37Signals spending $7,500 on disks isn't outrageous. That's less than the cost of a single developer integrating a different solution over a few months. How long has Digg been working on this transition and how many employees did it require? They've probably spent a fortune. Just not on hardware.

[+] AaronM|16 years ago|reply

I believe that he is also including the costs of the licenses for SQL for each processor in that 500k figure.

[+] freshfunk|16 years ago|reply

NoSQL vs RDBMS is really a proxy war for Denormalized vs Normalized storage.

You can take a system like Cassandra and treat data very much in a normalized way which would reduce performance. You can take a system like MySQL and completely denormalize your data which would increase performance.

Any test where one set of data is normalized and one isn't is not a fair test.

Also, denormalization can be a big deal. Unless you have some sophisticated code managing it for you, you're trading performance for data storage management complexity. Now you have to manage many instances of data X. But there is a benefit in that you avoid crazy joins.

I think both have concepts to learn from each other. For example, in order to use a NoSQL option effectively, you end up implementing your own concept of indexes, something very easily done for you in RDBMSes.

[+] Raphael_Amiard|16 years ago|reply

It's very fun how basically most people agrees about this subject, that both systems have their strong and weak points (for example i don't think i've seen many articles saying that facebook/amazon should have kept their whole system running on rdbms), but still the endless queue of blog posts goes on. Isn't this a case of violent agreement , as per http://c2.com/cgi/wiki?ViolentAgreement ?

[+] Semiapies|16 years ago|reply

I don't know about that; the fact that the highest-rated comment on this page is a rant written by someone who apparently didn't RTFA makes me think this is just another holy war in early stages.

[+] sunjain|16 years ago|reply

I guess this is a history repeating itself - before RDBMS there were hierarchical dbs, and other such technologies. There was a reason why RDBMS won the battle. And probably the top most reason was the simplicity with which you can define relation ship between data, store this data(with relationship) and access it easily. We are all use to using SQL, and even though with all the ORMs in the world, it is still probably a very simple(yet powerful) language. If you look at HTTP, the corner stone of web, one of the reasons it has been so popular, is because of it's simplicity yet powerful. And it can thought to represent the same simplicity as RDBBMS (everything revolves around some really basic operations - read, write, update, delete <-> GET, POST, PUT, DELETE). And historically, invariability the technologies which hides the complexity of what they do by providing a simple interface, normally win. And that was one of the main reasons for the success of RDBMS. And which still remains true. And that was precisely the reason why every one started using RDBMS for blog type application, even though it is not the best use of RDBMS. Come to think of it, why would you want to store lot of text (content of each blog) for each blog id in RDBMS - but people did (because it was so simple to do, and that is what they are used to). Hence the use of RDBMS for these type of applications was debatable to begin with. However if you look at transactions management that is required by some of the financial applications, for example, I doubt how far the NOSQL solutions will go in satisfying the requirement. With regards to the scalability, I will not get into the cost factor. Because there are different ways of calculating the actual "cost" of something - if you are Google - the cost of having few PhDs writing your own filesystem and db(or NOSQL) is not much compared to the benefits you are going to get out of it. But if you are not Google, it is a different story. So if we leave the cost factor aside, for a moment, the list of options available with some of the high end RDBMS technologies(with regards to performance), for example Oracle, are quite broad - from active/active clustering technologies(RAC), from different indexing types(b-tree, bitmap, clustered, index-organized), multitude of partitioning types - range based, hash-based(combination of these), list based etc, the list goes on. Same is true about Subase or SQL server(except of active/active clustering). So I am sure the performance issues can be handled in these RDBMS technologies (without just throwing hardware at it).

[+] mark_l_watson|16 years ago|reply

I've never read anything by Joe Stump before but I just bookmarked this article so I can peruse more of his stuff later.

I liked the way he personalized his argument to his own deployment situation rather than making generalizations. I also liked to hear about his experience with Cassandra (5 minutes to clone a hot node and have it balanced and in production).

[+] ck2|16 years ago|reply

Understanding the limits of NoSQL from a RDBMS perspective:

http://www.youtube.com/watch_popup?v=LhnGarRsKnA

(slides: http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=no-sql... )

[+] timtrueman|16 years ago|reply

NoSQL/NoREL is a tradeoff of features for performance. If you don't need certain features and you can make the trade and need the extra performance than it can make sense for some people. I don't think it applies to everyone. They made the decision that was best for them and congrats on that.

Also I can say rotational disks may not provide the economics that make RDBMS seem attractive—but FusionIO cards have really changed that. And I didn't just read the datasheet and get a nerd boner. I watched the queries from 8 beefy physical database boxes (that were getting hammered) combined onto one physical box that was identical in all ways except it had an FusionIO card. It handled 8x the number of queries with ease and could have taken a lot more punishment. Yes, the cards are expensive but in the scheme of getting rid of 7 servers it was actually saving significant amounts of money.

[+] chime|16 years ago|reply

I just wish someone would offer a decent hosted NoSQL platform that I can start using already. https://cloudant.com/ is invite only. So is http://hosting.couch.io/ apparently. https://app.mongohq.com/signup is overpriced, considering it's a cloud service and only 2GB. SimpleDB isn't bad but it has tons of limitations (can't even sort by numbers).

This is why people still use MySQL even for projects that aren't suitable for RDBMS. I use hosted MySQL at dreamhost and don't have to bother with anything except my app and data. It just works and is free with the web hosting package. Is there anything out there that comes close? I don't mind $1/month for 1GB of data. $25 for 2GB is not worth it.

[+] dryicerx|16 years ago|reply

There aren't that many because there's not a big market for it. You ideally want the DB in close proximity (latency wise) to your web server or what ever uses it directly.

Why not host it your self? Deploying a server like MongoDB is trivial to get going.

[+] ryandvm|16 years ago|reply

Isn't that SimpleDB?

[+] wvenable|16 years ago|reply

This response to Forbe's article doesn't address the central premise:

"Shocked by the incredibly poor database performance described on the Digg technology blog, baffled that they cast it as demonstrative of performance issues with RDBMS’ in general, I was motivated to create a simile of their database problem."

The central question here isn't so much the maximum performance you can get out of RDBMS system, or how it compares to a NoSQL solution, but how Digg is getting such terrible performance out of their RDBMS design! The numbers are just don't add up.

This article is just a bunch of straw men and that avoids that main issue. And arguing that $7,500 is too much for a serious web SaaS vendor to spend is just comical.

[+] wanderr|16 years ago|reply

Didn't he?

"Has anyone ran benchmarks with MySQL or PostgreSQL in an environment that sees 35,000 requests a second? IO contention becomes a huge issue when your stack needs to serve that many requests simultaneously."

my answer to this point is that IO contention can be vastly reduced in MySQL (and probably even better handled in Postgres, I bet) with some tweaking of settings and lots of memory. Memory is pretty cheap these days, so stuffing a server full of RAM is really not a bad option.

[+] johnrob|16 years ago|reply

It's just a spectrum. At one end, where all we do is write, we use log files. At the other end, where all we do is read, we cache static content at network hubs. There's a lot in between. An RDBMS is one tradeoff, 'NoSQL' represents another. They simply imply different types of read/write tradoffs.

[+] StrawberryFrog|16 years ago|reply

At Digg we had probably a hundred or so tables, each table had varying indexes (a char here, an integer there, a date+time here)

This may be part of the problem, actually. 100 tables to serve posts with attached comments? Um.

[+] blinks|16 years ago|reply

There's a bit more going on there. See http://www.codinghorror.com/blog/2009/07/code-its-trivial.ht... for a related principle.

[+] wvenable|16 years ago|reply

Actually, my thinking was 100 tables... so what! That's not a large number. Any software of average complexity will have 100 tables, easy.

[+] hvs|16 years ago|reply

Just because some bloggers are obsessed with arguing whether NoSQL or RDBMSs are "better" doesn't mean we need to post every article to HN. Why don't we agree to just stop?

[+] gnaritas|16 years ago|reply

Perhaps because plenty of people are interested in the subject. If you don't like it, don't read it.

[+] Zak|16 years ago|reply

Furthermore, this is an operational expense as opposed to a capital expense, which is a bit nicer on the books.

Something about this seems broken. Why would it be inherently "nicer" to spend money on a service as you use it than on a product that you get to keep?

[+] wrs|16 years ago|reply

Because we're talking about a business. Capital expenses must be written off as depreciation over time. Operational expenses can be written off immediately. For tax reasons this can be a big deal.

However, you can buy servers through a leasing company to get this benefit; you don't have to use EC2.

[+] japherwocky|16 years ago|reply

<3

NoSQL is faster to develop/prototype with also, since you only need to understand json dictionaries.

So your shit ships faster, and scales cheaper.

[+] onetimeiter|16 years ago|reply

use whatever makes your product useful!

89 comments