
NoSQL is What?

137 points| timf | 14 years ago |blog.zawodny.com | reply

60 comments

[+] fauigerzigerk|14 years ago|reply
Clearly, we have to identify the qualities of NoSQL that are not related to scaling or performance for the debate to make any sense. I don't think it is possible in general to define those qualities, because NoSQL systems don't have much in common. Using a negation to name the category is telling in itself.

You mention schemaless, but none of the BigTable-derived systems are schemaless. Key-value stores are schemaless, but an RDBMS can do key-value storage just fine, as can file systems.
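To make the "RDBMS can do key-value storage" point concrete, here is a minimal sketch using SQLite from Python's standard library. The table name `kv` and the helpers `put` and `get` are made up for illustration; they are not from any particular product.

```python
import sqlite3

# In-memory SQLite database standing in for "an RDBMS".
conn = sqlite3.connect(":memory:")

# One two-column table is enough for key-value storage (hypothetical name: kv).
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")


def put(key, value):
    """Store a value, overwriting any existing value for the same key."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))


def get(key, default=None):
    """Look a value up directly by its primary key."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else default


put("user:1", '{"name": "alice"}')
put("user:1", '{"name": "alice", "plan": "pro"}')  # second write replaces the first
print(get("user:1"))
```

The value column is an opaque string here, so the database enforces nothing about its structure, which is roughly what "schemaless" means for a key-value store.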

I think this whole debate boils down to whether or not you need to normalize data. If you normalize, you need joins and that's the weak spot of most NoSQL systems. Doing joins in procedural code requires all data to be transferred into application process memory, which is only viable for modest amounts of data. (I'm not saying that only RDBMS can ever do proper joins, just that the popular NoSQL solutions in use today don't)
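A small sketch of the difference between a join done by the database and a join done in procedural code, again assuming SQLite as a stand-in. The `customers` and `orders` tables and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0);
""")


def totals_in_database():
    """Let the database join and aggregate; only the result rows come back."""
    rows = conn.execute("""
        SELECT c.name, SUM(o.total)
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
    """)
    return dict(rows)


def totals_in_application():
    """Join in procedural code: every customer and every order is pulled
    into application memory before any matching can happen."""
    customers = dict(conn.execute("SELECT id, name FROM customers"))
    totals = {}
    for customer_id, total in conn.execute("SELECT customer_id, total FROM orders"):
        name = customers[customer_id]
        totals[name] = totals.get(name, 0.0) + total
    return totals


print(totals_in_database())
print(totals_in_application())
```

Both functions give the same answer on three rows. The second one has to move both tables across the wire first, which is the part that stops being viable as the data grows.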

Normalization is also what mandates ACID because normalization means you're losing what I would call the "physical unit of consistency". Normalization, joins and ACID go together. It's all or nothing. (Of course pragmatically it's never all or nothing but it's useful to highlight the general point)
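As an illustration of why normalized data leans on transactions, here is a sketch in which one logical change touches two tables. It assumes SQLite; the `stock` and `order_lines` tables and the `place_order` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock (
        item TEXT PRIMARY KEY,
        qty  INTEGER NOT NULL CHECK (qty >= 0)
    );
    CREATE TABLE order_lines (
        id   INTEGER PRIMARY KEY,
        item TEXT NOT NULL,
        qty  INTEGER NOT NULL
    );
    INSERT INTO stock VALUES ('widget', 5);
""")


def place_order(item, qty):
    """Record the order line and decrement stock as one atomic unit.

    The two rows live in different tables, so without a transaction a
    failure between the statements would leave them inconsistent.
    """
    try:
        with conn:  # commit if both statements succeed, roll back otherwise
            conn.execute(
                "INSERT INTO order_lines (item, qty) VALUES (?, ?)", (item, qty)
            )
            conn.execute(
                "UPDATE stock SET qty = qty - ? WHERE item = ?", (qty, item)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected negative stock; the insert was undone too.
        return False


first = place_order("widget", 3)   # fits within the available stock
second = place_order("widget", 3)  # would drive stock negative
```

In a denormalized design the order and its stock count might sit in one record, which is then the "physical unit of consistency" and needs no cross-table guarantee.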

So, my conclusion is this: Use RDBMS or don't normalize (much). All the debates around RDBMS or NoSQL being simpler or more complicated turn out to be implicit debates about the need for normalization. When some people say this or that model is simpler, they either imply or don't imply a need for normalization.

In my view, whether or not you need to normalize depends primarily on whether or not the data is single purpose or multi purpose. If it's one app and its own private data island, then not normalizing often makes sense for simplicity and performance reasons.

If the data has its own separate life cycle, independent of any individual app, then not normalizing is a terrible mistake that brings down everyone's productivity no matter how simple it may appear initially.

Having worked on data integration and analytics projects for many years, I'm leaning towards the view that most data is multi purpose even if it's not initially expected to be. But that may well be survivorship bias, as apps that die young never cause integration issues. That doesn't mean they haven't fulfilled their original purpose.

[+] andrewcooke|14 years ago|reply
> If the data has its own separate life cycle, independent of any individual app

this is the conclusion i've been heading towards too. when you use an RDBMS there's typically a layer of abstraction between the base schema and the domain model (the concepts your application "works with"). it may be nothing more than the queries used to extract data, or it may be a complex set of views and triggers. but deciding whether or not that layer is "a good thing" helps choose between relations and nosql approaches. if your data have their own logic, separate from your application, then this layer helps you match your (often evolving) application to the (frequently more static) data. but if your data are closely tied to your application then it simply "gets in the way".
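a rough sketch of that layer, assuming SQLite and a made-up schema (`people`, `emails`, a `contacts` view, and a `load_contact` helper, none of which come from a real system): the application only reads the view, so the base tables can be reorganized later without touching application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base schema: normalized, shaped by the data's own logic.
    CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE emails (
        person_id  INTEGER REFERENCES people(id),
        address    TEXT,
        is_primary INTEGER
    );
    INSERT INTO people VALUES (1, 'Ada', 'Lovelace');
    INSERT INTO emails VALUES (1, 'ada@example.org', 1), (1, 'old@example.org', 0);

    -- Abstraction layer: a view shaped like the application's domain model.
    CREATE VIEW contacts AS
    SELECT p.id,
           p.first_name || ' ' || p.last_name AS display_name,
           e.address AS email
    FROM people p
    JOIN emails e ON e.person_id = p.id AND e.is_primary = 1;
""")


def load_contact(contact_id):
    """Application code sees only the view, never the base tables."""
    row = conn.execute(
        "SELECT display_name, email FROM contacts WHERE id = ?", (contact_id,)
    ).fetchone()
    return None if row is None else {"name": row[0], "email": row[1]}


print(load_contact(1))
```

if only one application ever reads this data, the view is arguably just extra ceremony, which is the "gets in the way" case.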

and i agree, too, that data typically do have their "own" logic. on the other hand, i think people could argue that there are cases where one application becomes so large, and so dominant, that it can make sense for that to "drive" the data. which helps explain the idea that nosql and scaling go together (when you really, desperately, need to scale, it could be because one particular thing is so huge that it drives everything else).

[+] St-Clock|14 years ago|reply
I really like this idea of basing your decision on the need to normalize or not. It certainly fits document-oriented databases and key-value stores, but I'm less sure about column-oriented databases (I have no experience with them, except many hours trying to understand them...).

The single-purpose vs. multi-purpose that comes from denormalization vs. normalization would explain why certain companies stick to RDBMS for their main data and use NoSQL only for certain specific scenarios.

[+] vog|14 years ago|reply
> Key-value stores are schemaless but RDBMS can do key-value storage just fine as can file systems.

Moreover, any serious RDBMS has an XML data type for efficient storage of unstructured/semi-structured data. Some even include index mechanisms to handle XPath/XQuery expressions efficiently, so handling tree-structured data isn't an issue, either.

[+] haberman|14 years ago|reply
"threw up in my mouth a little." "Gee, let me get this straight." "Bullshit." "Seriously?"

I have to say that one of my regrets about growing up in programmer circles is seeing stuff like this held up as an acceptable example of how adults communicate with other adults.

It took me a long time to realize that this style of communication is not necessary, is not effective, and reflects poorly on the speaker. C'mon, this guy appears to be in his 30s or 40s and has written books, so why does he write like he's an angsty teen? (I know Linus does it. I think it's lame when he does it too.)

There's still room for humor and snark, here are three of my favorite blog postings/articles ever, all very snarky, but not embarrassingly juvenile:

http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-...

http://diveintomark.org/archives/2004/01/14/thought_experime...

http://www.info.ucl.ac.be/~pvr/decon.html

[+] rdouble|14 years ago|reply
If you're a programmer of a certain vintage, you've been reading writing like this on mailing lists for years. It's easy to think programmers in particular produce embarrassing writing. However, most blogs are written in this style, not just blogs by programmers. This issue is not confined to the internet, or the contemporary age. The editorial page in my hometown's daily paper was not any better. If you read historical letters to the editor in regional newspapers, you'll find that adults have been communicating in poorly written, juvenile language for centuries.
[+] neilk|14 years ago|reply
This is how adults communicate with one another. You are longing for some era of erudite euphemism that never was and never will be.

Blogging is vernacular. Just because you see it in print doesn't mean we have to judge it like an essay from the Saturday Evening Post.

If Zawodny's piece was only snark, you might have a point, but he backs it up with solid argumentation. If you paid more attention to his vinegary interjections than the main point, it says more about you than it does him.

[+] j_baker|14 years ago|reply
I suspect this is something you and I will never agree on. You see, I think it reflects poorly on someone to see bullshit and not call the bullshitter out. That doesn't give someone unlimited license to say what they want. It's just that if someone is spewing bullshit, they deserve to be told they're spewing bullshit. And more importantly, people on the receiving end of said bullshit deserve to be told it's bullshit as well.
[+] arctangent|14 years ago|reply
It's a stylistic flourish and is likely not representative of how he behaves in person. The same can be said for many journalists.
[+] justin_vanw|14 years ago|reply
Ok, whenever someone opens with something to the effect of 'I use MySQL, so I have experience with relational databases and can make a comparison with NoSQL' all credibility is lost.

MySQL is a 'relational database', but one in which JOIN is so expensive and poorly optimized that you almost have to use it as a key-value store, looking everything up directly with synthetic primary keys.

I've had this discussion several times. Some startup guys say 'we should look at NoSQL', and I ask questions to get to the bottom of why they think that. They will say something like 'we have this huge join we have to do, but it's too expensive, so we pre-compute it'. I ask more questions, and the 'huge join' is not huge at all, in fact it is just a reasonable join, something that you could expect to do on every page view without difficulty. Well, except they are using MySQL, and it can't join for shit. The MySQL query planner is disgusting.

So, although I don't expect to persuade the world to stop using MySQL (to be honest, I love that it is the go-to thing, those of us who use a decent database like PostgreSQL end up with a huge competitive advantage, better performance, more features, more scalable, amazing query planner, top shelf performance analysis), I think we should at least admit that in practice, to get any performance out of it, you have to effectively use it as a key-value store anyway. And when comparing MySQL, which is a shitty key-value store, against real key-value stores, you can make a case for some NoSQL thing.

[+] jister|14 years ago|reply
I have to agree with one of the comments. All you did was rant and didn't say something useful. Perhaps you can tell your readers about your experiences so that you can convince them that NoSQL is useful (Of course, I am NOT saying it isn't) to implement in their projects?
[+] unknown|14 years ago|reply

[deleted]

[+] jerrya|14 years ago|reply
I did find this point from the original article to be very dubious:

> In fact, I would argue that starting with NoSQL because you think you might someday have enough traffic and scale to warrant it is a premature optimization, and as such, should be avoided by smaller and even medium sized organizations. You will have plenty of time to switch to NoSQL as and if it becomes helpful. Until that time, NoSQL is an expensive distraction you don’t need.

Consider:

- how hard most organizations find it to refactor, rewrite, retest, especially in systems that are online 24x7

- when would you prefer to climb the learning curve with an immature technology, when you are small and starting out, or when you are a large company with a large set of users and under "mission critical" constraints (and possibly stockholders and the like.)

My guess is that ongoing companies find it extremely difficult and expensive (and wanting for talent) to switch from one sql database to another, much less switch from sql to nosql.

[+] wfarr|14 years ago|reply
It's not a matter of SQL vs. NoSQL. They are complementary.

The fact of the matter is, there are some components of systems for which Redis, Riak, etc may be better suited than SQL in the long term. Starting out, keeping everything in SQL provides less friction, but as time goes on it may be necessary to scale the component separately from typical relational data storage, and that's the point at which these switches are evaluated. These companies would be replacing SQL only in these components — not across the board.
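One way to keep that later switch cheap is to put the component behind a narrow interface from the start. The sketch below is only an illustration under assumptions: `SessionStore`, `SqlSessionStore`, `DictSessionStore`, and `remember_login` are invented names, SQLite stands in for the relational database, and a plain dict stands in for a dedicated key-value store such as Redis or Riak (no real client library is used).

```python
import sqlite3
from typing import Dict, Optional, Protocol


class SessionStore(Protocol):
    """The only surface the rest of the application depends on."""

    def save(self, session_id: str, payload: str) -> None: ...
    def load(self, session_id: str) -> Optional[str]: ...


class SqlSessionStore:
    """Starting point: sessions live in the same SQL database as everything else."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )

    def save(self, session_id: str, payload: str) -> None:
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO sessions (id, payload) VALUES (?, ?)",
                (session_id, payload),
            )

    def load(self, session_id: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT payload FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return row[0] if row else None


class DictSessionStore:
    """Stand-in for a dedicated key-value store swapped in later."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, session_id: str, payload: str) -> None:
        self._data[session_id] = payload

    def load(self, session_id: str) -> Optional[str]:
        return self._data.get(session_id)


def remember_login(store: SessionStore, session_id: str, user: str) -> None:
    """Application code: unaware of which backend it is talking to."""
    store.save(session_id, user)
```

Only this one component changes backend; the relational data elsewhere in the system stays where it is.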

The myth of a silver bullet datastore solution is just that: a myth. Different data stores have different strengths and weaknesses and it becomes necessary to mix and match at scale.

To quote Benjamin Black: "Scale is pain, princess. Anyone who tells you different is selling something."

[+] jzawodn|14 years ago|reply
You hit the nail right on the head.
[+] mattmanser|14 years ago|reply
When you're small and starting out is definitely not the time to be mucking around on a learning curve.

Learning a new tech in a startup is doing a lot of 'busy' work that is only beneficial to you as you're learning something, it doesn't benefit the business, it slows it down. You're also more likely to make fundamental mistakes in your implementation as you don't know the tech.

And switching when you're running is not as hard as you'd think, as you already have the domain knowledge of how the solution actually needs to work.

It's all a balancing act, if the new tech is a fundamental selling point (for example your program's 10x faster than incumbents) I can understand it. If it's to deal with future scalability problems, well, that'll be a good problem to deal with later.

[+] flocial|14 years ago|reply
This is opinion versus opinion. I'm sorry to say there's no real content here. The author went from Yahoo to Craigslist so there's no such thing as premature optimization at that scale and with the small staff at CL you can be sure that chasing NoSQL as a fad can ruin the company. Obviously he doesn't fit the bill of the essay he's criticizing but most devs don't experience the scale of his problems.

You can't do the topic of NoSQL vs SQL justice with an essay because it would just be semantic, we're talking about a different theoretical representation of data structure. You might as well scream "better taste!", "Less filling!".

[+] jzawodn|14 years ago|reply
Agreed. This is opinion.

Mine is based on years of experience, but I got the impression that the original article was written based on some cherry-picked reports of what a few companies said (as opposed to actually being there and doing it).

Maybe I'm being overly critical?

[+] LeafStorm|14 years ago|reply
One thing that bothers me is people who talk about "SQL databases vs. NoSQL databases." That's like framing a debate on transportation as "Cars vs. Not Cars," where "Not Cars" includes bicycles, planes, buses, subways, boats, zeppelins, etc. etc.

If you take CouchDB, Redis, MongoDB, and all the other "NoSQL" databases and compare them, the only thing they share in common is that they do not use a relational data model or SQL. The way the word "NoSQL" is used, however, implies that they are some kind of united front against SQL databases, which is not the case at all. (It's why I am not a big fan of the term.)

Just like you would not use bicycles, planes, subways, and boats for the same things, you would not use CouchDB, Redis, MongoDB, and Cassandra for the same things. If you're choosing a database just because it's "NoSQL," then you are completely missing the point.

[+] mitchty|14 years ago|reply
I think the problem is the term NoSQL itself, which was originally coined as "Not Only SQL". But everyone now reads the "No" as the actual word "no" in relation to SQL, as if there were some war between SQL and not...SQL. I think that alone is causing more heartburn than needed between the two camps.
[+] jpterry|14 years ago|reply
Firstly, I can attest that migrating the datastore of an application which has scaled to require a NoSQL solution is no trivial task.

Secondly, I believe the author of the original posting really meant that "premature optimization is the root of all evil." Like this post points out, NoSQL solutions vary wildly in their abilities and usefulness. A relational database is a good place to start on the path to an MVP. And if you need features that a NoSQL solution can provide, and you understand the problem you're trying to solve, then use a NoSQL solution.

[+] jzawodn|14 years ago|reply
I think that most people who argue that such a migration "isn't that bad" haven't actually done it. Or at least they haven't done it for anything sizable.
[+] swampthing|14 years ago|reply
Obviously this doesn't really have any bearing on points the author is making, but a small nit for posterity's sake - I think the point Clayton Christensen was making in The Innovator’s Dilemma was not that people should adopt inferior technologies to gain leverage later.

I think the point in that book was more that new technologies are often inferior in many ways to existing technologies when they first start out, and the way these new technologies survive/grow is by appealing to niches that value the existing ways in which the new technology is superior. Then, when the new technology matures a little more, the market to which it appeals grows a little larger, and this repeats.

[+] jhawk28|14 years ago|reply
The problem is that NoSQL is such a broad term for datastores. Some of them are simple (like Redis) and some more complex (like Cassandra/HBase). They also have different targets for data types. Using one just because it is NoSQL can be a premature optimization, just like using an RDBMS can be a premature optimization. You really need to understand the data and how it will be used. Before you know what you want to build, it is easy to prematurely optimize for something you don't need.

Start simple, then iterate...

[+] tapvt|14 years ago|reply
Undertaking "optimization", in this case selecting and developing with a NoSQL datastore early in the process, should only be considered premature if the costs of doing so (which will be mainly represented by developer-hours spent) are greater than the value provided by having a datastore that can accommodate well the needs of the application itself, development team, and end-users.

Adaptability, flexibility (with regard to schema/key structure migration and maturation), as well as ease of partitioning data intelligently ahead of demand are all hugely important factors that can and often should inform the process of selecting a datastore.

If the datastore selected for use:

- shortens development time,
- provides improved performance for anticipated scale,
- better represents the data model needing to be captured,
- avoids re-work and "post"-mature optimization of data models & datastores,
- or accomplishes any combination of the above ...

... then the selection of that datastore should not be considered premature optimization.

Finding that your traditional RDBMS does not well support the data models you have developed, especially once the product is out of the gate, will not be fun. Having to engage in a refactor and data migration to move to a more appropriate or more performant datastore will be a time- and resource-consuming process.

As soon as the initial synthesis phase of development can begin, it may be well worth the effort to experiment with multiple datastores as a means of evaluating their performance and suitability. Depending on the scope and potential for the project to scale, modularizing distinct pieces of core functionality into separate services, each with their own most-suitable datastore, can also provide great benefit in flexibility of development processes, as well as adaptability of the product to the demands of the end-users.

[+] antirez|14 years ago|reply
When there are arguments, as in Jeremy's post, commenting on tone and formalities is a huge FAIL. Everyone is free to use the words and tone he wishes as part of his expression, as long as no one is going to be offended (if you are super-sensitive, that is your problem). One thing I always feel is a problem is that the programming community here on HN is a bit too middle-class-ish, and this is annoying: you are off topic, you are not polite, respect the fact I don't understand, blablabla. Hacking is, in my view, connected with cultural freedom, and not being polite is one of its possible expressions, though not the only one. So reply to the arguments and stop being so childish.
[+] Devilboy|14 years ago|reply
He's only experienced with MySQL? How can he judge the SQL vs NoSQL battle when he's never used a proper SQL system? NoSQL does not 'save development time' in general, it's just a different tool. A much younger and less refined one at that. Real RDBMSs do a whole lot more than execute your SQL queries for you.
[+] bad_user|14 years ago|reply
I don't think there's a SQL versus NoSQL battle, whatever that means.

SQL refers to relational databases, which are databases using the "relational model" of representing data: http://en.wikipedia.org/wiki/Relational_model

This means that any SQL database is very flexible with regard to what you can store in it, not to mention that it is based on proven theory and battle-tested implementations of various features, like ACID.

But the relational model also breaks heavily when wanting to work with data structures that don't blend well -- like graph data. It also breaks down heavily when you want to spread your data across many servers. It is also not well suited to storing and querying billions of records -- sooner or later, your indexes are going to go beyond whatever storage / RAM capacity your servers have.

Btw, MySQL is a real RDBMS. Even if it lacks some features, it doesn't lack anything essential to calling it "relational" and talking about the advantages or disadvantages of RDBMSs versus key-value stores or other NoSQL types.

[+] flocial|14 years ago|reply
He spearheaded the adoption of MongoDB and probably Redis at Craigslist. That's more action than most commenters will see in their entire careers.
[+] alnayyir|14 years ago|reply
Proper SQL? Nice No True Scotsman.

Everybody has a pet feature in their preferred SQL DB that they think makes it a "real" SQL database, Postgres and Oracle people in particular. I agree that MySQL is a bit janky, but get real.

[+] i_crusade|14 years ago|reply
"Again, I think we need to talk about the best tool for the job, not the best tool for every job. Relational databases are not the best tool for every data storage job."

Pretty much disqualifies him as moron. Hell, he doesn't say anything.

[+] burgerbrain|14 years ago|reply
Does this mean that you think that relational databases are the best tool for every data storage job?