
Denormalization for Performance: Don't Blame the Relational Model

70 points | sgeneris | 9 years ago | dbdebunk.com

62 comments

[+] EGreg|9 years ago|reply
In my experience, denormalization of a relational model is a special case of "caching". What you are really doing is caching data where it is most likely to be needed at access time.

The overview is like this:

1) You need some operation to have low turnaround time / latency

2) So you maintain and update a cache, typically while writing to the data store.

3) Like all caches, you can either invalidate it before changing the data, slightly harming availability, or you can have it lag behind (eventual consistency).

So the article is actually wrong: you don't need to trade consistency for performance. You can lower read latency without losing consistency by keeping a cache (the denormalized data) and invalidating the relevant cache entries BEFORE writing the data. This adds some write latency, though not necessarily reduced write throughput, and typically we care far more about read latency than write latency.
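A minimal sketch of the invalidate-before-write pattern described above, using Python with an in-memory SQLite database (the table, the `order_total`/`add_order` helpers, and the data are all hypothetical, just to illustrate the read path / write path split):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")

cache = {}  # customer -> cached total (the "denormalized" value)

def order_total(customer):
    # Read path: serve from cache, fall back to the normalized query.
    if customer not in cache:
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
            (customer,)).fetchone()
        cache[customer] = row[0]
    return cache[customer]

def add_order(customer, amount):
    # Write path: invalidate BEFORE writing, trading a little write
    # latency for read consistency.
    cache.pop(customer, None)
    conn.execute("INSERT INTO orders VALUES (?, ?)", (customer, amount))
    conn.commit()

add_order("alice", 10)
assert order_total("alice") == 10
add_order("alice", 5)   # invalidates first, so the next read recomputes
assert order_total("alice") == 15
```

Because the invalidation happens before the insert, a concurrent reader either sees the old cached total or recomputes a fresh one; it never serves a stale entry after the write commits.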

[+] idbehold|9 years ago|reply
Well you clearly have cache invalidation figured out. Any tips on naming things?
[+] xiphias|9 years ago|reply
If you write often, another way is to use triggers. They can be extremely useful for this kind of scenario.
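A sketch of the trigger approach, again with Python's `sqlite3` (the `posts`/`post_counts` schema is hypothetical): the triggers keep a denormalized per-author count in sync on every write, so reads never recompute the aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT);
CREATE TABLE post_counts (author TEXT PRIMARY KEY, n INTEGER NOT NULL);

-- Keep the denormalized count in step with every insert/delete.
CREATE TRIGGER posts_ai AFTER INSERT ON posts BEGIN
    INSERT OR IGNORE INTO post_counts VALUES (NEW.author, 0);
    UPDATE post_counts SET n = n + 1 WHERE author = NEW.author;
END;

CREATE TRIGGER posts_ad AFTER DELETE ON posts BEGIN
    UPDATE post_counts SET n = n - 1 WHERE author = OLD.author;
END;
""")

conn.execute("INSERT INTO posts (author) VALUES ('alice')")
conn.execute("INSERT INTO posts (author) VALUES ('alice')")
count = conn.execute(
    "SELECT n FROM post_counts WHERE author = 'alice'").fetchone()[0]
assert count == 2
```

Unlike application-side caching, the trigger runs inside the same transaction as the write, so the denormalized count can never drift from the base table.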
[+] perfunctory|9 years ago|reply
"Well implemented true RDBMSs..." exist only in the weird fantasy world of relational purists. In that perfectly normalized world you just wave your magic wand and all performance problems disappear.
[+] kpmah|9 years ago|reply
I think most are pragmatic and admit you may have to denormalise for performance (e.g. I've read this in C.J. Date's books). I interpret their point as: the need for denormalisation is not a flaw in the relational model but a flaw in its implementations. The relational model does not specify an on-disk format.
[+] sgeneris|9 years ago|reply
They are a fantasy only because users do not grasp the advantages of the RDM and confuse them with SQL, thus robbing vendors of any incentive to produce true RDBMSs.

If users are so accepting of denormalization as a solution to performance and are unaware of all its costs, why should vendors bother to come up with true RDBMSs that obviate the need to denormalize?

As long as users do not understand the RDM and its benefits, true RDBMSs will remain a fantasy. Ignorance has never been conducive to progress.

[+] eecc|9 years ago|reply
I can imagine an optimizer that, having analyzed and simplified the query plan, notices that certain attributes bubble up all the way to the final projection. Coupled with some statistics, it's entirely possible for a DBMS to maintain a precomputed partial-result cache.
[+] jasode|9 years ago|reply
Responding to several comments. (The _underscores_ in the following quotes are mine for emphasis.)

>kpmah: I think most are pragmatic and admit you _may_ have to do denormalise for performance

>calpaterson: So _rare_ to actually have to denormalise for performance today though.

The article's author didn't use hedges like "may"/"rare", which would leave some wiggle room for the cases of denormalization required in real-world implementations. Instead he used absolute qualifiers such as "never", "no reason", "any":

- the "additional development costs" that Bolenok refers to -- but they would _never_ be justified:

- "consistency-performance tradeoff [...], there is _no_ reason to expect _any_."

The author writes his advice in abstract terms instead of discussing concrete "case studies", so we are left to speculate what mental model of the database world he holds when he insists on strict rules of normalization and relational purity. Based on the topics in his papers[1], I'm guessing his world consists of a single OLTP database. E.g., you develop a non-cloud restaurant reservation & POS system with a single-instance database. Yes, you don't need any denormalization hacks in that scenario.

But for other problems such as distributed databases, you can't do joins across 2 geographically separated data centers. (Well, you theoretically could do it but the slow nested-loop performance across the WAN would make it unusable for a real-world application.) Some duplication via data denormalization is required and no "sufficiently smart db engine"[2] can automatically optimize for it. An application architect has to manually make that design decision and live with the deliberate tradeoffs. (E.g., batch jobs now have to be run to periodically keep databases at different datacenters in sync.) There are many real-world scenarios that require denormalization which have nothing to do with a junior programmer's lack of SQL knowledge to join 20 tables to populate an "edit customer" data entry screen.

[1]http://www.dbdebunk.com/p/papers_3.html

[2]riff on the theoretical "Sufficiently Smart Compiler" to solve all performance optimizations so there's never a need to write performance-specific language syntax or switch to a "faster" language... because as we all know... "languages" are not fast or slow -- it's the "implementations" that are fast or slow. That Ruby is not as fast as C/C++ is an implementation detail (SSC) and not the issue of the language syntax.

[+] exmicrosoldier|9 years ago|reply
I find zero explanation of how to solve performance with a relational model.

As I understand the article, it seems to say...just because all the existing databases you have seen suck at performance when normalized doesn't mean normalization can't be fast.

[+] julochrobak|9 years ago|reply
There are several basic concepts you can apply to improve performance in the RDBMS and still avoid denormalization. For example:

* use as many constraints as possible (this helps the query optimizer)

* use indexes which bring better performance in your use case (e.g. bitmap join index or even index-organized tables)

* apply table/index partitioning

* use materialized views as a query result cache
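A sketch of the last bullet. SQLite has no native materialized views, so this hypothetical example emulates one with a summary table that a `refresh_mv()` helper rebuilds on demand: reads hit the precomputed table, and the base tables stay fully normalized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 50), ("west", 70)])

def refresh_mv():
    # Recompute the "materialized view" from the normalized base table.
    conn.executescript("""
    DROP TABLE IF EXISTS mv_sales_by_region;
    CREATE TABLE mv_sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    """)

refresh_mv()
total = conn.execute(
    "SELECT total FROM mv_sales_by_region WHERE region = 'east'"
).fetchone()[0]
assert total == 150
```

In DBMSs with real materialized views (e.g. Oracle, PostgreSQL) the refresh can be incremental or automatic; the point is that the cache lives inside the DBMS rather than in the logical schema.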

[+] sgeneris|9 years ago|reply
If you are looking to solve performance "with a relational model" then you do not understand physical independence and the relational model. This is exactly what the article explains you should not do.

You mean DBMSs, not databases. Yes, that's the argument, but it's precisely this kind of lack of understanding that prevents better RDBMSs.

[+] haddr|9 years ago|reply
Very nice article, but the claims are far from what is practiced in the industry nowadays (see all the companies working with tons of data and standing up big-data solutions).

So, while it might be perfectly true it somehow misses some big point.

[+] sgeneris|9 years ago|reply
No, I think you missed its point.

I dare you to demonstrate that the "big data solutions" guarantee logical and semantic correctness the way true RDBMSs do. Without that, it's garbage in, garbage out. But because those "solutions" are so complex that nobody understands them, including those who designed them, their results are BELIEVED to be correct, which is not the same thing as being correct.

[+] hvidgaard|9 years ago|reply
I have never worked at true "web scale", so my experience may or may not apply to such scenarios. I tend to make a more or less fully normalized structure, and introduce aggregated data outside of the core database. This essentially introduces eventual consistency for performance-critical data. The aggregates can be stored in a database, or in a cache layer in front of the database if they are accessed frequently enough to stay off the disk.
[+] woliveirajr|9 years ago|reply
Good luck trying to show the number of likes in twitter, instagram, face, Kardashian's tinder...

Any query that needs to compute a result over many rows benefits from storing pre-calculated values, with a trade-off in precision, correctness, and so on.

Might apply to your business or not.

[+] sgeneris|9 years ago|reply
What does it have to do with normalization?
[+] RmDen|9 years ago|reply
As the saying goes... Normalize till it hurts..then denormalize till it works :-)
[+] sgeneris|9 years ago|reply
What do you know about the hurts of denormalization? Can you even specify what they are?
[+] sgeneris|9 years ago|reply
One more thing: As these comments demonstrate, there is almost exclusive focus on performance, but nobody considers the drawbacks (cost) of denormalization. Practitioners are oblivious to them.
[+] thom|9 years ago|reply
What systems that are usable today come closest to the Platonic ideal of RDBMSs?
[+] calpaterson|9 years ago|reply
Today most RDBMSs have good query planners, even recent versions of MySQL. In my experience you very often get improvements in performance by normalising (to BCNF). Most of the benefit comes from a) narrow tables reducing the physical size of the working set (you can hold more of it in memory), b) narrow tables handling writes quicker (more rows per page), and c) more opportunities for covering indices.
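Point (c) is easy to observe even in SQLite. In this hypothetical sketch, the query asks only for columns covered by the index (`id` is the rowid, which every SQLite index carries), so the plan reports a covering-index search: the narrow index is read and the wide table row is never touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, bio TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
# The detail column of the plan shows the query is satisfied from the
# index alone, without fetching table rows.
assert any("COVERING INDEX" in row[-1] for row in plan)
```

Add `bio` to the select list and the plan degrades to an index search plus a table-row lookup, which is exactly the working-set effect described above.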