Runaway complexity in Big Data... and a plan to stop it

[+] confluence|13 years ago|reply

We invented RDMS databases to store large amounts of important data safely on extremely constrained storage hardware (kilobytes were expensive in the 1970s) where you could not economically store redundant copies due to astronomical cost.

Append only immutable DBs are inherently better just because they essentially add a 4th dimension to your data (all history, all data, all the time), and hook up to the fact that Von Neuman machines love splitting things up into constrained sub-streams/problems and crunching through the entire data set in memory across clusters using discrete memory frames/units of work (think Google search index/GPU framebuffers/Integer arithmetic).

Storage is now unlimited and random seeks are very expensive. The more you serialize your processing and split them into in memory units that utilize cache locality - the faster you'll perform. You'll turn performance problems into throughput problems - all you have to do is to keep the data flowing.

RDMS databases are dead for the same reason I don't manage my YouTube bandwith use or bother managing files or bother managing RAM - I have cable now with terabyte hard disks and 32-64 GBs of RAM.

The future is immutable with periodic/continuous historical stream compaction (pre-computation) with queries running across hundreds of clusters and reducing all searches to essentially linear map-reduce (sometimes with b-trees/other speed ups) + a real time dumb/inaccurate stream layer.

1 machine cannot store the Internet - a million machines can. 1 machine cannot process the Internet - but a million machines operating on one 64GB slice of it located in RAM and operating with cache friendly memory-local discrete operations can run through the Internet millions of times a second.

[+] fauigerzigerk|13 years ago|reply

Just a few facts that I have to deal with:

1) I can't afford a million machines or hundereds of clusters or keeping petabytes of data around in a way that doesn't make the data completely useless. Yes mutable state causes complexity, but I can't afford to rid myself of that problem by using brute force.

2) Performance is not replaceable by throughput if you have users waiting for answers on questions they just dreamt up a second ago and you have new data coming in all the time.

3) Cache locality and immutability don't go well together. Many indexes will always have to be updated, not just replaced wholesale with an entirely new version of the index.

[+] pbailis|13 years ago|reply

"Conflating the storage of data with how it is queried" is a misinformed criticism of the relational model. Codd's Relational Algebra (Turing Award material) was in large part a move towards data independence, relaxing what was previously a tight coupling of storage format and application access patterns. Take a look at Rules 8 and 9, or just read the original paper: http://en.wikipedia.org/wiki/Codds_12_rules

> "Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed." http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

It's not clear from the slides how RDBMSs manage to conflate these two concerns in practice.

It's also unclear to me how the "Lambda Architecture" differs from what we've been calling [soft real-time] materialized view maintenance for decades.

[+] nathanmarz|13 years ago|reply

It's not a criticism of the relational model. It's a criticism of the relational database as they exist in the world. The normalization vs. denormalization slides explain how these concerns are conflated. The fact that you have to denormalize your schema to optimize queries clearly shows that the two concerns are deeply complected.

[+] ebiester|13 years ago|reply

Perhaps my DBA-fu is limited, but what databases are doing soft real time materialized view maintenance? The big boys seems to do recomputation for materialized views rather than updating the materialized view on insert or update. I tried to google for this, but saw only research papers rather than implementation.

I know that in places I worked, Materialized views were mostly limited in the application realm because too many of them over enough data brought the DB to its knees.

[+] bcoates|13 years ago|reply

Has anyone come up with a good explanation for why, given the recent proliferation of databases, nobody seems to be making anything more relational than the SQL DBMSes?

It should be easier than ever, particularly if you cheat and declare rotational media an unsupported legacy format.

[+] ebiester|13 years ago|reply

I'm interested enough in the idea to pick up the (partially-completed) book, but I wonder if this isn't far too complex of a solution for 95% of cases. By the time that your system is in the 5% of cases, you're looking at a massive rearchitecting effort.

Of course, to be fair, you're probably already looking at a rearchitecting effort at this point!

However, looking at the architecture diagram, much of the complexity could be hidden behind the scenes with a devoted toolset built on top of Postgres, rather than trying to cobble together Kafka, Thrift, Hadoop, and Storm. Sometimes, one big tool beats a lot of small tools.

For that matter, I wonder if a series of Postgres-XC (or Storm) servers couldn't do the same thing without learning a series of complex tools.

Step 1: Everything goes through a stored procedure. Deletes, updates, and creates are code generated through a DSL. Migrations would be a pain under this system, but mitigated by the fact that the complexity is being handled by the tool.

This stored procedure then writes to the database of record and then sends a series of updates to, essentially a system of real-time materialized views that serve the same purpose as the standard NoSQL schema.

The lambda architecture purposes would still be fulfilled with lower data-side complexity. After I read through Big Data I'll revisit this with a more nuanced view, but I wonder if hadoop and nosql really give you anything.

[+] spenrose|13 years ago|reply

I really like this. Has some overlap with Command-Query Responsibility Segregation, which I have found very useful: http://www.udidahan.com/2009/12/09/clarified-cqrs/

[+] dm3|13 years ago|reply

There was an interesting production-grade event store released recently which fits the described use-case: http://geteventstore.com

Basically, you store all of the data as events (event sourcing) and create separate query views (projections) which are populated from event streams.

[+] nasmorn|13 years ago|reply

Interestingly I am just building a complete analytics setup for my ecommerce site, since Google Analytics doesn't give me everything I want to track. I looked at different things like statsd and redis and whatnot but realized that my data is not actually big. I will just put everything in Postgres and will be able to do that on cheap hardware even when I grow tenfold. Since revenue per pageview is high as in multiple cents there will never be an unsolvable performance issue.

tldr - the hand me down from the 90s can be good enough if you are not big yet

[+] markelliot|13 years ago|reply

Interestingly the chart at the end suggests implementations for everything except the raw data, but generating batch views from an arbitrary store and keeping your raw data reliably are hard problems. The charts motivate dumping this stuff in Hadoop (or some distributed file system), but any reliable store would do. (@nathanmarz: would love recommendations)

[+] nathanmarz|13 years ago|reply

A distributed filesystem, such as HDFS or MapR, is ideal for the master dataset.

[+] lmm|13 years ago|reply

This is a reasonable architecture, but it introduces complexity of its own - you have to implement all your view creation logic in two places, which should go against every programmer's instincts. If you have a system capable of streaming all the events and writing out the reporting views, why not use it for everything.

At my previous^2 job, we had a distributed event store that captured all data and could give you all events (or events matching a very limited set of possible filters) either in a given time range, or streaming from a given time onwards. For any given view we'd have four instances of the database containing it, populated their own streaming inserters; if we discovered a bug in the view creation logic, we'd delete one database and re-run the (newly updated) inserter until it caught up, then repeat with the others (queries automatically went to the most up to date view, so this was transparent to the query client - they'd simply see bugged data (some of the time) until all the views were rebuilt).

The events system guaranteed consistency at the cost of a bit of latency (generally <1 second in practice, good enough for all our query workloads); if an event source hard-crashed then all event streams would stop until it was manually failed (ops were alerted if events got out of date by more than a certain threshold). This could also happen if someone forgot to manually close the event stream after taking a machine out of service (but at least that only happened in office hours); hard-crashes were thankfully pretty rare. Rebuilding the full view after discovering a bug was obviously quite slow, but there's no way to avoid that (and again this was quite rare).

In use it was a very effective architecture; we handled consistency of the events stream in one place, and building the view in another. We only had to write the build-the-view code once, we only had one view store and one event store to maintain. And we built it all on mysql.

[+] blaines|13 years ago|reply

I like this idea but my first impression is it sounds like a lot of work to setup and scale. Am I wrong?

Secondarily, if this were a service, wow!

[+] siganakis|13 years ago|reply

It does seem like a lot of work to me.

One of the problems is servicing two workloads, one for "transactional" processing and another for analysis.

For transactional systems you need to be able to change things quickly and consistently. For analysis you need to be able to query lots of data quickly.

For decades people have realized that these are two separate workloads, so have built 2 systems, a transactional system (on an RDBMS) and a data warehouse (generally on an RDBMS). Data is then shipped between the 2 in batch jobs.

The transactional system is normalized, and the data warehouse is normalized. Within the data warehouse you make denormalized copies of the data that fit the reporting workload required so that as much of the workload is pre-computed as possible.

The problem is that as reporting requirements change, you need to modify these pre-computed stores, as they are very heavily tuned for the particular reporting requirement. Building pre-computed stores is generally done in SQL and can be challenging as you are generally shifting a lot of data and you are trusting the RDBMS to get its optimizations right.

There is a trend now to use Hadoop for the building of these pre-computed stores (and even to use Hadoop for the entire data warehouse). However, writing map-reduce jobs for queries is cumbersome compared to SQL so your productivity suffers. But you don't have to pay Oracle or IBM for licenses.

The key problem is that you need to pre-compute stuff to do fast aggregation, but you can't pre-compute everything. So what you pre-compute is dependent on what your users want, and that changes all the time.

So what you want is a system that lets you change what is pre-computed easily and efficiently.

[+] wellpast|13 years ago|reply

The only novel part of this seems to be that you issue realtime queries over the batch view AND realtime view together; otherwise I don't see what else is interesting here. And I think that novelty is unneeded (and therefore only additional complexity).

Why don't you just: 1) Master your data in "normal form" (graph/entity-based data models like Datomic or Neo4j are perfect for this) in a scalable database (e.g., Cassandra) 2) Use Hadoop for offline analytics 3) Index all of the data needed for realtime queries in a search engine (e.g., elasticsearch), smartly partitioning the data for performance

As for "schema" changes -- you don't need them in an entity-based model because you can always be additive with the attributes in your schema. (You may wish to have a deprecation policy for some attributes but doing this allows you to as slowly as you'd like update any applications and batch jobs that depend on deprecated attributes.)

[+] nathanmarz|13 years ago|reply

It all comes down to Query = Function(All data). The architecture I describe is a general approach for computing arbitrary functions on an arbitrary dataset (both in scale and content) in realtime. Any alternative to this approach must also be able to compute arbitrary functions on arbitrary data. Storing data in Cassandra while also indexing it in elastic search does not satisfy this, and it also fails to address the very serious complexity problems discussed in the first half of the presentation of mutability and conflating data with queries.

The Lambda Architecture is the only design I know of that:

1) Is based upon immutability at the core.

2) Cleanly separates how data is stored and modeled from how queries are satisfied.

3) Is general enough for computing arbitrary functions on arbitrary data (which encapsulates every possible data system)

[+] gbvb|13 years ago|reply

The interesting part about timestamping change to records is essentially "effective dating" and it is bread and butter in any transactional system that needs to record employee info. This is done varying degrees to columnar values or entire rows themselves. In an RDBMS, you will create a compound key to allow you to know the history and the current record. I guess the new fangled NoSQL will make it easier to store it easily..?

[+] NDizzle|13 years ago|reply

I don't have a critique for the content yet, and HN is usually the place to post this kind of stuff so here goes.

My initial reaction is to close the window immediately due to the font on the slides. I don't know if that is the fault of my browser (chrome 21.x on this PC), slideshare, or the presenter who put together the slides, but the staggered characters drive me bonkers.

[+] Jare|13 years ago|reply

My bet is on Chrome's sometimes clumsy font rendering. Go fullscreen on the presentation and the characters will render fine.

Edit: scratch that - each letter shows up as a separate span in the generated HTML. There's no sane way to typeset that.

[+] ChuckMcM|13 years ago|reply

It would be useful if there was a speaker narrative, it would be better still if there was a white paper that this presentation was just summarizing.

What I got out of it was "Just store the raw data", always compute results/queries, BigTable and Map/Reduce are cool. I felt like I missed something perhaps someone here can help.

[+] nathanmarz|13 years ago|reply

The first chapter of my book discusses a lot of what's in the presentation http://manning.com/marz/BD_meap_ch01.pdf

[+] bcoates|13 years ago|reply

I'm not sure how to square the normalized schema with the immutability; normalization implies schema changes which imply data changes, no?

In particular, what do I do when there is erroneous historical data that violates the new schema (the newly discovered constraint that was there all along)?

[+] siganakis|13 years ago|reply

Essentially they end up rebuilding the "transaction log" that most RDBMS's use to commit data safely.

Before any data is written to the "data" files of the database, it is written to a "log" file sequentially. Once that has succeeded, it will write to the data file, then write to the log file again to say that the write was successful.

That way if the database breaks during the write, the system knows about it. It's also useful because it allows you to restore the database to a point in time, as you have a full log of all the transactions run (and you know how to un-run them).

In this case, they just make the transaction log a little more accessible.

For your example, erroneous historical data would remain in the system but with a new record being added indicating that at a certain date and time the old record was replaced with the new one.

Basically if you think of a store processing a purchase for $100 and refund of $100 as 2 transactions (for $100 and -$100) rather than simply deleting the first transaction.

This is important because things may have happened between the deposit of money and its refunding (e.g. interest payments).

[+] lmm|13 years ago|reply

You have options. You can apply the schema restriction only to new inserts (since you have full control over where it's applied - the "all data" store is generally not an SQL DB). You can "retcon" the old data (and knowing when to violate immutability is what separates the good from the great). You can make a new "table", and put the logic for dealing with the difference between old and new formats in the batch processor.

[+] calibraxis|13 years ago|reply

Does anyone know what his NewSQL critique was? (The "“NewSQL” is misguided" slide. BTW, I've read the released chapters of his _Big Data_ book, so I know a fair bit of the slides' context.)

[+] nathanmarz|13 years ago|reply

NewSQL inherits mutability and the conflation of data and queries. It's no longer 1980, and we now have the capability of cheaply storing massive amounts of data. So we should use that capability to build better systems that fix the complexity problems of the past.

[+] DanWaterworth|13 years ago|reply

> MapReduce is a framework for computing arbitrary functions on arbitrary data.

No, MapReduce is a framework for computing catamorphisms over trees.

[+] Evbn|13 years ago|reply

I think he meant catamorphisms of arbitrary functions over trees of arbitrary data. As in, you can apply any function at the leaves, and the leaves can be any type.

[+] danso|13 years ago|reply

The OP is an engineer at Twitter...doesn't Twitter use NoSQL to store tweets? I know your critique of NoSQL is that it is an overreaction to the pain of schemas, so I'd like to see some discussion on how to implement lambda with data architecture that, at some point, uses document databases.

Thanks for the post...slideshare is annoying to read (on an iPad) but the content was digestible. Mostly, it helped remind me to download the update of the OP's book

61 comments