top | item 39876698

mccanne | 1 year ago

Necessity is the mother of invention. MapReduce-based systems were developed because the state-of-the-art RDBMS systems of that age could not scale to the needs of the Googles/Yahoos/Facebooks during the phenomenal growth spurt of the early Web. The novelty here was the tradeoffs they made to scale out and up using the compute and storage footprints available at the time.

"We thought of that" vs "we built it and made it work".

dekhn|1 year ago

MapReduce was never built to compete with RDBMS systems. It was built to compete with batch-scheduled distributed data processing, typically where there was no index. It was also built to build indices (the search index), not really to use them during any of the three phases. And it was built to be reliable in the face of cheap machines (bad RAM and disk).

Google built MR because it was in an existential crisis: it couldn't build a new index for the search engine, and the freshness and size of the index were important for early search engines. The previous tools would crash part-way through due to the cheap hardware that Google bought. If Google had based search indexing on an RDBMS, it would not exist today.

Now, Google did use RDBMSes: it used MySQL at scale. It wasn't unheard-of for mapreduces to run against MySQL (typically doing a query to get a bunch of records, and then mapping over those records).

I worked on MapReduce later (long after it was mature), when it used all sorts of tricks to extend the MapReduce paradigm as far as possible, but ultimately nearly everything got replaced with Flume, which is effectively a computational superset of what MR can do.

I think the paper must have been pulled because Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at. See the original paper for what they proposed as good use cases: counting word occurrences in a large corpus (far larger than the storage limits of Postgres and others at the time), distributed grep (without an index), counting unique items (where the number of items is larger than the capacity of a database at the time), reversing a graph (converting (source, target) pairs to (target, [source, source, source])), term vectors, inverted indexes (the original use case for building the search index), and distributed sort. None of the RDBMSes of that day could handle the scale of the web. That's all.
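For concreteness, the word-count use case from that list can be sketched as a toy, single-process version of the three phases. This is only an illustration of the programming model (all function names here are made up), not Google's distributed implementation:

```python
from collections import defaultdict

def map_words(doc):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce phase: sum the counts for one word."""
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (pair for doc in docs for pair in map_words(doc))
counts = dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 3, counts["fox"] == 2
```

The point of the real system was that the map and reduce steps ran on thousands of cheap machines, with the framework handling partitioning, shuffling, and restarting work lost to hardware failures.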

makmanalp|1 year ago

> Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at

I like this comment because it gets to the heart of a misunderstanding. I'd further correct it to say "for something it wasn't trying to be good at". DeWitt and Stonebraker just didn't understand why anyone would want this, and I can see why: change was coming faster than it ever did, from many angles. Let's travel back in time to see why:

The decade after mapreduce appeared - when I came of age as a programmer - was a fascinating time of change:

The backdrop is the post-dotcom-bubble era, when the hype cycle came to a close and the web market consolidated around a smaller set of winners who were now more proven and ready to go all in on a new business model, one that elevated doing business on the web above all else, in a way that would truly threaten brick and mortar.

Alongside that we have CPU manufacturers struggling to keep escalating clock speeds, and instead jamming more transistors into a single die to keep up with Moore's law and consumer demand, which led to the first commodity dual- and multi-core CPUs.

But I remember that most non-scientific software just couldn't make effective use of multiple CPUs or cores yet. So we were ripe for a programming model that engineers who had never heard of Lamport could actually understand and work with: threads, locks, and socket programming in C and C++ were a rough proposition, and MPI was certainly a thing, but the scientific-computing people working on supercomputers, grids, and Beowulf clusters were not the same people as the dotcom engineers using commodity hardware.

Companies pushing these boundaries wanted to do things that traditional DBMSes could not offer at a certain scale, at least not cheaply enough. The RDBMS vendors and priesthood countered that it's hard to offer that while also offering ACID and everything else a database offers, which was certainly not wrong: it's hard to support an OLAP use case with the OLTP-style, System-R-ish design that dominated the market in those days. This was some of the most complicated and sophisticated software ever made, imbued with magic-like qualities from decades of academic research hardened by years of industrial use.

Then there were data-warehouse-style solutions: "appliances" locked into a specific and expensive combination of hardware and software, optimized to work well together and also to extract millions and billions of dollars from the Fortune 500s that could afford them.

So the ethos at the booming post-dotcoms definitely was "do we really need all this crap that's getting in our way?", and we would soon find out. Couching it in formalism and calling it "mapreduce" made it sound fancier than what it really was: some glue that made it easy for engineers to declaratively define how to split work into chunks, shuffle them around, and assemble them again across many computers, without having to worry about the pedestrian details of the glue in between. A corporate drone didn't have to understand /how/ it worked, just how to fill in the blanks for each step properly: a much more viable proposition than thousands of engineers writing software together that involves finicky locks and semaphores.
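That "fill in the blanks" glue can be sketched as a toy, single-process function (the names here are hypothetical; the real system distributed each phase across many machines). Plugging in trivial map and reduce steps reproduces the graph-reversal use case from the original paper, turning (source, target) edges into target -> [sources]:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy sketch of the MapReduce glue: apply the user's mapper to
    every record, shuffle the emitted (key, value) pairs by key, then
    apply the user's reducer to each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Reversing a graph: emit (target, source) for each edge, then
# collect the sources arriving at each target.
edges = [("a", "b"), ("c", "b"), ("a", "d")]
reversed_graph = map_reduce(
    edges,
    mapper=lambda edge: [(edge[1], edge[0])],
    reducer=lambda key, values: sorted(values),
)
# reversed_graph == {"b": ["a", "c"], "d": ["a"]}
```

The "blanks" are just the two lambdas; everything about partitioning, shuffling, and fault tolerance lived in the framework, which is exactly why it was approachable.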

The DBMS crowd thumbed their noses at this because it was truly SO primitive and wasteful compared to the sophisticated mechanisms built to preserve efficiency dating back to the 70s: indexes, access patterns, query optimizers, optimized storage layouts. What they didn't get was that every million dollars you didn't waste on what was essentially the space shuttle of computer software, fabulously expensive and complicated, could now buy a /lot/ more cheapo computing power duct-taped together. The question was how to leverage that. Plus, with things changing at the pace that they did back then, last year's CPU could be obsolete by next year, so how well could the vendors building custom hardware even keep up with that, after you paid them their hefty fees? The value proposition was "it's so basic that it will run on anything, and it's future proof". The democratization aspect could be hard to understand for an observer at that point, because the tidal wave hadn't hit yet.

What came was the start of a transition: from datacenters to rack mounts in colos, from dedicated hosts to virtualization, and very soon after to the first programmable commodity clouds. Why settle for an administered unixlike timesharing environment when you can manage everything yourself and don't have to ask for permission? Why deal with buying and maintaining hardware? This lowered the barrier for smaller companies and startups who previously couldn't afford access to such things, or to the markets that required them, which unleashed what can only be described as a hunger for anything that could leverage that model.

So it's not so much that worse was better, but that worse was briefly more appropriate for the times. "Do we really need all this crap that's getting in our way?" really took hold for a moment, and programmers were willing to dump anything and everything that was previously sacred if they thought it'd buy them scalability: schemas and complex queries, to start.

Soon after, people started figuring out how to maintain all the benefits they'd gained (democratized massively parallel commodity computing) while bringing back some of the good stuff from the past. Only two years later, Google itself published the BigTable paper, describing a more sophisticated storage mechanism that optimized accesses better; it was admittedly tailored for a different use case, but could work in conjunction with MapReduce. Academia and the VLDB/CIDR crowd were more interested now.

Some years after that came the papers for F1 and Spanner, which added back a SQL-like query engine, transactions, secondary indexes, etc., on top of a similar distributed model in the context of WAN-distributed datacenters. Everyone preached the end of NoSQL and document databases, whitepapers were written about "NewSQL", and frustrated veterans complained about yet another fad cycle where what was old was new again.

Of course that's not what happened: the story here was how a software paradigm failed to adapt to the changing hardware climate and business needs, so capitalism ripped its guts apart and slowly reassembled them in a more context-appropriate way. In place of storage engines we got so many things it's hard to keep up, but LevelDB comes to mind as an ancestor. In place of locks we got Chubby and ZooKeeper. In place of log structures we got Kafka and its ilk. In place of query optimizer engines we got Presto. In place of in-memory storage we got Arrow. We got a Cambrian explosion of all kinds of variations and combinations of these, but eventually the market started to settle again, and now we're in a new generation of "no, really, our product can do it all". It's the lifecycle of unbundling and rebundling. It will happen again. Very curious what will come next.

loeg|1 year ago

My recollection of the time is that lots of people thought they needed to use MapReduce for their "big data", but their data was like 100GB of logs they wanted to run an O(N) analysis on.