Scrap your MapReduce – Introduction to Apache Spark

[+] jnaour|11 years ago|reply

Good introduction. Spark is really a project to watch for data analysis on distributed architectures. We have run several benchmarks and Spark keeps its promises: it was 2.5x faster than Pig for the same algorithm on the same cluster.

For iterative algorithms that can take advantage of in-memory data, performance is really good compared to Hadoop.
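To illustrate why iterative algorithms benefit from in-memory data, here is a minimal sketch in plain Python. It is a toy model, not the Spark API: the `Dataset` class, its `cache()` and `read()` methods, and `iterative_mean` are hypothetical names made up for this example. It only counts how many times an iterative job has to re-read its input with and without keeping the data in memory.

```python
# Toy model (NOT the Spark API): count how often an iterative job
# re-reads its input, with and without keeping the data in memory.

class Dataset:
    """Hypothetical stand-in for a dataset that lives on disk."""

    def __init__(self, values):
        self._values = list(values)
        self.disk_reads = 0            # number of full scans from "disk"
        self._keep_in_memory = False   # set by cache()
        self._cached = None            # in-memory copy once loaded

    def cache(self):
        """Request that the data stay in memory after the first read."""
        self._keep_in_memory = True
        return self

    def read(self):
        """Return the data, from memory if cached, otherwise from 'disk'."""
        if self._cached is not None:
            return self._cached
        self.disk_reads += 1
        data = list(self._values)
        if self._keep_in_memory:
            self._cached = data
        return data


def iterative_mean(dataset, iterations=10, step=0.5):
    """Estimate the mean by gradient descent; scans the data every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.read()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= step * gradient
    return estimate


if __name__ == "__main__":
    uncached = Dataset([1, 2, 3, 4])
    iterative_mean(uncached)
    cached = Dataset([1, 2, 3, 4]).cache()
    iterative_mean(cached)
    print("disk reads without cache:", uncached.disk_reads)
    print("disk reads with cache:", cached.disk_reads)
```

In this model the uncached run scans the input once per iteration, while the cached run scans it once in total. That is the kind of state reuse the in-memory approach is meant to exploit; real speedups depend on the workload and cluster.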

The project is still young and has several bugs, but the documentation is really good and the code is well commented and robust.

[+] deadgrey19|11 years ago|reply

As part of our work we have done extensive comparisons of Spark on various workloads, clusters and cluster sizes, comparing it with Hadoop MapReduce, Naiad and several other frameworks. We've found Spark to be temperamental, hard to configure, and wildly variable in performance, suited only to a small set of computations for which in-memory state reuse is beneficial (mostly it isn't).

In nearly every test Naiad has beaten Spark.

More info on Naiad: http://research.microsoft.com/en-us/projects/naiad/

[+] frak_your_couch|11 years ago|reply

If you are interested in this, you might be interested in my (warning: shameless plug) five-part blog series located at http://blog.caseystella.com/pyspark-openpayments-analysis.ht.... I'm using the Python bindings for Spark to illustrate doing data analysis on healthcare financial data on Hadoop.

[+] virmundi|11 years ago|reply

Spark is nice, but its memory model almost requires a full cluster overhaul. We looked at it on my last project. Our cluster nodes only had 64 GB of RAM, carved into 4 GB workers. In order to use Spark we'd have had to halve our number of workers because of the memory requirements.
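As a back-of-the-envelope check of those numbers, here is a short Python sketch. It assumes (this is not stated in the comment) that all 64 GB on a node is split evenly among workers, with nothing reserved for the OS or other daemons.

```python
# Rough arithmetic for the node layout described above.
# Assumption: the full 64 GB is divided evenly among workers.
NODE_RAM_GB = 64
WORKER_RAM_GB = 4

workers_before = NODE_RAM_GB // WORKER_RAM_GB        # workers per node today
workers_after = workers_before // 2                  # after halving the count
ram_per_worker_after = NODE_RAM_GB / workers_after   # memory each worker gets

print(f"{workers_before} workers at {WORKER_RAM_GB} GB each")
print(f"{workers_after} workers at {ram_per_worker_after:g} GB each")
```

Under that assumption, halving the worker count doubles the memory available to each worker, at the cost of half the parallel slots per node.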

Neat project. Has its place. Requires a different cluster configuration which might limit its utility.

[+] sitkack|11 years ago|reply

You need to deploy your MapReduce cluster with Mesos, allowing both Spark and MR to use the cluster at the same time.

[+] krigi|11 years ago|reply

I've just started using this at work. It's far easier to jump into than MapReduce; an order of magnitude easier. Hopefully I'll be able to contribute back to the project at some point.

[+] markivraknatap|11 years ago|reply

Good work. Love the title for your blog too :)

13 comments