Good introduction. Spark is really a project to watch in the field of data analysis on distributed architectures. We ran several benchmarks and Spark keeps its promises: 2.5x faster than Pig for the same algorithm on the same cluster.
For iterative algorithms that can take advantage of in-memory processing, performance is really good compared to Hadoop.
The project is still young with several bugs but the documentation is really good and the code is well commented and robust.
As part of our work we have done extensive comparisons of Spark against Hadoop MapReduce, Naiad and several other frameworks, across various workloads, clusters and cluster sizes. We've found Spark to be temperamental, hard to configure, and wildly variable in performance, suited only to a small set of computations for which in-memory state reuse is beneficial (mostly it isn't).
If you are interested in this, you might like my (warning: shameless plug) 5-part blog series located at http://blog.caseystella.com/pyspark-openpayments-analysis.ht.... I'm using the Python bindings for Spark to illustrate doing data analysis on healthcare financial data on Hadoop.
Spark is nice, but its memory model almost requires a full cluster overhaul. We looked at it at my last project. Our cluster nodes only had 64 GB of RAM. That was carved into 4 GB workers. In order to use Spark we'd have to halve our number of workers because of the memory requirements.
Neat project. Has its place. Requires a different cluster configuration which might limit its utility.
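To put rough numbers on the comment above, here is a back-of-the-envelope sketch. It assumes the full 64 GB is handed to workers (no OS or daemon overhead) and that "halving the number of workers" means doubling each worker to 8 GB; the 8 GB figure and the helper name `workers_per_node` are my own assumptions for illustration, not anything stated by Spark or by the original poster.

```python
def workers_per_node(node_ram_gb: int, worker_ram_gb: int) -> int:
    """Number of fixed-size workers that fit in one node's RAM."""
    return node_ram_gb // worker_ram_gb


NODE_RAM_GB = 64          # from the comment: each cluster node has 64 GB
CURRENT_WORKER_GB = 4     # from the comment: RAM carved into 4 GB workers
SPARK_WORKER_GB = 8       # assumption: half as many workers => twice the RAM each

current = workers_per_node(NODE_RAM_GB, CURRENT_WORKER_GB)
with_spark = workers_per_node(NODE_RAM_GB, SPARK_WORKER_GB)

print(f"workers per node today: {current}")
print(f"workers per node with larger Spark workers: {with_spark}")
```

Under those assumptions each node drops from 16 workers to 8, which is the trade-off being described: more memory per worker, less parallelism per node.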
I've just started using this at work. It's far easier to jump into than MapReduce; orders of magnitude easier. Hopefully I'll be able to contribute back to the project at some point.
[+] [-] jnaour|11 years ago|reply
[+] [-] deadgrey19|11 years ago|reply
In nearly every test Naiad has beaten Spark.
More info on Naiad: http://research.microsoft.com/en-us/projects/naiad/
[+] [-] frak_your_couch|11 years ago|reply
[+] [-] virmundi|11 years ago|reply
[+] [-] sitkack|11 years ago|reply
[+] [-] krigi|11 years ago|reply
[+] [-] markivraknatap|11 years ago|reply