oonny | 9 years ago

Is there a video of a real live example of how Spark helped solve a specific problem? I've tried quite a few times to wrap my head around what Spark helps you solve.

iskander | 9 years ago

In theory, Spark lets you seamlessly write parallel computations without sacrificing expressivity. You perform collections-oriented operations (e.g. flatMap, groupBy) and the computation gets magically distributed across a cluster (alongside all necessary data movement and failure recovery).
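
For example, a word count written against Spark's RDD API reads like ordinary Scala collections code (a minimal sketch; the input path here is made up):

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("word-count").getOrCreate()
        val sc = spark.sparkContext

        // Each step looks like a local collection operation, but runs
        // distributed across the cluster.
        val counts = sc.textFile("hdfs:///logs/input.txt") // hypothetical path
          .flatMap(line => line.split("\\s+"))
          .groupBy(identity)   // triggers a distributed shuffle
          .mapValues(_.size)

        counts.take(10).foreach(println)
        spark.stop()
      }
    }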

In practice, Spark seems to perform reasonably well on smaller in-memory datasets and on some larger benchmarks under the control of Databricks. My experience has been pretty rough for legitimately large datasets (can't fit in RAM across a cluster) -- mysterious failures abound (often related to serialization, fat in-memory representations, and the JVM heap).

The project has been slowly moving toward an improved architecture for working with larger datasets (see Tungsten and DataFrames), so hopefully this new release will actually deliver on the promise of Spark's simple API.

oonny | 9 years ago

Thanks for the reply, but I was looking for a use case, e.g. "with Spark I was able to do X". I don't even know where Spark would be applied.

halflings | 9 years ago

Here's a (simple) problem I solved with Spark:

I had hundreds of gigabytes of JSON logs with many variations in the schema and a lot of noise that had to be cleaned. There were also some joins and filtering that had to be done between each datapoint and an external dataset.

The data doesn't fit in memory, so without Spark you would need to write special-purpose code to parse it, clean it, and do the join, all without crashing your app.

Spark makes this straightforward (especially with its DataFrame API): you just point to the folder where your files are (or an AWS/HDFS/... URI) and write a couple of lines defining the chain of operations you want, then save the result to a file or just display it. Spark runs these operations in parallel by splitting the data, processing the pieces, and joining the results back together (simplifying a bit).
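
A sketch of that kind of job (paths and column names are made up; my actual schema was a lot messier):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CleanLogs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("clean-logs").getOrCreate()

        // Point at a folder of JSON files; Spark infers a merged schema
        // across the varying records.
        val logs  = spark.read.json("s3a://bucket/logs/")
        val users = spark.read.json("s3a://bucket/users/")

        val cleaned = logs
          .filter(col("userId").isNotNull)      // drop noisy records
          .join(users, Seq("userId"))           // join with the external dataset
          .select("userId", "timestamp", "country")

        cleaned.write.parquet("s3a://bucket/cleaned/")
        spark.stop()
      }
    }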

lmm | 9 years ago

I don't know about videos, but I used Spark in my last job to solve problems of "we want to run this linear algebra calculation on x00000 user profiles and have it not take forever". For me the big selling point is that it lets you write code that reads as ordinary Scala but runs on a cluster. As much as anything else, it's practical to get the statistician to review the code and say "yes, that is implementing the calculation I asked you to implement" in a way that wouldn't be practical with more "manual" approaches to running calculations in parallel on a cluster.
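
I can't share the real calculation, but the shape of it was roughly this (Profile, the weights, and the scoring formula here are all stand-ins):

    import org.apache.spark.sql.SparkSession

    case class Profile(id: String, features: Array[Double])

    object ScoreProfiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("score-profiles").getOrCreate()
        val sc = spark.sparkContext

        // Stand-in data and coefficients; the real job loaded far more profiles.
        val profiles = sc.parallelize(Seq(
          Profile("a", Array(1.0, 2.0, 3.0)),
          Profile("b", Array(0.5, 0.1, 0.9))
        ))
        val weights = Array(0.3, 0.5, 0.2)

        // Reads like plain Scala a statistician can review: score each
        // profile with a dot product, but the map runs across the cluster.
        val scores = profiles.map { p =>
          val score = p.features.zip(weights).map { case (x, w) => x * w }.sum
          (p.id, score)
        }

        scores.collect().foreach(println)
        spark.stop()
      }
    }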