
Joining a billion rows 20x faster than Apache Spark

153 points | plamb | 9 years ago | snappydata.io

83 comments

banachtarski|9 years ago

Am I reading this correctly? The testbed was a single laptop? A big part of Spark is the distributed in-memory aspect, so I'm not sure I understand why any of these numbers mean anything.

jupiter90000|9 years ago

Different example: doing a simple 'group by' SparkSQL query on only about 20 million rows in a distributed Phoenix/HBase table couldn't even be completed, because Spark dumbly shuffled all the data around the cluster. The Spark/Phoenix RDD drivers apparently had no 'group by' push-down support for Phoenix, so they shuffled all the data amazingly inefficiently. Running the same query directly on Phoenix took all of about a minute to finish.

My point is, these 'on a laptop/single machine memory' examples don't really give me an indicator of scenarios where I might actually want to use spark/etc.

jagsr123|9 years ago

From what I know, this test, originally written by Databricks (and expanded here), is meant to tease out the optimizations in the Tungsten engine. Of course, a distributed query that is dominated by shuffle costs will produce a very different result.

sumwale|9 years ago

The test uses 4 partitions (4 threads) and is not single-threaded. Of course, distributing over the network has network costs, which can be significant for sub-second queries but become insignificant for larger data and more involved queries. The Spark execution plan will be exactly the same on a laptop or in a cluster, and these specific queries will scale about linearly.

filereaper|9 years ago

I apologize in advance, but whenever people claim to use an in-memory big-data system, how exactly does this end up working?

You can only stuff so much into memory, so while you can scale up vertically in terms of memory, unless you buy a massive big-iron POWER box you have to scale out horizontally. But with each of these in-memory appliances, what happens when you need to spill to disk?

In essence, why should one bother with these in-memory appliances as opposed to buying boxes with fast SSDs instead? Sure, you spill to disk, but do you take that big of a hit compared to the enormous cost of keeping everything in memory?

stingraycharles|9 years ago

I think there are many use cases. Fraud detection, risk analysis in finance, weather simulations, etc. These don't need to spill out to disk and are a perfect use case for these systems.

A friend of mine works for a company that does high speed weather analysis to make predictions for energy brokers, to predict prices of wind / solar energy on the market. They use these kind of systems extensively, because of the speed and volatility of the data. Fascinating stuff.

blhack|9 years ago

Maybe I'm misunderstanding the problem, but why can't you scale out horizontally?

If the problem is that queries or sets of data might have to jump nodes, couldn't the data be designed in such a way that an assumption is made at write time about what sorts of queries will happen?

Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.
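The scheme described here is essentially hash partitioning on the expected query key. A toy sketch in Python (the node count, routing function, and `user_id` schema are all illustrative, not from the article):

```python
# Toy hash partitioning: route each record to a node by its key, so
# queries on that key touch exactly one node. Queries on any other
# column ("node spanning") would have to fan out to all nodes.

NUM_NODES = 4  # hypothetical cluster size

def node_for(key):
    # Deterministic routing: the same key always lands on the same node.
    return hash(key) % NUM_NODES

# Simulate the write path: each record goes to exactly one node.
nodes = {n: [] for n in range(NUM_NODES)}
for user_id in range(1000):
    nodes[node_for(user_id)].append({"user_id": user_id})

def lookup(user_id):
    # The common-case query hits a single node; no cross-node traffic.
    return [r for r in nodes[node_for(user_id)] if r["user_id"] == user_id]

print(lookup(42))
```

The trade-off is exactly the one described above: lookups on the partitioning key stay local, while queries on anything else pay the fan-out cost.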

alexchamberlain|9 years ago

It's not big data if it fits in memory... This article is demonstrating an architecture that may scale well with big data.

usgroup|9 years ago

Lol was hoping it was a combination of awk and paste :)

That always makes me chuckle.

Honestly though ... Jenkins + bash + cloud storage and you'll be surprised at how many big data problems you can solve with a fraction of the complexity.

makapuf|9 years ago

Pardon my ignorance, but what would you use Jenkins for? Scheduling?

EGreg|9 years ago

This seems like impressive stats about a relational database technology. But the scrolling on their website doesn't work on mobile. So in grand HN tradition, I left and now tell you all about it here, instead of the main point of their invention :)

franciscop|9 years ago

It worked for me, but the browser's nav didn't hide, which I recognize as messing around with absolute/fixed positioning and/or overflows. I'd recommend using media queries to show a simple site on mobile and leaving all the fancy stuff they are surely doing for the desktop only.

Edit: on a second check, it might have to do with that nav that moves the whole page down.

Loic|9 years ago

What is the algorithm used to join the tables? Is it a hash join on `id` and `k` or using the fact that the ids are sorted and using a kind of galloping approach?

jagsr123|9 years ago

Yes, it is a hash join.
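For readers unfamiliar with the technique: a hash join builds an in-memory table keyed on the join column of one side (usually the smaller), then probes it while scanning the other side, giving roughly linear cost instead of the quadratic cost of a nested-loop join. A minimal Python sketch (illustrative only, not SnappyData's actual implementation; the column names mirror the `id`/`k` columns mentioned above):

```python
# Minimal hash join sketch: join two lists of dict-rows on a shared key.
# Build phase hashes the right side; probe phase scans the left side once.

def hash_join(left, right, key):
    # Build phase: map each key value to its right-side rows.
    table = {}
    for row in right:
        table.setdefault(row[key], []).append(row)
    # Probe phase: emit one merged row per matching pair.
    joined = []
    for row in left:
        for match in table.get(row[key], []):
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined

left = [{"id": i, "x": i * 10} for i in range(5)]
right = [{"id": i, "k": i % 2} for i in range(5)]
print(hash_join(left, right, "id"))
```

A galloping/merge approach would instead exploit sorted `id`s to skip ahead on both sides, but a hash join needs no ordering assumption at all.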

alexchamberlain|9 years ago

Python 2.7 can do it in 0.0867 usec (Intel i7):

    $ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
    10000000 loops, best of 3: 0.0867 usec per loop
(Admittedly, I killed `n=10**9; sum(range(1,n+1))`.)
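For anyone puzzled by the snippet: the timeit one-liner doesn't actually sum a billion numbers, it evaluates Gauss's closed form n(n+1)/2, which is why it finishes in a fraction of a microsecond while the naive `sum(range(...))` had to be killed. A quick sanity check at a smaller n (my own illustration, not from the comment):

```python
# The closed form n*(n+1)/2 equals the sum 1 + 2 + ... + n.
# Check agreement at a size that is still cheap to sum directly.
n = 10**6
assert sum(range(1, n + 1)) == n * (n + 1) // 2
print(n * (n + 1) // 2)  # 500000500000
```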

marknadal|9 years ago

Great article, actually. Typical HN comments on performance optimizations are complaints like "this isn't a real world use case" or things like that. Most of them miss that comparing baseline performance metrics between two systems is still genuinely interesting in and of itself, and acts as a huge learning catalyst for understanding what is going on. I think this article did a great job of making an honest comparison and discussing what is going on, so props to the team! (We did something similar as well, where we compared cached read performance against Redis, and were 50X faster - here: https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2... ).

banachtarski|9 years ago

The problem is what "baseline" means. For example, a multithreaded program will always run slower than a single-threaded one when run on one thread, by definition: it has to do work to coordinate the threads. Obviously, this doesn't mean we avoid multithreaded code.

In this case, the software being tested was explicitly written to manage the coordination of data on many nodes, so why is the definition of "baseline" a single laptop? Seems specious.

Bedon292|9 years ago

I know it's just a benchmark for comparison, and it is awesome. I love seeing cool comparisons like this, but why do I care that this particular benchmark is faster than Spark? What sort of analytics will be affected by this improvement, and will it actually save me time on real-world use cases?

plamb|9 years ago

Our impression was that when Databricks released the billion-rows-in-one-second-on-a-laptop benchmark, readers were pretty awed by that result. We wanted to show that when you combine an in-memory database with Spark so it shares the same JVM/block manager, you can squeeze even more performance out of Spark workloads (over and above Spark's internal columnar storage). Any analytics that require multiple trips to a database will be impacted by this design. E.g. workloads on a Spark + Cassandra analytics cluster will be significantly slower, barring some fundamental changes to Cassandra.

jagsr123|9 years ago

Good questions. One answer: speed matters in analytics when working with large data sets. A lot. Think about this: several vendors seem to be claiming support for interactive analytics, i.e. I can ask an ad-hoc question on a large volume of data and get some sort of insight in "interactive" time. Really? Maybe with a thousand cores? In a competitive industry, say investment banking, if one can discern a pattern before the competition, it simply provides an edge. Ask the folks involved in detecting fraud, or online portals trying to place an appropriate ad, or a manufacturing plant trying to anticipate/predict faults. It isn't so much about trying to prove we are better than Spark (well, other than grabbing some attention :-) ) but rather the potential to live in a world where batch processing is a thing of the past - like working with mainframes. The hope is that we can gain insight instantly. FWIW, we love Spark and expect some of these optimizations to simply become part of Apache Spark.

luckydata|9 years ago

It's literally impossible to test every possible workload for a tool like Spark, so what's the point of asking? Stand up a testing cluster, run some of your jobs, and you'll get the answer.

supergirl|9 years ago

Why would you choose values between 1 and 1000 for the right side? Why not 1000 values between 1 and 1 billion?

zzleeper|9 years ago

In case the author reads this: I can't read that font well unless I zoom in all the way. This doesn't happen with anything else (Win10, 14in laptop, Chrome).

plamb|9 years ago

The font in the embedded gists or the font on the page?