angrybits|10 years ago
You are either being dishonest or just patently out of your mind if you think that you can query 100 MB and 100 PB in the same way. That's not even reasonable by HN standards of hyperbole. Do you have any idea how many orders of magnitude that is?

parasubvert|10 years ago
While it can handle 100 MB easily, there are probably faster ways to handle that small an amount of data. But yes, Spark can handle many PB and doesn't require a ton of changes in the code as you scale up from, say, 10 TB to 100 PB. The underlying cluster would change, and the performance profile would change a lot (10 TB can be done in-memory... many PB, not so much).

parasubvert|10 years ago
Which really isn't intended for 100 MB (I bet I could write a unix pipe & filter script that's faster than Spark), but is intended for 10 TB through several PB.

dang|10 years ago
We're lucky that you didn't spark a horrible flamewar, but instead got patient, factual replies.

MichaelGG|10 years ago
I know. And I'm sorry, bitterly sorry, but I know that... no apologies I can make can alter the fact that in our thread you have been given a dirty, filthy, smelly piece of technical argument.
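parasubvert's "unix pipe & filter" point can be sketched in plain Python (a hypothetical word count over made-up sample data, not anyone's actual benchmark): at 100 MB, a single-process generator pipeline has no cluster, JVM startup, or shuffle overhead to amortize, which is why it can beat Spark on small inputs.

```python
# A "pipe & filter" style word count in pure Python -- a plausible stand-in
# for the unix one-liner parasubvert describes. Each stage is a lazy filter,
# so a ~100 MB file streams through one process with no framework overhead.
import io
from collections import Counter

def word_count(lines):
    """Pipe-and-filter: lines -> words -> lowercased words -> counts."""
    words = (w for line in lines for w in line.split())  # filter 1: tokenize
    normalized = (w.lower() for w in words)              # filter 2: normalize
    return Counter(normalized)                           # sink: aggregate

# Usage: any iterable of lines works -- an open file, sys.stdin, etc.
# The sample text here is invented for illustration.
sample = io.StringIO("Spark can scale\nspark can also be overkill\n")
counts = word_count(sample)

# For contrast, the analogous Spark RDD pipeline reads almost identically,
# which is the scaling point above: the driver code barely changes from
# 10 TB to 100 PB, only the cluster underneath does (assuming pyspark and
# an existing SparkContext `sc`):
#   from operator import add
#   sc.textFile(path).flatMap(str.split).map(str.lower) \
#     .map(lambda w: (w, 1)).reduceByKey(add).collect()
```

The design point of the thread in miniature: the code shape is the same, but the small-data version pays none of the distributed machinery's fixed costs.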