Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several I've interviewed in the past few years, but it's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard volcano-ish iterator model (with a pretty shitty codegen operator-merging part) built on top of their RDD engine. The exchange (shuffle) is disk-based by default, both for SQL queries and lower-level RDD code, so unless you mount the shuffle directory on a ramdisk I would say Spark is disk-based. You can try it out in spark-shell:

  spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
  spark.read.parquet("sample_data").groupBy($"col").count().count()
After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.
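If you want to poke at it from the same spark-shell session, here's a minimal sketch using plain JVM file APIs (the /tmp location assumes the default spark.local.dir; the /dev/shm path in the comment is just an example of a tmpfs mount):

  // list the block manager scratch dirs Spark created for the shuffle
  // (assumes the default spark.local.dir under /tmp; launching spark-shell
  //  with --conf spark.local.dir=/dev/shm/spark would put them on a tmpfs)
  import java.io.File
  new File("/tmp").listFiles
    .filter(f => f.isDirectory && f.getName.startsWith("blockmgr-"))
    .foreach(println)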
bdndndndbve|1 year ago
ignoreusernames|1 year ago
I see your point, but that's only true within a single stage. Any operator that requires repartitioning (groupBys and joins, for example) has to write to disk.
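You can see that boundary in the physical plan: everything below the Exchange node runs in one stage, and the Exchange itself is the disk-backed shuffle. A quick sketch against the sample_data from above (the exact plan shape varies by Spark version, especially with AQE enabled):

  spark.read.parquet("sample_data")
    .groupBy($"col")
    .count()
    .explain()
  // prints something like (abridged):
  //   HashAggregate(keys=[col#...], functions=[count(1)])
  //   +- Exchange hashpartitioning(col#..., 200)   <- the shuffle / stage boundary
  //      +- HashAggregate(keys=[col#...], functions=[partial_count(1)])
  //         +- FileScan parquet [col#...]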
> [...] which used to be a point of comparison to MapReduce specifically.
So each mapper in Hadoop wrote partial results to disk? LOL, that was way worse than I remembered, then. It's been a long time since I've dealt with Hadoop.
> Not ground-breaking nowadays but when I was doing this stuff 10+ years
I would say it wouldn't have been ground-breaking even 20 years ago. I feel like Hadoop's influence held our entire field back for years. Most of the stuff that Arrow made mainstream, and that's now used by a bunch of the engines mentioned in this thread, has been known for a long time. It's like, as a community, we had blindfolds on. Sorry about the rant, but I'm glad the Hadoop fog is finally dissipating.
hipadev23|1 year ago
https://people.csail.mit.edu/matei/papers/2010/hotcloud_spar... (the original Spark paper, HotCloud '10)