top | item 43966388

(no title)

dogman123 | 9 months ago

This could be incredibly useful for me. Currently struggling to complete jobs with massive amounts of shuffle with Spark on EMR (large joins yielding 150+ billion rows). We use Glue currently, but it has become cost prohibitive.

discuss

order

threeseed|9 months ago

You should try using an S3 based shuffle plugin: https://github.com/IBM/spark-s3-shuffle

Then mount FSX for Lustre on all of your EMR nodes and have it write shuffle data there. It will massively improve performance and shuffle issues will disappear.

Is expensive though. But you can offset the cost now because you can run entirely Spot instances for your workers as if you lose a node there's no recomputation of the shuffle data.

winwang|9 months ago

Is the shuffle the biggest issue? Not too sure about joins but one of the datasets we're currently dealing with has a couple trillion rows. Would love to chat about this!