top | item 45923510

(no title)

Yep I think the value of the experiment is not clear.

You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth is 1GB/s from S3. CPU memory bandwidth is 100-200GB/s for a multi-stage job. Spark is a way to pool memory for a large dataset with multiple stages, and use cluster-internal network bandwidth to do shuffling instead of storage.

Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.

discuss

justincormack|3 months ago

Network bandwidth is not 20x storage ant more. An SSD is around 10GB/s now, so similar to 100Gb ethernet.

mrlongroots|3 months ago

I think I'm talking about cluster-scale network bisection bandwidth vs attached storage bandwidth. With replication/erasure coding overhead and the economics, the order of magnitude difference still prevails.

I think your point is a good one in that it is more economics than systems physics. We size clusters to have more compute/network than storage because it is the design point that maximizes overall utility.

I think it also raises an interesting question in that let's say we get to a point where the disparity really no longer holds: that would justify a complete rethinking of many Spark-like applications that are designed to exploit this asymmetry.

wtallis|3 months ago

And that's for one SSD. If you're running on a server rather than a laptop, aggregate storage bandwidth will almost certainly be higher than any single network link.