luizfelberti|3 months ago
They used a c5.4xlarge, which has a peak of 10Gbps of network bandwidth; at constant 100% saturation it would take in the ballpark of 9 minutes just to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
Minute differences in how these query engines schedule IO would have drastic effects on the benchmark outcomes, and I doubt the query engine itself was kept constantly fed during this workload, especially when evaluating DuckDB and Polars.
The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
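A back-of-the-envelope sketch of both points (best-case pull time and the instance-size tradeoff). The NIC speeds and hourly prices below are placeholder figures for illustration, not current AWS rates:

    # Rough sanity check of the transfer-time and cost argument above.
    # All bandwidths and prices are illustrative placeholders, not AWS quotes.
    DATASET_GB = 650
    GBIT_PER_GB = 8

    def pull_time_s(nic_gbps: float) -> float:
        """Best-case seconds to read the dataset at full NIC saturation."""
        return DATASET_GB * GBIT_PER_GB / nic_gbps

    instances = [
        {"name": "small (10Gbps, ~$0.70/h)", "nic_gbps": 10, "usd_per_hour": 0.70},
        {"name": "big (100Gbps, ~$4.00/h)", "nic_gbps": 100, "usd_per_hour": 4.00},
    ]

    for inst in instances:
        t = pull_time_s(inst["nic_gbps"])
        cost = inst["usd_per_hour"] * t / 3600
        print(f'{inst["name"]}: {t / 60:.1f} min to pull data, ~${cost:.3f} of instance time')

    # The bigger box costs more per hour but spends far less wall-clock time
    # stalled on the network, so the I/O-bound part of the bill can end up smaller.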
amluto|3 months ago
S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.
switchbak|3 months ago
It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.
layoric|3 months ago
Obviously it's much easier dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.
mrbungie|3 months ago
The query being tested wouldn't scan the full files; in most sane engines it would process much less than 650GB of data (exploiting S3 byte-range reads), i.e. just one column: a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about is the distribution of file sizes, due to API calls + skew, or a query so different from the common access patterns that it skips the metadata/columnar nature of the underlying Parquet (i.e. doing an effective "full scan" over all row groups and/or columns).
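A minimal sketch of what that pruning looks like in practice; the bucket path and column name here are invented, and DuckDB's httpfs extension is just one way to get those S3 range reads:

    # Hypothetical query: the engine fetches Parquet footers and then only the
    # row-group byte ranges for the one column it needs, via S3 range GETs --
    # nowhere near the full 650GB. Bucket, path and column names are made up.
    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL httpfs;")              # S3 support with byte-range reads
    con.sql("LOAD httpfs;")
    con.sql("SET s3_region = 'us-east-1';")

    con.sql("""
        SELECT date_trunc('day', event_ts) AS day, count(*) AS n
        FROM read_parquet('s3://example-bucket/dataset/*.parquet')
        GROUP BY 1
        ORDER BY 1
    """).show()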
> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
That's absolutely right.
mrlongroots|3 months ago
You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth from S3 is about 1GB/s, while CPU memory bandwidth is 100-200GB/s. For a multi-stage job, Spark is a way to pool memory across the cluster and use cluster-internal network bandwidth for shuffling instead of going back to storage.
Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
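As a rough illustration of that hierarchy (the figures below just encode the 20x/20x/10x rule of thumb on top of a ~1GB/s S3 baseline; they aren't measurements):

    # Time to stream 650GB once at each tier, using the crude ratios above.
    DATASET_GB = 650

    tiers = {
        "S3 / storage (~1GB/s)":   1,     # GB/s baseline
        "network (20x)":           20,
        "main memory (20x again)": 400,
        "GPU memory (10x)":        4000,
    }

    for name, gb_per_s in tiers.items():
        print(f"{name:26s} {DATASET_GB / gb_per_s:10.2f} s")

    # Whichever tier your job is actually bound by sets the wall-clock time --
    # and, in the cloud, the bill.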
Scubabear68|3 months ago
The gist was: find your resource limits and saturate them to see what the best possible performance could be, then measure your system and express it as a percentage of optimal. Or, if you can't directly test/saturate your limits, at least be aware of them.
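A sketch of that "percentage of optimal" arithmetic; the bottleneck bandwidth and measured time below are invented placeholders:

    # Compare measured wall-clock time against the time the bottleneck resource
    # would need at 100% saturation. Inputs are placeholders, not benchmark data.
    DATASET_GB = 650
    BOTTLENECK_GB_PER_S = 1.25   # e.g. a 10Gbps NIC fully saturated
    MEASURED_SECONDS = 1100      # what the run actually took

    optimal_seconds = DATASET_GB / BOTTLENECK_GB_PER_S
    efficiency = optimal_seconds / MEASURED_SECONDS
    print(f"optimal {optimal_seconds:.0f}s, measured {MEASURED_SECONDS}s, "
          f"{efficiency:.0%} of optimal")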
bushbaba|3 months ago
BUT the author did say this is the simple stupid naive take, in which case DuckDB and Polars really shined.
luizfelberti|3 months ago
In the c6in and m6in families, and maybe the upper-end 5th gens, you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.