top | item 37223516

ironchef | 2 years ago

> Look into datalake architectures.

Yup .. comfy with Iceberg/Delta/Hudi

> RDBMS based data warehousing is obviously not economical at the petabyte scale.

I never said it was .. I'm simply responding to "I simply cannot understand how anyone chooses this over running your own Spark clusters with Jupyterlab". I'm trying to help you understand why folks would choose a SaaS over running their own.

> But storing all that data in S3 with Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.

No. You don't just pay for object storage + minor S3 read costs.
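
To be fair, the raw S3 line items really are small next to everything else; the disagreement is about the hidden costs. A back-of-envelope sketch, using assumed prices (roughly S3 Standard list pricing; check current AWS pricing before trusting these numbers):

```python
# Back-of-envelope S3 cost split: storage vs. read requests.
# Both unit prices below are assumptions, not authoritative figures.
STORAGE_PER_GB_MONTH = 0.023   # USD per GB-month (assumed S3 Standard rate)
GET_PER_1000 = 0.0004          # USD per 1,000 GET requests (assumed)

def monthly_cost(tb_stored: float, monthly_gets: int) -> dict:
    """Split one month's S3 bill into storage and read-request costs."""
    storage = tb_stored * 1024 * STORAGE_PER_GB_MONTH
    reads = monthly_gets / 1000 * GET_PER_1000
    return {"storage": round(storage, 2), "reads": round(reads, 2)}

# 1 PB stored, 100 million GETs in a month: reads are a rounding error
# next to storage, which is the point the quoted comment is making.
print(monthly_cost(1024, 100_000_000))
```

The catch, as below, is that neither number includes the engineering time spent operating the thing.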

You pay for operations.
You pay for someone setting up conventions.
You pay to not have to optimize data layouts for streaming writes.
You pay to not have to discover race conditions in S3 when running multiple Spark clusters writing to the same Delta tables.
You pay to not have to discover that your partitioning/clustering needs have changed based on new data or query patterns.
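
The race-condition point deserves unpacking. Delta-style tables use optimistic concurrency: each writer tries to claim the next numbered entry in a commit log, and on plain S3 (where a PUT is last-writer-wins) losing that race silently corrupts the log unless something provides put-if-absent semantics. A toy model of the commit race, with a lock standing in for the atomic primitive that vanilla S3 historically lacked (names and retry policy are illustrative, not Delta's actual implementation):

```python
import threading

class CommitLog:
    """Toy Delta-style commit log: one winner per version number."""
    def __init__(self):
        self._entries = {}             # version -> writer id
        self._lock = threading.Lock()  # stands in for atomic put-if-absent

    def try_commit(self, version: int, writer: str) -> bool:
        with self._lock:
            if version in self._entries:
                return False           # another cluster committed first
            self._entries[version] = writer
            return True

def write_with_retry(log: CommitLog, writer: str, attempts: int = 10) -> int:
    """Retry at increasing versions until a commit lands.

    This is the logic you end up writing (or paying someone to have
    written) once two Spark clusters share a table.
    """
    version = len(log._entries)
    for _ in range(attempts):
        if log.try_commit(version, writer):
            return version
        version += 1                   # lost the race: rebase and retry
    raise RuntimeError(f"{writer} gave up after {attempts} attempts")
```

For example, if cluster A commits version 0, cluster B's attempt at version 0 fails and it must rebase onto version 1; without the retry loop, B's write is simply lost.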

But look .. I get it. You have chosen to optimize your cost structure one way, and I've chosen to optimize a different way. In the past I've done exactly what you've described. I think seeking to see _why_ folks may have chosen a different path may help you understand other areas to consider in operations.
