top | item 37209742

ironchef | 2 years ago

Here was my situation: occasional queries over a couple petabytes of data. Customer-facing, so responses in seconds per the SLA, but more than 95 percent of the time the warehouse isn't running. Queries cached within the last 24 hours don't require the warehouse to even spin up. Our Snowflake costs were significantly less than an FTE.

Would that potentially be a situation in which “running your own” doesn’t make sense?
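The parent's economics are easy to sanity-check. Here is a back-of-envelope sketch of that cost structure; all of the numbers (credit price, burn rate, query volume, cache hit rate) are illustrative assumptions, not real pricing, and it ignores per-query billing minimums:

```python
# Back-of-envelope sketch of "occasional queries + result cache" economics.
# Every constant below is an assumption for illustration, not real pricing.

CREDIT_COST_USD = 3.00     # assumed cost per warehouse credit
CREDITS_PER_HOUR = 8       # assumed burn rate for a mid-size warehouse
QUERY_SECONDS = 30         # occasional query, answered in seconds
QUERIES_PER_DAY = 40
CACHE_HIT_RATE = 0.60      # cached results skip the warehouse entirely

def monthly_compute_cost() -> float:
    """Cost of only the queries that actually spin the warehouse up."""
    uncached = QUERIES_PER_DAY * (1 - CACHE_HIT_RATE)
    hours_per_day = uncached * QUERY_SECONDS / 3600
    return hours_per_day * CREDITS_PER_HOUR * CREDIT_COST_USD * 30

print(f"~${monthly_compute_cost():,.0f}/month")  # prints ~$96/month
```

Even if you multiply these assumed numbers by an order of magnitude, compute stays far below the cost of an FTE running the equivalent infrastructure, which is the parent's point.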

discuss

order

ramesh31 | 2 years ago

> Would that potentially be a situation in which “running your own” doesn’t make sense?

Look into data lake architectures. RDBMS-based data warehousing is obviously not economical at petabyte scale. But storing all that data in S3 in Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.
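The "read costs are trivial" claim is worth making concrete. A rough sketch, using ballpark S3 Standard list prices as assumptions (real pricing varies by region, tier, and request mix), comparing monthly storage cost against the per-query request cost of a reasonably pruned scan:

```python
# Rough sketch of object-storage economics at lake scale.
# Pricing constants are ballpark assumptions, not quoted rates.

S3_STORAGE_USD_PER_GB_MONTH = 0.023   # assumed S3 Standard storage price
S3_GET_USD_PER_1000 = 0.0004          # assumed GET request price

DATA_TB = 2000            # ~2 PB lake, per the parent comment
FILES_SCANNED = 50_000    # assumed files touched by one partition-pruned query

def storage_cost_per_month() -> float:
    return DATA_TB * 1024 * S3_STORAGE_USD_PER_GB_MONTH

def get_cost_per_query(files: int = FILES_SCANNED) -> float:
    return files / 1000 * S3_GET_USD_PER_1000

print(f"storage: ~${storage_cost_per_month():,.0f}/month")  # ~$47k/month
print(f"reads:   ~${get_cost_per_query():.2f}/query")       # ~$0.02/query
```

Under these assumptions storage dominates by several orders of magnitude, which is the sense in which read costs are "trivial" — though, as the reply below argues, neither number captures the operational cost.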

ironchef | 2 years ago

> Look into datalake architectures.

Yup .. comfy with iceberg/delta/hudi

> RDBMS based data warehousing is obviously not economical at the petabyte scale.

I never said it was .. I'm simply responding to "I simply cannot understand how anyone chooses this over running your own Spark clusters with Jupyterlab". I'm trying to help you understand why folks would choose a SaaS over running your own.

> But storing all that data in S3 with Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.

No. You don't just pay for object storage + minor S3 read costs.

You pay for operations. You pay for someone setting up conventions. You pay to not have to optimize data layouts for streaming writes. You pay to not have to discover race conditions in S3 when running multiple Spark clusters writing to the same Delta tables. You pay to not have to discover that your partitioning/clustering needs have changed based on new data or query patterns.
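The multi-writer race condition mentioned above is the classic one: Delta-style table formats use optimistic concurrency on the table's log version, so concurrent writers can discover at commit time that someone else got there first. A toy in-memory model of that protocol (this is a simplification for illustration, not the actual Delta Lake transaction log, which commits via atomic file creation in the `_delta_log` directory):

```python
# Toy model of optimistic concurrency between writers sharing one table.
# Simplified in-memory sketch, NOT the real Delta Lake implementation.

class Table:
    def __init__(self) -> None:
        self.version = 0   # monotonically increasing commit version
        self.rows: list = []

def commit(table: Table, read_version: int, new_rows: list) -> bool:
    """Commit succeeds only if nobody advanced the version underneath us."""
    if table.version != read_version:
        return False              # conflict: caller must re-read and retry
    table.rows.extend(new_rows)
    table.version += 1
    return True

def write_with_retry(table: Table, new_rows: list, max_tries: int = 5) -> bool:
    """What each Spark job effectively has to do: read, attempt, retry."""
    for _ in range(max_tries):
        if commit(table, table.version, new_rows):
            return True
    return False
```

The operational cost the parent is pointing at is everything around this loop: noticing that independent clusters conflict at all, tuning retries, and handling writers that exhaust them — all of which a managed service absorbs for you.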

But look .. I get it. You've chosen to optimize for cost structures one way, and I've chosen to optimize a different way. In the past I've done exactly as you've said as well. I think seeking to understand _why_ folks may have chosen a different path may help you see other areas to consider in operations.

agent281 | 2 years ago

If you have petabytes of data, I don't think this article is talking about your use case.

sanderjd | 2 years ago

I think it is?

Or I guess, what data size do you think it's talking about? If you only have gigabytes of data, none of this matters, you can use anything pretty cheaply and easily. So is this article just for "terabytes" or does it go up to "hundreds of terabytes" but not "petabytes"?