top | item 42032817

snidane | 1 year ago

The architecture is sound - typically called ELT these days. Dump the contents of upstream sources straight into a database and apply stateless, deterministic operations to produce the final result tables.
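That load-then-transform flow is just SQL over a raw table; a minimal sketch using Python's bundled sqlite3 (the table and column names are made up for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Extract + Load: dump the upstream payloads as-is into a raw table,
# no cleaning or schema-fitting on the way in.
con.execute("CREATE TABLE raw_events (payload TEXT)")
rows = [
    '{"user": "alice", "amount": 3}',
    '{"user": "bob", "amount": 5}',
    '{"user": "alice", "amount": 2}',
]
con.executemany("INSERT INTO raw_events VALUES (?)", [(r,) for r in rows])

# Transform: a stateless, deterministic query builds the final result table.
# Re-running it against the same raw data always yields the same output.
con.execute("""
    CREATE TABLE totals AS
    SELECT json_extract(payload, '$.user') AS user,
           SUM(json_extract(payload, '$.amount')) AS total
    FROM raw_events
    GROUP BY 1
""")
print(con.execute("SELECT user, total FROM totals ORDER BY user").fetchall())
# → [('alice', 5), ('bob', 5)]
```

Because the transform reads only the raw table and has no side state, you can drop and rebuild the result tables at any time.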

SQL Server is where this breaks, though. You'll get yelled at by DBAs for bad db practices like storing wide text fields without casting them down to varchar(32) or varchar(12), string primary keys or no indexes at all, and most importantly for these raw dumps taking the majority of the storage on the db host. SQL Server and any traditional database scale by adding machines, so you end up paying compute costs for your storage.

If you use a shared-disk system with compute scaling decoupled from storage, then your approach is the way to go. Ideally these days you dump your files into object storage like S3 and slap a table abstraction over it with some catalog, and now you have ~100x lower storage costs and roughly 5-10x more compute power with things like DuckDB. Happy data engineering!

WorldMaker | 1 year ago

It amazes me how many DBAs think the limit on a varchar column impacts the disk space. The "on disk" sizes for `varchar(12)`, `varchar(32)`, and `varchar(MAX)` are roughly the same and depend on the data itself more than the schema. That's what the "var" in "varchar" means: variable storage size. Limits like (32) were added for compatibility with `char` and for type-based "common sense" validation. Sure, that helps prevent footguns like accidentally DDoSing yourself by ingesting too much data too quickly, but there are other ways to do that basic top-level validation of "is this too much data to insert?".

Five varchar(12) columns are more storage overhead than one varchar(60). There are a lot of great use cases for varchar(MAX), and everyone who ever told me varchar(MAX) wasn't allowed didn't understand the DB storage internals they thought they did, yet somehow still believed, in their mental model of the DB, that varchar is just spicy char with fixed column size allocation.
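A back-of-the-envelope for that overhead claim, assuming SQL Server's row format where each variable-length column adds a 2-byte entry to the row's column offset array (a sketch of the per-column bookkeeping only, not exact page math):

```python
# Each variable-length column costs a 2-byte entry in the row's
# variable-length column offset array (SQL Server row format).
OFFSET_ENTRY_BYTES = 2

def var_column_overhead(num_columns: int) -> int:
    """Bytes of per-row bookkeeping for the variable-length columns alone."""
    return num_columns * OFFSET_ENTRY_BYTES

payload = 60  # the same 60 bytes of actual character data either way

five_narrow = payload + var_column_overhead(5)  # five varchar(12) columns
one_wide = payload + var_column_overhead(1)     # one varchar(60) column
print(five_narrow, one_wide)  # → 70 62
```

The declared limit never appears in the arithmetic - only the actual data length and the per-column bookkeeping do, which is the commenter's point.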

icedchai | 1 year ago

With Postgres, we mostly just use `text` everywhere, unless there is an actual reason to have a size limit.

In other news, I haven't seen a dedicated "DBA" at a company in over a decade.

sevensor | 1 year ago

My experience with Delta was that the catalog, being stored in S3 itself, was unacceptably slow, and for our data volume, Airflow was prohibitively expensive. We spent a lot of engineering time working around both problems. Which is funny, because the consultants who advised us to do this told us it was the best possible solution; tailor-made for our application, foolproof in every way. After that we proceeded to pay for their “data” “science” “services,” which went about as well as my scare quotes would suggest.

jamesblonde | 1 year ago

You're basically describing the Lakehouse Tables architecture. Store your data as tabular data in Iceberg/Hudi/Delta on S3. Save a bucket on storage. Query with whatever engine you like (Snowflake, Redshift, BQ, DuckDB, etc).

aoeusnth1 | 1 year ago

Yes, this is the vast majority of my data work at Google as well. Spanner + Files on disk (Placer) + distributed query engine (F1) which can read anything and everything (even google sheets) and join it all.

It’s amazingly productive and incredibly cheap to operate.