zlurker | 2 years ago

We orchestrate our ETL pipelines with dagster. We only use duckdb in a few of them but are slowly replacing pandas ETLs with it. For some of our bigger jobs we use spark instead.

Essentially it's:

1. Data sources from places such as S3, SFTP, RDS
2. Use duckdb to load most of these with only extensions (I don't believe there's one for SFTP, so we just have some Python code to pull the files out.)
3. Transform the data however we'd like with duckdb
4. Convert the duckdb table to pyarrow
5. Save to S3 with delta-rs

FWIW, we also have this all execute externally from our orchestration on an EC2 instance. This allows us to scale vertically.

quadrature | 2 years ago

This is very cool!

Last time I checked, duckdb didn't have the concept of a metastore, so do you have an internal convention for table locations and folder structure?

What do you use for reports/visualizations? Notebooks?

zlurker | 2 years ago

Yeah, dagster has a concept of metadata and assets so we have some code that'll map dagster's own logical representation to physical s3 locations.
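The mapping code isn't shown in the thread, but a convention like the one described could be as simple as joining the Dagster asset key's path components under a fixed bucket. This is a hypothetical sketch (bucket name and function are made up; a Dagster `AssetKey(["finance", "daily_revenue"])` exposes its components as `.path`, which is what the list argument stands in for):

```python
def asset_path_to_s3_uri(asset_path: list[str], bucket: str = "my-data-lake") -> str:
    """Map a Dagster asset key path (e.g. AssetKey(["finance", "daily_revenue"]).path)
    to a deterministic S3 location. Hypothetical convention, not the poster's actual code."""
    return f"s3://{bucket}/" + "/".join(asset_path)

# e.g. the logical asset ["finance", "daily_revenue"] always lands in the same prefix
uri = asset_path_to_s3_uri(["finance", "daily_revenue"])
```

A deterministic mapping like this means any pipeline (or an IOManager) can reconstruct where a table lives from its asset key alone, which is what substitutes for a metastore.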

Reports and viz vary a lot; the finance department uses Tableau, whereas for more 'data sciencey' stuff we normally just use notebooks.