top | item 28535579

rpedela | 4 years ago

I think using a data warehouse as your data lake or lake house is optimal, even for data that isn't relational. Storage is so cheap now, and decoupled from compute for several providers, that I don't even give it a thought. You get a fast, scalable SQL interface, which is still nice and useful even for non-relational data. Then all, or most, of the transformations needed for analysis can be pure SQL using a tool like DBT. In my experience, it greatly simplifies the entire pipeline.
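To make the "SQL interface for non-relational data" point concrete, here is a sketch of querying raw JSON directly in a warehouse. This uses Snowflake-flavored syntax (`variant` column, `:` path access); other warehouses have equivalents, and the table and field names here are made up for illustration:

```sql
-- Land raw JSON events in a warehouse table as-is (no upfront schema).
create table raw_events (payload variant);

-- Query the non-relational data with plain SQL: extract fields from the
-- JSON and aggregate, with no separate processing engine involved.
select
    payload:user_id::string as user_id,
    count(*)                as event_count
from raw_events
where payload:event_type::string = 'page_view'
group by 1;
```

Because storage and compute are billed separately on these platforms, keeping the raw payloads around costs little, and the transformation layer stays in SQL.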

CRConrad | 4 years ago

> pure SQL using a tool like DBT

I don't get it... Looks to me like DBT is a Python SQL wrapper / big library that among other things includes an SQL generator / something else like that -- but not "pure" SQL?

rpedela | 4 years ago

DBT has two main innovations. First, everything is a SELECT statement and DBT handles all the DDL for you; you can still handle DDL yourself if you have a special case. Second, the ref/source macros build a DAG of all your models, so you don't have to think about build order. There are other innovations, but those are the main ones.

You can give it truly pure SQL in both models and scripts, mixing in Jinja only if you need it for dynamic models. But I'd recommend at least using ref/source.
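A minimal dbt model illustrating both points above (the model and upstream names are hypothetical):

```sql
-- models/user_event_counts.sql
-- The entire model is just a SELECT; dbt generates the surrounding
-- CREATE TABLE/VIEW DDL when it runs. The ref() call points at another
-- model, and dbt uses these refs to build the dependency DAG and decide
-- build order automatically.
select
    user_id,
    count(*) as event_count
from {{ ref('stg_events') }}
group by user_id
```

Aside from the `{{ ref(...) }}` macro, this is ordinary SQL your warehouse executes directly, which is what "pure SQL" means here.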