top | item 29424445

What Is Dbt and Why Are Companies Using It?

93 points | mooreds | 4 years ago | seattledataguy.substack.com

70 comments

[+] beckingz|4 years ago|reply
Data Build Tool makes it easier to get data analysts involved, which helps bring more analytics code into version control. This is a really really good thing given that most data analysts are not software engineers and rightfully prioritize putting their analytical skills to use in the most efficient way possible, which tends to result in analysis getting lost in a data swamp.
[+] jerye|4 years ago|reply
I made a little animated slide deck (https://dbt.picturatechnica.com/) a while back about dbt for data analyst type folks that want an intro. You can slowly step through each slide to get a feel of what dbt is going to build in your datawarehouse.

P.s. I currently work for dbt Labs.

[+] pqyzwbq|4 years ago|reply
I use dbt to build a small Dremio data warehouse. dbt applies software engineering ideas to building data, which is quite awesome. I can use GitHub to version control it, and I can do some unit testing (mainly data quality checks).

Another thing is its macro system. Macros are not a great programming language, but they make integrating a new DBMS much easier, because most third-party plugins are written mainly in macros. So macros are good too, as long as they are provided by other people (and I don't need to write them myself :) )
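For readers who haven't seen one, a dbt macro is just Jinja that expands into SQL. A rough sketch (the macro and model names here are invented for illustration, not from any real adapter):

```sql
-- macros/limit_rows.sql: a hypothetical reusable macro
{% macro limit_rows(relation, n=100) %}
    select * from {{ relation }} limit {{ n }}
{% endmacro %}

-- models/orders_sample.sql: invoking the macro inside a model
{{ limit_rows(ref('raw_orders'), n=10) }}
```

Adapter plugins largely follow the same pattern, overriding dbt's built-in macros with dialect-specific SQL for the target warehouse.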

[+] benjaminwootton|4 years ago|reply
Not to steal the OP's thunder, but I have also created a few resources on DBT if they help.

A demo of integrating DBT and Snowflake https://timeflow.systems/connecting-dbt-to-snowflake

How DBT DevOps enables data teams: https://timeflow.academy/blog/how-dbt-devops-enables-data-te...

I am also writing a tutorial on DBT, though there are a few rough edges still: https://timeflow.academy/dbt-introduction

[+] ramraj07|4 years ago|reply
The issue with dbt is that folks now end up needing to spin up the entire repository just to run a bunch of SQL queries. The problem is that this is a tool targeted at teams without rigorous engineering backgrounds, so it's bound to be used in suboptimal ways. If the dbt team can do something about that, it could be even better.
[+] adeelk93|4 years ago|reply
Before dbt, I didn't even really think about structuring how I'd be running my queries. My experience with it has been - it's made it clear to me what has been sub-optimally laid out. Organize my fact tables by subject, adopt a common schema before applying business logic, utilize re-usable views instead of copy-pasting CTEs, etc.

And I like its tight git integration. The alternative, quite often, is a folder of scripts on an analyst's desktop.
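The "re-usable views instead of copy-pasted CTEs" pattern looks roughly like this in dbt (model and column names are invented for illustration): the shared logic lives in one model file, and downstream models pull it in with `ref()`:

```sql
-- models/stg_orders.sql: shared staging logic, defined once
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('shop', 'raw_orders') }}

-- models/fct_daily_orders.sql: a downstream model reuses it via ref()
select
    order_date,
    count(*) as order_count,
    sum(amount) as revenue
from {{ ref('stg_orders') }}
group by order_date
```

The `ref()` calls are also how dbt builds its dependency graph, so models always run in the right order.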

[+] bearfood|4 years ago|reply
Shameless, but very relevant plug:

My team has been working to solve this problem (and more). We recently released our CLI tool, palm, as OSS, along with a plugin, palm-dbt, which we designed specifically for working with dbt projects in an optimal way.

Palm requires running code in containers, so teams can ensure dependencies and versions are correct. Then the plugin provides sensible defaults, with solutions for common problems (like idempotent dev/ci runs) and an abstraction in front of the most common commands you need to run in the container.

It's been a huge boon for our analysts' productivity, improved the reliability and quality of our models, and decreased onboarding times for new people joining our data team.

Links: https://github.com/palmetto/palm-cli https://github.com/palmetto/palm-dbt

[+] te_chris|4 years ago|reply
I'm the CTO of a 15 person startup and dbt (along with Mode, Stitch and BigQuery) has been a superpower allowing me to moonlight as a CDO (or some shitty pastiche of one) and push forward our data usage and competence in ways I just didn't think possible. Modern data stack is a marvel in what it unlocks.
[+] xyzzy_plugh|4 years ago|reply
I can't argue that dbt isn't great -- it is. It is, however, unfortunate that Python is still the dominant lingua franca here. Between all the Python data tools and JSON being shuffled around a la singer.io, I just can't help but think there is a huge opportunity awaiting the engineers who build a company around an open source set of products using rigorous languages and schemas.

It's a goddamn madhouse.

[+] oroul|4 years ago|reply
Does it matter that dbt is written in Python? dbt models are still SQL at heart. Sure there's the addition of Jinja to enable references for data lineage and configuration, but it's all compiled to SQL with fine-grained control over what's produced.

Forgive me if I come across as combative, but I don't understand generic appeals to things like a language being rigorous. Rigorous to what end? What problem is it solving where that is required? If you know something specific in this domain where that level of rigor is needed, why not share what it is?

There are a lot of problems in the analytics space (and a lot of opportunity for new practices, tools, and businesses), but I would argue that at the end of the day the primary issue is whether or not data producers choose to model data such that it is legible outside of the system that produced it much more than it is about any particular language or methodology.

[+] theplague42|4 years ago|reply
Well, part of the benefit is rapid development; it's mind-boggling how quickly someone can stand up a dbt project and begin to iterate on transforms. Using Python/SQL/JSON (at small/medium scales) keeps the data stack consistent and lowers the barrier to entry. No reason to prematurely optimize when your bottleneck is the modeling and not the actual data volume.
[+] tomnipotent|4 years ago|reply
> using rigorous languages and schemas

And what value does it add?

A vast majority of companies are working with < 1TB of data that sits neatly in a single cloud database. Python and tools like dbt are fantastic for a huge class of problems without compromising workflow velocity, and pushing transformations into SQL removes most Python-bound performance constraints.

Changing singer.io to require Thrift or protobuf schemas isn't going to add the value you think it is. How data is shuffled between systems is considerably less important and time consuming than figuring out how to put that data to work.

[+] killingtime74|4 years ago|reply
There are only languages people hate and languages no one uses
[+] jgraettinger1|4 years ago|reply
Our own approach is to keep singer.io et al and JSON (flexible! streaming capable!), but deeply invest in JSON Schema to a) statically infer and map into other kinds of schema on your behalf -- TypeScript, Elastic Search, and SQL DDL so far -- and b) optimize validation so it's fast enough to be in the high-scale critical path. If you validate every document before it's read or written by your hacky transform, you've gone a long way towards limiting the bug blast radius.
[+] smoyer|4 years ago|reply
On the other hand, it's convenient to use the same language as Jupyter Notebooks.
[+] amznbyebyebye|4 years ago|reply
Can you give a before/after of what the desired state would look like?
[+] jeanlaf|4 years ago|reply
Might be worth looking into Airbyte!
[+] lmm|4 years ago|reply
Something like Databricks?
[+] orzig|4 years ago|reply
I am a user, and what sets them apart even beyond the product is the amazing community they have built that celebrates the unsexy aspects of data science. Just watch some videos from their annual conference; people are so excited to talk about rationalizing business logic.
[+] norton120|4 years ago|reply
One of the better ways I've seen dbt described is very similar to an MVC framework for data (I didn't see that mentioned yet, if it has been sorry I missed it). Where a rough mapping would be:

model >> source

controller >> ephemeral model

view >> materialized (view, table, incremental) model

Like many MVCs, it provides abstractions and helpers for common tasks such as performing incremental updates, snapshotting time series, and common manipulations like creating date spines. Like many MVCs, it supports a robust ecosystem of plugins that make it easy to re-use stable transforms for common datasets. Where Django passes you the request object instead of making you hand-parse HTTP requests, dbt lets you loop through a set of derived schemas and DRY out your SQL code.

You would generally use dbt as a transform framework for the same reason you'd use Ruby on Rails or Django etc as a web framework - because it provides you with a ton of otherwise repetitive non-differentiating code pre-baked and ready to go. You could keep a folder of sql files you arbitrarily run, and you could roll your own web framework from scratch. Personally I wouldn't do those things.
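The "loop through a set of derived schemas and DRY out your SQL" point above can be sketched with a standard Jinja loop (the payment-method columns are invented for illustration):

```sql
-- One pivoted column per payment method, generated instead of hand-written
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount
        {%- if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by order_id
```

Adding a new payment method becomes a one-line change to the list rather than a copy-pasted `case` expression.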

[+] scomp|4 years ago|reply
I recently applied for a job in a data warehousing team that was using this. They were using it to replace their transformation stage, moving off legacy tools like SSIS, DataStage, and Informatica. It was certainly impressive and looked great to work with. I can see dbt becoming a lot more common going forward.
[+] glogla|4 years ago|reply
Is there dbt competition? I know of https://dataform.co that looked pretty similar, but that got bought by Google so I don't know if it would be a good investment now.

Any other?

[+] theplague42|4 years ago|reply
dbt and ELT in general are such a game-changer for allowing rapid iteration on business logic by data analysts; the feedback loop feels much more like "normal" software engineering compared to legacy systems.
[+] pkaye|4 years ago|reply
How long before Amazon incorporates this as one of their services?
[+] jorpal|4 years ago|reply
I’m sure they’re working on it. Databricks, however, also has Delta Live Tables coming out soon I believe.
[+] andygrd|4 years ago|reply
Having used it for a little over a year now, I can say that its strengths may lie in making junior developers' code freer of bugs and dependency issues. It's just additional overhead when you're trying to do anything more complex, like a Kimball type 2 slowly changing dimension; then it's just a blocker. Unfortunately, as it becomes a de facto build environment, its limitations start getting applied to everyone.
[+] antruok|4 years ago|reply
The snapshots feature should handle the slowly changing dimensions. But are you saying it's not flexible enough for certain edge cases?
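For reference, a minimal dbt snapshot (source and column names are illustrative) that implements type 2 SCD tracking; dbt adds valid-from/valid-to metadata rows as the source changes:

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```

The `timestamp` strategy detects changes via the `updated_at` column; a `check` strategy that compares a list of columns is also available for sources without a reliable timestamp.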
[+] jgalt212|4 years ago|reply
> 15,000 members of a Slack community.

That's an impressive number of highly paid people. Nice work by dbt's people.

[+] nojito|4 years ago|reply
dbt is the quickest way to watch your cloud costs skyrocket.

I really do hope people don’t think this is a way to not have dedicated data engineers on staff and instead “empower” analysts to do things for themselves.

[+] boxysean|4 years ago|reply
[dbt Labs employee here]

Cloud costs increasing could be a sign of more data being utilized productively. In any case, a mixture of data engineers and analysts is a healthy way to scale a dbt project, balancing speed of delivery of analytics requests against cloud costs. We at dbt Labs encourage the "analytics engineering" mindset to bring software engineering best practices into the dbt analytics workflow (git version control, code reviews), and so cloud cost considerations should be incorporated into mature dbt development practices.

[+] bxji|4 years ago|reply
I’m still a little confused on what dbt does after reading the article. Is it like Trino that generates materialized views as output, with built-in version control and other features?
[+] dikei|4 years ago|reply
In your case, dbt would be the tool to manage your Trino view programmatically:

* You define the SQL queries that select the input data as models, plus the dbt scripts (also SQL) that combine/manipulate those models.

* On running, dbt generates the Trino SQL queries to join/transform these models into the final result (also a model).

* Depending on your configuration, any model can be stored as a Trino view, or it can be materialized as a data table that's updated every time you re-run the dbt scripts.
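The view-vs-table choice is a one-line config at the top of the model file (the model body here is an invented placeholder):

```sql
-- models/orders_summary.sql
{{ config(materialized='view') }}  -- or 'table' / 'incremental' to persist

select
    order_date,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by order_date
```

Because the materialization is just config, you can develop against cheap views and switch hot paths to tables later without rewriting any SQL.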

[+] adeelk93|4 years ago|reply
I'd list the DAG and testing as core features as well, but yeah, basically. In a very transparent and non-magical way (I despise magical data solutions). If that summary doesn't resonate with you, it's probably not the tool for you. No need to force it.

I'm not familiar with Trino at all, but that sounds like a specific database. dbt is not a database, it is tooling for working with databases.

[+] dikei|4 years ago|reply
DBT is really nice, but not a game changer, since it's basically a SQL generator and is only as good as the underlying SQL engine of your data warehouse.
[+] lunatuna|4 years ago|reply
We’ve recently introduced dbt, fivetran and snowflake and it has been amazing on many levels. Speed of development along with CI/CD has been fantastic. Clarity of thought about organizing our work has helped with the more junior and platform specific people.

Before I got into it I was sceptical, because I was thinking it's just SQL, and SQL is pretty easy and straightforward. It breaks bad habits in DW development.

I'm still not sure I fully get why, but it has been one of the few things I've seen in a long time in this area that has been a big jump in capabilities. There is so much more confidence on my team. We are moving much faster. We can just get on with delivering, without having to undo things or treat everything as precious or dangerous to change.

It took some hands-on work and getting over the initial mind-bend to get there, but I wouldn't go back to what we had. I would likely only make my next move to somewhere that had a similar setup in place or was open to change.

[+] beckingz|4 years ago|reply
The beautiful (and scary) thing about cloud data warehouses is that you can scale them significantly, which means bad SQL mostly becomes a cost issue rather than a speed issue.

Also, dbt makes it easier to persist tables instead of views, which can bring a massive performance improvement.