dbt (Data Build Tool) makes it easier to get data analysts involved, which helps bring more analytics code into version control. That's a very good thing: most data analysts are not software engineers, and they rightly prioritize putting their analytical skills to use in the most efficient way possible, which tends to result in analysis getting lost in a data swamp.
I made a little animated slide deck (https://dbt.picturatechnica.com/) a while back about dbt for data-analyst-type folks who want an intro. You can step through each slide slowly to get a feel for what dbt is going to build in your data warehouse.
I use dbt to build a small Dremio data warehouse. dbt applies software engineering ideas to building data, and it's quite awesome: I can version control the project with GitHub and run some unit tests (mainly data quality checks).
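For anyone curious what those data quality checks look like: a dbt "singular" test is just a SQL file in the tests/ directory that selects the rows violating an assumption, and dbt fails the run if any rows come back. A minimal sketch (the `orders` model and its columns are hypothetical):

```sql
-- tests/assert_no_negative_order_totals.sql
-- dbt treats every row this query returns as a test failure.
select
    order_id,
    total_amount
from {{ ref('orders') }}
where total_amount < 0
```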
Another thing is its macro system. Macros aren't a great programming language, but they make integrating a new DBMS much easier, because most third-party plugins are written mostly in macros. So macros are fine as long as they're provided by other people (no need to write them myself :) )
The issue with dbt is that folks now end up needing to spin up the entire repository to run a bunch of SQL queries. Since it's a tool targeted at teams without rigorous engineering backgrounds, it's bound to be used in suboptimal ways. If the dbt team can do something about that, it could be even better.
Before dbt, I didn't really think about structuring how I'd run my queries. My experience with it has been that it makes clear what was sub-optimally laid out: organize fact tables by subject, adopt a common schema before applying business logic, use re-usable views instead of copy-pasting CTEs, and so on.
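To sketch the "re-usable views instead of copy-pasted CTEs" point (all model, source, and column names here are hypothetical): the cleanup logic lives once as its own model, and downstream models pull it in with `ref` instead of repeating the CTE:

```sql
-- models/staging/stg_orders.sql: the cleanup logic lives in one place
select
    id as order_id,
    customer_id,
    lower(status) as status,
    created_at::date as order_date
from {{ source('app', 'orders') }}
where status != 'test'
```

```sql
-- models/marts/fct_daily_orders.sql: reuses the staging model via ref()
select
    order_date,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by order_date
```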
And I like its tight git integration. The alternative, quite often, is a folder of scripts on an analyst's desktop.
My team has been working to solve this problem (and more). We recently released our CLI tool, palm, as OSS, along with a plugin, palm-dbt, which we designed specifically for working with dbt projects in an optimal way.
Palm requires running code in containers, so teams can ensure dependencies and versions are correct. The plugin then provides sensible defaults, with solutions for common problems (like idempotent dev/CI runs) and an abstraction in front of the most common commands you need to run in the container.
It's been a huge boon for our analysts' productivity, improved the reliability and quality of our models, and decreased onboarding time for new people joining our data team.
I'm the CTO of a 15-person startup, and dbt (along with Mode, Stitch, and BigQuery) has been a superpower allowing me to moonlight as a CDO (or some shitty pastiche of one) and push forward our data usage and competence in ways I just didn't think possible. The modern data stack is a marvel in what it unlocks.
I can't argue that dbt isn't great -- it is. It is, however, unfortunate that Python is still the dominant lingua franca here. Between all the Python data tools and the JSON being shuffled around à la singer.io, I just can't help but think there is a huge opportunity awaiting the engineers who build a company around an open-source set of products using rigorous languages and schemas.
Does it matter that dbt is written in Python? dbt models are still SQL at heart. Sure there's the addition of Jinja to enable references for data lineage and configuration, but it's all compiled to SQL with fine-grained control over what's produced.
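To make that concrete (model names hypothetical), a model written with a `ref`:

```sql
-- models/fct_revenue.sql, as written by the analyst
select
    customer_id,
    sum(amount) as revenue
from {{ ref('stg_payments') }}
group by customer_id
```

compiles to plain SQL for the target warehouse, with the schema filled in from your profile -- something like:

```sql
select
    customer_id,
    sum(amount) as revenue
from analytics.stg_payments
group by customer_id
```

and the `ref` call is also how dbt knows `fct_revenue` depends on `stg_payments` for lineage and run ordering.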
Forgive me if I come across as combative, but I don't understand generic appeals to things like a language being rigorous. Rigorous to what end? What problem is it solving where that is required? If you know something specific in this domain where that level of rigor is needed, why not share what it is?
There are a lot of problems in the analytics space (and a lot of opportunity for new practices, tools, and businesses), but I would argue that at the end of the day the primary issue is whether data producers choose to model data so that it is legible outside the system that produced it -- far more than it is about any particular language or methodology.
Well, part of the benefit is rapid development; it's mind-boggling how quickly someone can stand up a dbt project and begin to iterate on transforms. Using Python/SQL/JSON (at small/medium scales) keeps the data stack consistent and lowers the barrier to entry. No reason to prematurely optimize when your bottleneck is the modeling and not the actual data volume.
A vast majority of companies are working with < 1TB of data that sits neatly in a single cloud database. Python and tools like dbt are fantastic for a huge class of problems without compromising workflow velocity, and pushing transformations into SQL removes most Python-bound performance constraints.
Changing singer.io to require Thrift or protobuf schemas isn't going to add the value you think it is. How data is shuffled between systems is considerably less important and time consuming than figuring out how to put that data to work.
Our own approach is to keep singer.io et al. and JSON (flexible! streaming-capable!), but deeply invest in JSON Schema to a) statically infer and map into other kinds of schema on your behalf -- TypeScript, Elasticsearch, and SQL DDL so far -- and b) optimize validation so it's fast enough to sit in the high-scale critical path. If you validate every document before it's read or written by your hacky transform, you've gone a long way toward limiting the bug blast radius.
I am a user, and what sets them apart, even beyond the product, is the amazing community they have built that celebrates the unsexy aspects of data science. Just watch some videos from their annual conference; people are so excited to talk about rationalizing business logic.
One of the better ways I've seen dbt described is as something very like an MVC framework for data (I didn't see that mentioned yet; if it has been, sorry I missed it).
Where a rough mapping would be:
model >> source
controller >> ephemeral model
view >> materialized (view, table, incremental) model
Like many MVC frameworks, it provides abstractions and helpers for common tasks such as performing incremental updates, snapshotting time series, and common manipulations like creating date spines. Like many MVC frameworks, it supports a robust ecosystem of plugins that make it easy to re-use stable transforms for common datasets. And where Django passes you the request object instead of making you hand-parse HTTP requests, dbt allows you to loop through a set of derived schemas and DRY out your SQL code.
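That last point can be sketched with dbt's Jinja loop syntax (the regional source and table names are hypothetical):

```sql
-- Hypothetical: the same events table exists in several regional schemas,
-- each declared as a dbt source. The loop unions them without copy-paste.
{% set regions = ['app_us', 'app_eu', 'app_apac'] %}

{% for region in regions %}
select
    *,
    '{{ region }}' as source_region
from {{ source(region, 'events') }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```

Adding a fourth region is a one-line change to the list rather than another pasted query block.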
You would generally use dbt as a transform framework for the same reason you'd use Ruby on Rails or Django etc. as a web framework: because it provides you with a ton of otherwise repetitive, non-differentiating code pre-baked and ready to go. You could keep a folder of SQL files you arbitrarily run, just as you could roll your own web framework from scratch. Personally, I wouldn't do either.
I recently applied for a job on a data warehousing team that was using this. They were using it to replace the transformation stage built on legacy tools like SSIS, DataStage, and Informatica. It was certainly impressive and looked great to work with. I can see dbt becoming a lot more common going forward.
Is there any dbt competition? I know of https://dataform.co, which looked pretty similar, but it got bought by Google, so I don't know if it would be a good investment now.
dbt and ELT in general are such a game-changer for allowing rapid iteration on business logic by data analysts; the feedback loop feels much more like "normal" software engineering compared to legacy systems.
Having used it for a little over a year now, I can say that its strengths may lie in getting junior developers' code more free of bugs and dependency issues. It's just additional overhead when you're trying to do anything more complex, like a Kimball type II slowly changing dimension -- then it's just a blocker. Unfortunately, as it becomes a de facto build environment, its limitations start getting applied to everyone.
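For what it's worth, dbt's built-in mechanism for type II history is the snapshot block; whether it satisfies a strict Kimball design is exactly the kind of thing that gets debated. A minimal sketch (source and column names hypothetical):

```sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- On each run dbt compares against the last snapshot, closes out changed
-- rows, and maintains dbt_valid_from / dbt_valid_to columns -- i.e. a
-- type II history table.
select * from {{ source('app', 'customers') }}

{% endsnapshot %}
```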
dbt is the quickest way to watch your cloud costs skyrocket.
I really do hope people don’t think this is a way to not have dedicated data engineers on staff and instead “empower” analysts to do things for themselves.
Cloud costs increasing could be a sign of more data being utilized productively. In any case, a mixture of data engineers and analysts is a healthy way to scale a dbt project, trading speed of delivery of analytics requests against cloud costs. We at dbt Labs encourage the "analytics engineering" mindset to bring software engineering best practices into the dbt analytics workflow (git version control, code reviews), and cloud cost considerations should be incorporated into mature dbt development practices.
I’m still a little confused about what dbt does after reading the article. Is it like Trino in that it generates materialized views as output, but with built-in version control and other features?
In your case, dbt would be the tool to manage your Trino view programmatically:
* You define the SQL queries that select the input data as models, plus dbt scripts (also SQL) to combine/manipulate those models.
* On a run, dbt generates the Trino SQL queries to join/transform these models into the final result (also a model).
* Depending on your configuration, each model can be stored as a Trino view or materialized as a table that's updated every time you re-run dbt.
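The materialization choice in that last step is a one-line config at the top of the model. A sketch (model names hypothetical, and exact behaviour depends on the Trino adapter):

```sql
-- models/fct_orders.sql (hypothetical model)
-- Swap 'table' for 'view', 'incremental', or 'ephemeral' to change how
-- dbt builds this model, without touching the query itself.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    amount
from {{ ref('stg_orders') }}
```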
I'd list the DAG and testing as core features as well, but yeah, basically. In a very transparent and non-magical way (I despise magical data solutions). If that summary doesn't resonate with you, it's probably not the tool for you. No need to force it.
I'm not familiar with Trino at all, but that sounds like a specific database. dbt is not a database, it is tooling for working with databases.
We’ve recently introduced dbt, Fivetran, and Snowflake, and it has been amazing on many levels. Speed of development along with CI/CD has been fantastic. Clarity of thought about organizing our work has helped the more junior and platform-specific people.
Before I got into it I was sceptical, thinking it’s just SQL, and SQL is pretty easy and straightforward. But it breaks bad habits in DW development.
I’m still not sure I fully get why, but it has been one of the few things I’ve seen in a long time in this area that is a big jump in capabilities. There is so much more confidence on my team. We are moving much faster. We can just get on with delivering without having to treat everything as precious or dangerous to change.
It took some hands-on work and getting over the initial mind-bend to get there, but I wouldn’t go back to what we had. I would likely only make my next move to a place that had a similar setup or was open to change.
The beautiful (and scary thing) about cloud data warehouses is that you can scale them significantly, which means bad SQL mostly becomes a cost issue as opposed to a speed issue.
Also, dbt makes it easier to persist tables instead of views, which can be a massive performance improvement.
jerye | 4 years ago:
P.s. I currently work for dbt Labs.
benjaminwootton | 4 years ago:
A demo of integrating DBT and Snowflake https://timeflow.systems/connecting-dbt-to-snowflake
How DBT DevOps enables data teams: https://timeflow.academy/blog/how-dbt-devops-enables-data-te...
I am also writing a tutorial on DBT, though there are a few rough edges still: https://timeflow.academy/dbt-introduction
atentaten | 4 years ago:
This link returns a 404
bearfood | 4 years ago:
Links: https://github.com/palmetto/palm-cli https://github.com/palmetto/palm-dbt
xyzzy_plugh | 4 years ago:
It's a goddamn madhouse.
tomnipotent | 4 years ago:
And what value does it add?
lbotos | 4 years ago:
https://www.meltano.com/
41b696ef1113 | 4 years ago:
[0] https://coalesce.getdbt.com
glogla | 4 years ago:
Any other?
jgalt212 | 4 years ago:
That's an impressive number of highly paid people. Nice work by dbt's people.
ziga | 4 years ago:
https://medium.com/@jthandy/how-compatible-are-redshift-and-...