At Monte Carlo, we did some work on root cause analysis (RCA) for data failures such as ETL job failures, timeouts, and data delays. I think there's a lot that can be done from a data science perspective to automate RCA or provide better insights into data pipeline problems.
We put together this blog post showing how an orchestration DAG (like a dbt schedule DAG) can be converted into a Bayesian network (BN). You can then ask causal attribution questions in the form of conditional probability queries against the BN.
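To give a rough feel for what that looks like, here is a toy sketch in plain Python. It is not the code from the blog post. The job names, the fault priors, and the rule that a failure always propagates downstream are all assumptions made up for illustration, and the query is answered by brute-force enumeration rather than a proper BN library.

```python
from itertools import product

# Hypothetical pipeline: each job maps to the jobs it depends on (its parents).
DAG = {
    "extract_orders": [],
    "extract_users": [],
    "join_tables": ["extract_orders", "extract_users"],
    "publish_report": ["join_tables"],
}

# Assumed prior probability that each job breaks on its own (made-up numbers).
FAULT_PRIOR = {
    "extract_orders": 0.05,
    "extract_users": 0.01,
    "join_tables": 0.02,
    "publish_report": 0.01,
}


def failed(job, faults):
    """Assumed propagation rule: a job fails if it has its own fault
    or if any of its upstream jobs failed."""
    return faults[job] or any(failed(parent, faults) for parent in DAG[job])


def posterior_fault(candidate, observed_failed):
    """P(candidate has its own fault | observed_failed failed),
    computed by enumerating every combination of independent faults."""
    jobs = list(DAG)
    joint = 0.0     # P(candidate fault AND observed failure)
    evidence = 0.0  # P(observed failure)
    for values in product([False, True], repeat=len(jobs)):
        faults = dict(zip(jobs, values))
        weight = 1.0
        for job in jobs:
            weight *= FAULT_PRIOR[job] if faults[job] else 1 - FAULT_PRIOR[job]
        if failed(observed_failed, faults):
            evidence += weight
            if faults[candidate]:
                joint += weight
    return joint / evidence


if __name__ == "__main__":
    # Rank candidate root causes for a failure seen at the end of the DAG.
    for job in DAG:
        print(job, round(posterior_fault(job, "publish_report"), 3))
```

Enumeration is exponential in the number of jobs, so this only works for tiny DAGs. On a real schedule you would want a BN library with proper inference, and conditional probabilities estimated from historical run data instead of hand-picked priors.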
The idea is still pretty basic / preliminary, but I think it could be extended in all sorts of interesting ways, e.g. attributing bad row-level data to upstream transformations.
swordsmith8|3 years ago
Would be interested to hear what people think.