top | item 30997737

(no title)

theboat | 3 years ago

Thank you for open sourcing this. More competition in the budding metrics ecosystem is good for end users.

It seems like you think MetricFlow should be the data mart layer and not just the metrics layer. If that's true...why? Why would I join my fact and dimension tables in metricflow instead of in dbt? One of the value adds of dbt is that it centralizes business logic in a single place. Joins are business logic. The industry seems to be moving towards creating very wide data mart tables in dbt and surfacing them to the semantic layer 1:1, or building the metrics layer on top of them.

discuss

tlento|3 years ago

I'd say we think MetricFlow should be able to provide consistent, correct answers to reasonable queries end users of the metric model might ask. To do this across the various data warehouse layouts our users are likely to encounter we must necessarily provide support for dimensional joins. This doesn't mean MetricFlow should displace data mart services - to the contrary, I contend MetricFlow works best when layered on top of a warehouse built on centralized logic for managing its data layout. As an example, we generally push our customers to rely on the sql_table data source definition and push any sql_query constructs down to whatever warehouse management layers they have in place.

That said, you need to support joins, at least in some limited scope, in the semantic metric layer for it to be broadly useful. Consider this scenario - you have your dbt models producing wide tables for reasonable measure/dimension queries, and you have MetricFlow configs for the metric and dimension sets available in your data mart. Now imagine you've also got your finance team hooked up to a Google Sheets connector, and they're looking at revenue and unique customers by sales region. Cool, your wide table has that built in, no joins needed.

But what if they want something new? Let's say they want to know how they're doing against the target addressable market in each country. Should they have to submit a ticket to the data engineering team to add customer.country.market_size to your revenue table? Or should they be able to do "select revenue by customer__country__market_size" and get the report they need?

Our position is that we want to facilitate the latter - people getting what they need and knowing, as long as it's been defined properly in the model, that it's going to produce reasonable results. If your particular organization wants all of those joins run through a data mart ticket queue and surfaced as fully denormalized underlying tables that's fine by us, but most likely that's not what you want. You'd rather have some visibility into the types of joins people are requesting and then build out your data mart to more efficiently serve the requests people have on the ground, while still allowing them to ask new questions of the data without a long development feedback loop.