top | item 30940188

(no title)

nicholashandel | 3 years ago

I think it’s probably best to talk about this comparison in three areas:

Semantics - The MetricFlow spec allows the construction of a much broader range of metrics with much less expression of logic or duplication of that logic than dbt or Cube.

Performance - MetricFlow generates queries that rivals the optimizations of a skilled Data Engineer and builds pre-aggregated tables similar to Cube while dbt builds a static query from a jinja macro.

Interfaces - Cube has some great interfaces for frontend developers, dbt just generates SQL at this point, and MetricFlow has a Python and CLI . The hosted version, Transform, comes with a SQL and GraphQL Interface but that is beyond the scope of the OSS project.

discuss

order

nicholashandel|3 years ago

If you’re interested, the longer version:

Semantics

MetricFlow has a less configuration relative to these other frameworks. We accomplish this by choosing abstractions that allow us to handle more on our side at query time through the DataFlow Plan builder. Working with the SQL constructions as a dataflow enables extensions such as non-dw data sources, or using other languages(Python) for some transformations.

The dbt spec is relatively new and requires a few extremely unDRY expressions. The most obvious is the lack of support for joins which means you simply won’t be able to answer most questions unless you build huge tables. There are a few other issues with the abstractions. For example, dimensions are defined multiple times across metrics. A few folks posted more about these challenges in their Github Issue but they’re sticking to their spec. I’m skeptical it will work at any scale.

The Cube concept is similar to Explores in Looker. They’re limiting because you end up with a bunch of representations of small domains within the warehouse and the moment you hit the edge of that domain you need to add a new Cube/Explore. This is not DRY and it’s frustrating. There is also no first-class object for Metrics which means you’re limited to to relatively simple metric types.

Performance

MetricFlow has the flexibility of the DataFlow Plan Builder and builds quite efficient queries. The Materialization feature allows you to build roll up tables programmatically to the data warehouse which could then be used as a low-latency serving layer.

dbt is a jinja macro and generates a static query per metric requested: [https://github.com/dbt-labs/dbt_metrics/blob/main/macros/get.... This macro will be quite hard to optimize for more complicated metric types. We struggled a ton with this before refactoring our framework to allow the manipulation and optimizations of these DataFlow Plans.

Cube is pretty slick on caching, but I know less about their query optimizations. They have some awesome pre-aggregation and caching features. I think this comes from their background in serving frontend interfaces.

Interfaces

MetricFlow supports a Python SDK and our CLI, today. Transform has a few more interfaces (SQL over JDBC, GraphQL, React) that sit outside the scope of this OSS project.

dbt only builds a query in the dbt context today. TBD what the dbt server does but I imagine it will expose a JDBC for paying customers.

Cube seems more focused on building custom data applications but has recently pivoted to the analytics front. I haven’t seen those interfaces in action but I’m curious to learn more there.