tlento | 3 years ago
> With data sources, I mean if you have multiple services/products delivering data (e.g. an analytics platform, a CRM, all hosted on different servers).
MetricFlow does not support this today. The model we are working with is of an analytics team relying on a centralized data warehouse service - hence the initial support for Redshift/Snowflake/BigQuery instead of Postgres/MySQL.
Deriving data from multiple input services is actually very complicated, because at some point you need to solve the cross-service join (or union) problem. This requires us to either merge everything into a single service layer (which isn't really appropriate for MetricFlow; there are entire service packages dedicated to just this problem) or keep track of input data lineages throughout the metric model.
My recommendation, if you need this, is to get an ETL service to transfer data from these different data service layers into a unified warehouse and then use that as your MetricFlow input source. This will be cleaner and easier to manage for you in the long run, even though it adds a bit of cost up front.
> UXWizz is self-hosted, so you just have a basic MySQL/MariaDB database
Oh, nice, that was the bit I was missing. In this case we will support it as soon as someone adds a MySQL client, but you'll likely have to bypass all of the UXWizz bindings and connect to the MySQL instance directly.
> Oh, so this is like saving queries? Why would I write the MetricFlow config instead of saving the SQL query directly?
Yes, it is, and you probably wouldn't want to do this in MetricFlow.
MetricFlow's data source config allows you to specify a SQL query inline:
https://docs.transform.co/docs/metricflow/guides/best-practi...
The data source itself essentially gives MetricFlow a base "table"-like construct that we can query on behalf of the user. So you define measures (which are basically aggregations), dimensions (attributes used for grouping and filtering), and identifiers (which you can use to link to other dimensions).
Ideally this would all be stored in a table in your data service already, so you can just provide us with a pointer to that table and use the measure/dimension/identifier elements to define what amounts to a very simple view over the underlying SQL table. You'd do this less for its own sake and more because it is the basis of the consistent metric computation and simplified dimension access MetricFlow provides.
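To make that concrete, a minimal data source definition looks roughly like this (all names are invented and the field names are from memory, so treat it as a sketch rather than exact syntax):

```yaml
# Sketch of a MetricFlow data source over an existing warehouse table.
# Names and fields here are illustrative; check the docs for the exact schema.
data_source:
  name: transactions
  sql_table: analytics.fct_transactions   # pointer to the underlying table
  measures:
    - name: transaction_total
      agg: sum                            # measures are basically aggregations
      expr: amount_usd
  dimensions:
    - name: ds
      type: time                          # attributes for grouping/filtering
    - name: country
      type: categorical
  identifiers:
    - name: user
      type: foreign                       # links out to dimensions elsewhere
      expr: user_id
```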
The SQL query construct is in place to allow you to do some lightweight manipulation of the input tables in MetricFlow, because sometimes people need to do a little bit of filtering or transformation and either cannot or prefer not to do it at a lower layer. When placed inside MetricFlow proper, these queries tend to be extremely simple.
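For the inline case, you swap the table pointer for a query. Again a sketch with invented names, not exact syntax:

```yaml
data_source:
  name: completed_transactions
  # Inline SQL for lightweight filtering; keep it as simple as possible.
  sql_query: |
    SELECT id, user_id, amount_usd, ds
    FROM analytics.fct_transactions
    WHERE status = 'completed'
```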
The example I gave about splitting a massive JSON blob table is quite extreme, and I should have been clearer about this - I would not recommend using MetricFlow to do it. That's better handled by something end users of the metric system never have to see at all, whether that's traditional ETL processing on ingestion from the telemetry system into the warehouse, a data mart definition layer like dbt, or something in between like a stack of Airflow operators. If you're in MySQL and your volumes are small you could even use views and then reference the view in MetricFlow (at least in theory - we've never tried anything like this).
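The view route would look something like this (table, columns, and JSON paths all invented for illustration) - you flatten the blob once in MySQL, then point the data source config at the view as if it were a plain table:

```sql
-- Hypothetical MySQL 5.7+ view that flattens a JSON blob into typed columns,
-- so MetricFlow only ever sees a tabular shape.
CREATE VIEW events_flat AS
SELECT
  id,
  created_at,
  JSON_UNQUOTE(JSON_EXTRACT(payload, '$.event_type')) AS event_type,
  CAST(JSON_EXTRACT(payload, '$.duration_ms') AS UNSIGNED) AS duration_ms
FROM raw_events;
```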