top | item 31662713

Ask HN: How do you automate your data analytics report?

2 points| mohon | 3 years ago

I'm currently working on how to speed up our analytics report development workflow.

So imagine, you have this table called A with this structure

+----------+--------------+---------+----------+

+----------+--------------+---------+----------+

| TX | 1000 | 9000.0 | 8000.0 |

| FL | 1000 | 9000.0 | 8000.0 |

+----------+--------------+---------+----------+

then you want to have another table called B with this structure

+-------+--------------+---------+----------+

| age | order_count | gmv | net_gmv |

+-------+--------------+---------+----------+

| 20-30 | 1000 | 9000.0 | 8000.0 |

| 30-40 | 1000 | 9000.0 | 8000.0 |

| 40-50 | 1000 | 9000.0 | 8000.0 |

+-------+--------------+---------+----------+

The location and age are the dimension needed for the report, eventually we'll be having different dimension needed for our report. What we're doing now is we develop a Spark-SQL job for each table. But we think this is not gonna scale because every time we want to add new dimension, we need to develop the Spark-SQL job again (same logic but different group by dimension)

So I'm wondering whether there's a better way to do this. Anyone has any experience with this kind of problem before? Any pointer how to do this efficiently (I'm thinking someone could just specify the dimension they need and there'll be a script where it'll automatically generate the new table based on the specified dimension)

Thanks

9 comments

mattewong|3 years ago

There is a better way to do this.

An example of output that fits this paradigm you describe-- but to a much further degree-- would be the dozens of tables shown in the securities offering described at https://www.sec.gov/Archives/edgar/data/0001561167/000114420... search for "Stated Principal Balances of the Mortgage Loans as of the Cut-off Date" (on page A-3).

How do you generate reports like this in a manner that is flexible for end users without requiring IT in the middle? You start with the bare input, which is: data + report logic:

1. Specify the common columns that you want in your output tables (i.e. columns other than the first). In your example, that would be order_count, gmv, net_gmv

2. Separately, specify the tables that you want to generate, where each table spec consists of:

  - the bucket logic (in your example, that would be a formula representing "age broken out in buckets of 10")

  - optional other characteristics such as whether the data should be filtered before going into the table, which column the table should be sorted on, whether to limit the table output to the "top [10]" rows, etc etc

3. Third, run your data, plus the above spec, through some software that will generate your report for you

As for part 3, my company has recently launched a free platform for doing all of the above in a collaborative and secure manner. Please reach out if you'd like more info on this. Of course, you can do it yourself or have your IT do it-- but be aware it is not as easy as it sounds when you start having to deal with real-world practicalities like schema variability and scalability. And anyway, why bother if you can now do it all for free?

mohon|3 years ago

Thanks for the SEC reference!

Regarding the 1st point, let say I already have the process implemented in place. If I want to add new common column, is there a better way to easily add new common metrics/dimension without doing backfill?

hodgesrm|3 years ago

Hi,

You don't mention how much data you have, what the arrival rate is, and how long you would keep it.

You did mention you are familiar with ClickHouse. For datasets under a few billion rows you don't need any special materials views for specific dimensions. Put everything in a single table with columns for all variants and an identifying column for the type of data. Just ensure you make good choices on datatypes, compression, and primary key/sort order. In this case you can then just apply more compute to get good results.

ClickHouse can handle 1K columns without much trouble.

edit: clarify use of a single table.

mohon|3 years ago

Yes, we've done some benchmarks using Clickhouse with the same design that you just mentioned (1 single table that contains all relevant dimensions and metrics)

In our benchmark, we tried aggregating around 1 billion rows of raw data (2 months data) using count exact distinct -> could achieve around 50-60 seconds. If we use the HLL, the query can be finished around 20-30 seconds.

For the retention part, we're planning to keep it 1 year of data, so around 6 billion rows.

blakeburch|3 years ago

If I'm understanding your question correctly, it sounds like you're just generating cache tables for each variant of the root table with a different set of group bys (although I'm curious why you wouldn't want live views instead). I would recommend a python script that accepts a list of group by parameters. When executed, it loops through the list of parameters, generates and executes the DML with those parameters, then runs a generated query to store the results in the new table.

Feel free to message me separately if you want to dive into specifics.

mohon|3 years ago

Exactly. The reason I dont want to create a view is the view will incurr a lot of time being spent to just querying the same thing, making the user wait for it to finish (all of our tables are in Hive). If I use a materialized view or another table, the user just need to do 'select * from xxx' and get the report immediately.

I've experimented with OLAP engine such as Clickhouse and its view, so far looks good but need a lot of investment upfront to maintain it

I guess the way you mentioned is the only way we can do and we'll try to optimize from that

Cheers!

codevark|3 years ago

First World Problem. Help me find my webserver. I know it's in my basement.