Another difference is that this solution uses parquet_fdw, which handles fast scans through Parquet files and filter pushdown via row group pruning, but doesn't vectorize the groupby / join operations above the table scan in the query tree (so you're still using the row-by-row PG query executor in the end).
pg_analytics uses DataFusion (dedicated analytical query engine) to run the entire query, which can achieve orders of magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).
My initial intent was to use duckdb for fast vectored query execution but I wasn't able to create a planner / execution hook that uses duckdb internally.
Will definitely checkout pg_analytics / Datafusion to see if the same can be integrated here as well.
Thanks for the pointers.
I looked into pg_analytics and some other solution like Citus before working on pg_analytica.
The key difference is solutions like pg_analytics completely swap out the native postgres row based storage for columnar storage.
Not using the default postgres storage engine means you miss outon a lot of battle tested, functionality like updating existing rows, deleting rows, transactional updates etc.
Columnar stores are not suited for transactions, updates and deletes.
pg_analytica retains the existing Postgres storage and only exports a time delayed version of the table in columnar format.
This way developers get the benefit of a transactional storage and fast analytics queries.
mildbyte|1 year ago
pg_analytics uses DataFusion (dedicated analytical query engine) to run the entire query, which can achieve orders of magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).
wanderinglight|1 year ago
My initial intent was to use duckdb for fast vectored query execution but I wasn't able to create a planner / execution hook that uses duckdb internally. Will definitely checkout pg_analytics / Datafusion to see if the same can be integrated here as well. Thanks for the pointers.
wanderinglight|1 year ago
The key difference is solutions like pg_analytics completely swap out the native postgres row based storage for columnar storage.
Not using the default postgres storage engine means you miss outon a lot of battle tested, functionality like updating existing rows, deleting rows, transactional updates etc. Columnar stores are not suited for transactions, updates and deletes.
pg_analytica retains the existing Postgres storage and only exports a time delayed version of the table in columnar format. This way developers get the benefit of a transactional storage and fast analytics queries.