top | item 40618639


xiasongh | 1 year ago

How does this compare to pg_analytics?

https://github.com/paradedb/pg_analytics


mildbyte | 1 year ago

Another difference is that this solution uses parquet_fdw, which handles fast scans through Parquet files and filter pushdown via row-group pruning, but doesn't vectorize the group-by / join operations above the table scan in the query tree, so you're still going through the row-by-row PG query executor in the end.
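The row-group pruning mentioned above can be sketched like this: each Parquet row group carries min/max statistics per column, so groups whose value range provably cannot match the predicate are skipped without being read. This is a minimal illustrative sketch of the idea, not parquet_fdw's actual API; all names here are made up.

```python
# Row-group pruning sketch: skip any group whose [min, max] range
# for the filtered column cannot overlap the predicate range.

def prune_row_groups(row_groups, column, lo, hi):
    """Keep only row groups whose min/max stats for `column`
    overlap the predicate `lo <= column <= hi`."""
    kept = []
    for rg in row_groups:
        rg_min, rg_max = rg["stats"][column]
        if rg_max < lo or rg_min > hi:
            continue  # whole group provably excluded; never read from disk
        kept.append(rg)
    return kept

# Three row groups with min/max stats for a hypothetical "amount" column.
groups = [
    {"id": 0, "stats": {"amount": (0, 99)}},
    {"id": 1, "stats": {"amount": (100, 199)}},
    {"id": 2, "stats": {"amount": (200, 299)}},
]

# Predicate: amount BETWEEN 150 AND 250 -> only groups 1 and 2 get scanned.
print([g["id"] for g in prune_row_groups(groups, "amount", 150, 250)])  # [1, 2]
```

The pushdown only helps the scan, though; everything above it in the plan still runs tuple-at-a-time.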

pg_analytics uses DataFusion (a dedicated analytical query engine) to run the entire query, which can achieve orders-of-magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).
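The difference between the two execution models can be shown with a toy aggregate: a Volcano-style executor pulls one tuple per iteration and pays interpreter overhead per row, while a vectorized engine works on whole column batches in a tight loop. This is purely illustrative and is not DataFusion's or Postgres's actual code.

```python
# Row-at-a-time vs vectorized aggregation, in miniature.

def rowwise_sum(rows):
    # Volcano-style: one tuple per iteration, per-row dispatch overhead.
    total = 0
    for row in rows:
        total += row["amount"]
    return total

def vectorized_sum(amount_column):
    # Columnar: one tight loop over a contiguous array; a real engine
    # would also get SIMD and cache-friendly memory access here.
    return sum(amount_column)

rows = [{"amount": i} for i in range(1000)]
col = [r["amount"] for r in rows]  # the columnar representation of the same data
assert rowwise_sum(rows) == vectorized_sum(col) == 499500
```

Both produce the same answer; the win is constant-factor (often large) from batching, not algorithmic.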

wanderinglight | 1 year ago

This is definitely something I intend to fix.

My initial intent was to use DuckDB for fast vectorized query execution, but I wasn't able to create a planner / execution hook that uses DuckDB internally. I'll definitely check out pg_analytics / DataFusion to see if the same can be integrated here as well. Thanks for the pointers.

wanderinglight | 1 year ago

I looked into pg_analytics and some other solutions like Citus before working on pg_analytica.

The key difference is that solutions like pg_analytics completely swap out the native Postgres row-based storage for columnar storage.

Not using the default Postgres storage engine means you miss out on a lot of battle-tested functionality like updating existing rows, deleting rows, transactional updates, etc. Columnar stores are not well suited for transactions, updates, and deletes.

pg_analytica retains the existing Postgres storage and only exports a time-delayed version of the table in columnar format. This way developers get the benefits of transactional storage and fast analytics queries.
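The approach described above can be sketched as: the row store stays authoritative for transactions, and a periodic job pivots a snapshot into columnar form for analytics, so the columnar copy lags the row store by up to one export interval. A minimal sketch of that idea; the dict-based "stores" and function names are illustrative stand-ins, not the extension's actual interfaces.

```python
# Time-delayed columnar export: OLTP writes hit the row store,
# analytics reads the last exported columnar snapshot.

row_store = []      # authoritative, transactional (stands in for the PG heap)
columnar_copy = {}  # time-delayed export (stands in for columnar files)

def insert(row):
    # OLTP writes touch the row store only; no columnar bookkeeping.
    row_store.append(row)

def export_snapshot():
    # Periodic job: rebuild the columnar copy from the current row store.
    # Analytics queries between runs see data as of the last export.
    global columnar_copy
    columnar_copy = {
        key: [r[key] for r in row_store]  # pivot rows into columns
        for key in (row_store[0] if row_store else {})
    }

insert({"id": 1, "amount": 10})
insert({"id": 2, "amount": 20})
export_snapshot()
insert({"id": 3, "amount": 30})  # not yet visible to analytics

print(columnar_copy["amount"])  # lags the row store: [10, 20]
```

The trade-off is staleness: updates and deletes stay cheap and transactional in the row store, at the cost of analytics seeing data only as fresh as the last export.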