top | item 40348057

retakeming | 1 year ago

Whereas pg_analytics stores the data in Postgres block storage, pg_lakehouse does not use Postgres storage at all.

This makes it a much simpler (and in our opinion, more elegant) extension. We learned that many of our users already stored their Parquet files in S3, so it made sense to connect directly to S3 rather than asking them to ingest those Parquet files into Postgres.
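Mechanically, this is the foreign data wrapper (FDW) pattern: Postgres plans and executes the query, but the bytes stay in object storage. A minimal sketch of what that looks like to a user (the wrapper, handler, table, and option names below are illustrative assumptions, not pg_lakehouse's documented API):

```sql
-- Illustrative only: identifiers and options here are assumptions,
-- not pg_lakehouse's exact interface.
CREATE EXTENSION pg_lakehouse;

CREATE FOREIGN DATA WRAPPER parquet_wrapper
    HANDLER parquet_fdw_handler
    VALIDATOR parquet_fdw_validator;

-- A server object pointing at object storage; S3 credentials
-- would be configured separately.
CREATE SERVER parquet_server
    FOREIGN DATA WRAPPER parquet_wrapper;

-- Expose a Parquet file in S3 as a regular-looking table.
CREATE FOREIGN TABLE trips (
    vendor_id   INT,
    fare_amount NUMERIC
)
SERVER parquet_server
OPTIONS (files 's3://my-bucket/trips.parquet');

SELECT COUNT(*) FROM trips;
```

The key property is that the Parquet data never lands in Postgres block storage; scans run directly against the files in S3.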

It also accelerates the path to production readiness, since we're not touching Postgres internals (no need to deal with Postgres MVCC, write-ahead logs, transactions, etc.).

nitinreddy88 | 1 year ago

If users already have a datalake-style system that generates Parquet files, the case for using Postgres to query that data is questionable. I think prioritising the Postgres way of doing things is what would keep your product in a unique position.

epsilonic | 1 year ago

Can you elaborate on what you mean by the "Postgres way of doing things"? Also, what is wrong with using Postgres to query data in external object stores? It is a common occurrence for businesses to store parquet artefacts in object storage, and querying them is often desirable.

philippemnoel | 1 year ago

It depends. If you're happy with Databricks or similar tools, you might be fine. But we've seen many users who want the simplicity of querying data from Postgres for analytics, especially when JOINing analytical and transactional data.
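That last point is the concrete draw: external analytical data and local transactional data can meet in a single query. A hedged sketch, assuming a foreign table `trips` backed by Parquet in S3 and an ordinary Postgres heap table `customers` (both table names and columns are hypothetical):

```sql
-- trips: foreign table over Parquet in S3 (hypothetical).
-- customers: a normal transactional Postgres table (hypothetical).
SELECT c.name, COUNT(*) AS trip_count
FROM trips t
JOIN customers c ON c.id = t.customer_id
GROUP BY c.name
ORDER BY trip_count DESC
LIMIT 10;
```

Doing this with a separate warehouse would typically require an ETL pipeline to move one side of the join; here both sides are visible to the same planner.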