
Pg_lake: Postgres with Iceberg and data lake access

371 points | plaur782 | 3 months ago | github.com

118 comments


boshomi|3 months ago

Why not just use Ducklake?[1] That reduces complexity[2] since only DuckDB and PostgreSQL with pg_duckdb are required.

[1] https://ducklake.select/

[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o

mslot|3 months ago

DuckLake is pretty cool, and we obviously love everything the DuckDB team is doing. It's what made pg_lake possible, and what motivated part of our team to step away from Microsoft/Citus.

DuckLake can do things that pg_lake cannot do with Iceberg, and DuckDB can do things Postgres absolutely can't (e.g. query data frames). On the other hand, Postgres can do a lot of things that DuckDB cannot do. For instance, it can handle >100k single row inserts/sec.

Transactions don't come for free. Embedding the engine in the catalog rather than the catalog in the engine enables transactions across analytical and operational tables. That way you can do a very high rate of writes in a heap table, and transactionally move data into an Iceberg table.

Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
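A minimal sketch of what that orchestration could look like, assuming pg_cron is installed and using made-up table names (`events_hot` as the heap table, `events_iceberg` as the pg_lake Iceberg table):

```sql
-- Hypothetical sketch: every hour, transactionally move rows older
-- than a day from a heap table into an Iceberg table. The DELETE and
-- INSERT run in one transaction, so rows are never lost or duplicated.
SELECT cron.schedule('offload-events', '5 * * * *', $$
  WITH moved AS (
    DELETE FROM events_hot
    WHERE created_at < now() - interval '1 day'
    RETURNING *
  )
  INSERT INTO events_iceberg SELECT * FROM moved;
$$);
```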

There's also the interoperability aspect of Iceberg being supported by other query engines.

swasheck|3 months ago

i have so desperately wanted to love and use ducklake, but have come across some issues with it in practice (pg catalog). they seem to have to do with maintenance activities and ducklake suddenly throwing http/400 errors on files it created. i'm not sure if it's due to my write patterns into the partitioned tables (gather data from sources into a polars dataframe and insert into the ducklake table from the df) or something else.

it’s ok in dev/test and for me as the person in the team who’s enamored with duckdb, but it’s made the team experience challenging and so i’ve just kinda reverted to hive partitioned parquet files with a duckdb file that has views created on top of the parquet. attach that file as read only and query away.

i may work up a full example to submit as an issue but up until now too many other things are dominating my time.

ozgune|3 months ago

This is huge!

When people ask me what’s missing in the Postgres market, I used to tell them “open source Snowflake.”

Crunchy’s Postgres extension is by far the furthest-ahead solution in the market.

Huge congrats to Snowflake and the Crunchy team on open sourcing this.

gigatexal|3 months ago

Honestly, just pay Snowflake for the amazing DB and ecosystem it is, and then go build cool stuff. Unless your value add to customers is infra, let them handle all that.

anentropic|3 months ago

When Snowflake bought Crunchy Data I was hoping they were going to offer a managed version of this

It's great that I can run this locally in a Docker container, I'd love to be able to run a managed instance on AWS billed through our existing Snowflake account

gajus|3 months ago

Man, we are living in the golden era of PostgreSQL.

NeutralCrane|3 months ago

I’m not a data engineer but work in an adjacent role. Is there anyone here who could dumb the use case down? Maybe an example of a problem this solves. I am struggling to understand the value proposition here.

ggregoire|3 months ago

> Maybe an example of a problem this solves.

Some service writes a lot of data in parquet files stored on S3 (e.g. logs), and now you want that data to be queryable from your application as if it was in postgres (e.g. near real-time analytics dashboard). pg_lake allows you to load these parquet files into postgres and query the data. You can also join that data with existing tables in postgres.
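A rough sketch of that workflow, reusing the schema-inference syntax shown elsewhere in the thread (bucket path, table, and column names here are made up for illustration):

```sql
-- Create an Iceberg table whose schema is inferred from an existing
-- parquet file on S3 (hypothetical path), then join it against an
-- ordinary Postgres heap table.
CREATE TABLE service_logs ()
USING iceberg
WITH (definition_from = 's3://my-bucket/logs/service_logs.parquet');

SELECT u.name, count(*)
FROM service_logs l
JOIN users u ON u.id = l.user_id
GROUP BY u.name;
```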

ayhanfuat|3 months ago

With S3 Table Buckets, Cloudflare R2 Data Catalog and now this, Iceberg seems to be winning.

fifilura|3 months ago

How do you use your data lake? For me it is much more than just storing data, it is just as much for crunching numbers in unpredictable ways.

And this is where postgres does not cut it.

You need some more CPU and RAM than what you pay for in your postgres instance. I.e. a distributed engine where you don't have to worry about how big your database instance is today.

wodenokoto|3 months ago

The point of a data lake is to separate compute and storage. Postgres isn’t a compute layer; it’s an access layer.

Your compute asks Postgres “what is the current data for these keys?” or “what was the data as of two weeks ago for these keys?”, and then downloads the relevant parquet files and runs the analytics query itself.

hamasho|3 months ago

I like data lakes and their SQL-like query languages. They feel like an advanced version of the "Everything is a file" philosophy.

Under "Everything is a file", you can read or manipulate a wide variety of information via simple open/read/write() APIs. Linux exposes system settings via the filesystem. Get the screen brightness? `cat /sys/class/backlight/device0/brightness`. Update it? `echo 500 > /sys/class/backlight/device0/brightness`. No need for special APIs: just generic file operations, and the kernel handles everything.

FUSE (Filesystem in Userspace) provides even more flexibility by letting user-space programs implement their own drivers that handle any data operation via the filesystem. You can mount remote systems (via SSH) or Google Drive, and copying files is as easy as `cp /mnt/remote/data/origin /mnt/googledrive/data/`. Or, with niche FUSE filesystems like pgfuse and redisfs, updating a Redis value from Postgres data is just `cat /mnt/postgres/users/100/full_name > /mnt/redis/user_100_full_name`.

But filesystems are only good for hierarchical data, while a lot of real-world data is relational. Many FUSE projects try hard to represent inherently non-hierarchical data in a filesystem. Data lakes let you use SQL, the elegant abstraction for relational data, across different underlying data sources, even when those sources are physically distant and differently structured. A lot of real-world applications are just CRUD on relational data; you can accomplish much more, much more easily, if all that data behaves like one big database.

dharbin|3 months ago

Why would Snowflake develop and release this? Doesn't this cannibalize their main product?

barrrrald|3 months ago

One thing I admire about Snowflake is a real commitment to self-cannibalization. They were super out front with Iceberg even though it could disrupt them, because that's what customers were asking for and they're willing to bet they'll figure out how to make money in that new world

Video of their SVP of Product talking about it here: https://youtu.be/PERZMGLhnF8?si=DjS_OgbNeDpvLA04&t=1195

kentm|3 months ago

It's not going to scale as well as Snowflake, but it gets you into an Iceberg ecosystem which Snowflake can ingest and process at scale. Analytical data systems are typically trending toward heterogeneous compute with a shared storage backend -- you have large, autoscaling systems to process the raw data down to something usable by a smaller, cheaper query engine supporting UIs/services.

mslot|3 months ago

When we first developed pg_lake at Crunchy Data and defined GTM, we considered whether it could be a Snowflake competitor, but we quickly realised that it did not make sense.

Data platforms like Snowflake are built as a central place to collect your organisation's data, do governance, large scale analytics, AI model training and inference, share data within and across orgs, build and deploy data products, etc. These are not jobs for a Postgres server.

Pg_lake foremost targets Postgres users who currently need complex ETL pipelines to get data in and out of Postgres, and accidental Postgres data warehouses where you ended up overloading your server with slow analytical queries, but you still want to keep using Postgres.

999900000999|3 months ago

It'll probably be really difficult to set up.

If it's anything like Supabase, you'll question the existence of God when trying to get it to work properly.

You pay them to make it work right.

enether|3 months ago

Do I understand it correctly that DuckDB runs embedded on the machine running Postgres (i.e. through the extension), and this limits query processing to whatever that machine can comfortably handle?

What are the deployment implications if one wants to integrate this in production? Surely they'd need a much larger Postgres machine at a minimum.

Is there concern re: "hot neighbour" problems if the DuckDB queries get too heavy? How is that sort of issue handled? I understood from another comment that DuckDB runs in a separate process, so there is room to potentially throttle it.

darth_avocado|3 months ago

This is so cool! We have files in Iceberg that we then move data to/from to a PG db using a custom utility. It always felt more like a workaround that didn’t fully use the capabilities of both the technologies. Can’t wait to try this out.

max_streese|3 months ago

Two questions:

(1) Are there any plans to make this compatible with the ducklake specification? Meaning: Instead of using Iceberg in the background, you would use ducklake with its SQL tables? My knowledge is very limited but to me, besides leveraging duckdb, another big point of ducklake is that it's using SQL for the catalog stuff instead of a confusing mixture of files, thereby offering a bunch of advantages like not having to care about number of snapshots and better concurrent writes.

(2) Might it be possible that pg_duckdb will achieve the same thing in some time or do things not work like that?

mslot|3 months ago

(1) We've thought about it, no current plans. We'd ideally reimplement DuckLake in Postgres directly such that we can preserve Postgres transaction boundaries, rather than reuse the Ducklake implementation that would run in a separate process. The double-edged sword is that there's a bunch of complexity around things like inlined data and passing the inlined data into DuckDB at query time, though if we can do that then you can get pretty high transaction performance.

(2) In principle, it's a bit easier for pg_duckdb to reuse the existing Ducklake implementation, because DuckDB sits in every Postgres process and they can call into each other, but we feel that architecture is less appropriate in terms of resource management and stability.

oulipo2|3 months ago

Interesting! How does it compare with ducklake?

mslot|3 months ago

You could say:

With DuckLake, the query frontend and query engine are DuckDB, and Postgres is used as a catalog in the background.

With pg_lake, the query frontend and catalog are Postgres, and DuckDB is used as a query engine in the background.

Of course, they also use different table formats (though similar in data layer) with different pros and cons, and the query frontends differ in significant ways.

An interesting thing about pg_lake is that it is effectively standalone, no external catalog required. You can point Spark et al. directly to Postgres with pg_lake by using the Iceberg JDBC driver.

dkdcio|3 months ago

I was going to ask if you could then put DuckDB over Postgres for the OLAP query engine -- looks like that's already what it does! very interesting development in the data lake space alongside DuckLake and things

pgguru|3 months ago

You create foreign tables in postgres using either the pg_lake_table wrapper or pg_lake_iceberg.

Once those tables exist, queries against them either push down entirely to the remote tables, using a Custom Scan to execute and pull results back into postgres, or we transform/extract the pieces that can be executed remotely via the FDW and treat that as a tuple source.

In both cases, the user does not need to know any of the details and just runs queries inside postgres as they always have.

whalesalad|3 months ago

RDS really needs to make it easy to install your own PG modules.

harisund1990|3 months ago

This is cool to see! Looks like it competes with pg_mooncake, which Databricks acquired. But how is this different from pg_duckdb?

lysecret|3 months ago

A use case I see for this personally: I have encountered a lot of “hot cache for some time, then offload for historical queries” scenarios, which I have built by hand multiple times. This should be a great fit, e.g. write to Postgres, then periodically offload to the lakehouse, and even query the two together (if needed). Very cool!
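For that pattern, a sketch of the "query together" part, with hypothetical table names (`events_hot` for the live heap table, `events_cold_iceberg` for the offloaded pg_lake table):

```sql
-- One view over hot and offloaded data, so the application queries a
-- single relation regardless of where each row currently lives.
CREATE VIEW events_all AS
SELECT * FROM events_hot
UNION ALL
SELECT * FROM events_cold_iceberg;
```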

spenczar5|3 months ago

Very cool. One question that comes up for me is whether pg_lake expects to control the Iceberg metadata, or whether it can be used purely as a read layer. If I make schema updates and partition changes to iceberg directly, without going through pg_lake, will pg_lake's catalog correctly reflect things right away?

mslot|3 months ago

You can use it as a read layer for a specific metadata JSON URL or a table in a REST catalog. The latter was merged quite recently and is not yet in the docs.

drchaim|3 months ago

More integrations are great. Anyway, the "this is awesome" moment (for me) will be when you could mix row- and column-oriented tables in Postgres, a bit like Timescale but native Postgres and well done. Hopefully one day.

gregw2|3 months ago

I want MPP HTAP where SQL inserts/COPYs store data in three(!) formats:

- row-based (low latency insert, fast row-based indexed query for single-row OLTP)

- columnar-based (slow inserts/updates, fast aggregates/projections)

- iceberg-columnar-based (better OLAP price/performance and less lockin than native columnar)

And for SELECTs the query engine picks which storage engine satisfies the query, using some SQL extension like DB2 "WAITFORDATA", TiDB @@tidb_read_staleness, or MemSQL columnstore_latency, and/or similar signalling for performance-vs-cost preference.

And a common permissioning/datasharing layer so I can share data to external and internal parties who can in turn bring their own compute to make their own latency choices.

pgguru|3 months ago

Hypertables definitely had the array columns auto-expanding with the custom node type. Not sure what else it would look like for what you describe.

That said, don't sleep on the "this is awesome" parts in this project... my personal favorite is the automatic schema detection:

```
CREATE TABLE my_iceberg_table ()
USING iceberg
WITH (definition_from = 's3://bucket/source_data.parquet');
```

lysecret|3 months ago

Nice, does this also allow me to write to parquet from my Postgres table?

mslot|3 months ago

Yes, just COPY table TO 's3://mybucket/data.parquet'

Or COPY table TO STDOUT WITH (format 'parquet') if you need it on the client side.

lysecret|3 months ago

Omg yes, it works. This would have made my past job so much easier.

pjd7|3 months ago

This is awesome, I will be trying this out in the coming months. It's just made it to the top of my R&D shortlist for things that could massively simplify our data stack for a B2B SaaS.

inglor|3 months ago

This is really nice, though looking at the code, a lot of the postgres types are missing, as well as a lot of the newer parquet logical types. But this is a great start and a nice use of FDW.

pgguru|3 months ago

Hi, what types are you expecting to see that aren't supported? I believe we had support for most/all builtin postgres types.

inglor|3 months ago

Also, any planned support for more catalogs?

fridder|3 months ago

I love this. There are definitely shops where the data is a bit too much for postgres but something like Snowflake would be overkill. Wish this was around a couple years ago lol

apexalpha|3 months ago

I’m not super into the Data sphere but my company relies heavily on Snowflake which is becoming an issue.

This announcement seems huge to me, no?!

Is this really an open source Snowflake covering most use cases?

taude|3 months ago

there's also plenty of other options for warehouse/compute processing of iceberg data storage.

I think this is a pretty big deal, though.

Snowflake does a lot more, though, especially around sharing data across company boundaries.

iamcreasy|3 months ago

Very cool! Was there any inherent limitation with postgresql or its extension system that forced pg_lake to use duckdb as query engine?

pgguru|3 months ago

DuckDB provided a lot of infrastructure for reading/writing parquet files and other common formats here. It is also inherently multi-threaded and supports being embedded in a larger program (similar to SQLite), which made it a good basis for something that works outside the traditional process model of Postgres.

Additionally, the postgres extension system supports most of the current project, so I wouldn't say it was forced in this case; it was a design decision. :)

claudeomusic|3 months ago

Can someone dumb this down a bit for a non data-engineer? Hard to fully wrap my head around who this is/isn’t best suited for.

lysecret|3 months ago

One use case we have (we built it ourselves) is to periodically offload data from Postgres to lakehouse-partitioned data on GCS. The way I see it, this can now be done with a single query. Another one is the other way around: use Postgres as a query engine, or merge offloaded data with your live data.

mberning|3 months ago

Does anyone know how access control works to the underlying s3 objects? I didn’t see anything regarding grants in the docs.

pgguru|3 months ago

Hi, one of the developers here. You define credentials that can access the S3 buckets and use those as DuckDB secrets, usually in an init script for pgduck_server. (You can see some examples of this in the testing framework.)
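For reference, DuckDB's secret syntax for S3 looks roughly like this, as it might appear in such an init script (the secret name and credential values here are placeholders, not from the pg_lake docs):

```sql
-- DuckDB secret granting pgduck_server access to an S3 bucket.
CREATE SECRET lake_s3 (
    TYPE S3,
    KEY_ID 'my-access-key-id',
    SECRET 'my-secret-access-key',
    REGION 'us-east-1'
);
```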

I'll see if we can improve the docs or highlight that part better, if it is already documented—we did move some things around prior to release.

mslot|3 months ago

There are Postgres roles for read/write access to the S3 objects that DuckDB has access to. Those roles can create tables from specific files or at specific locations, and can then assign more fine-grained privileges to other Postgres roles (e.g. read access on a specific view or table).
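Since these are ordinary Postgres objects, the fine-grained part is standard SQL grants; a sketch with hypothetical role and table names:

```sql
-- Read-only role that can see one lake-backed table but nothing else.
CREATE ROLE analyst NOLOGIN;
GRANT SELECT ON service_logs TO analyst;
```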

chaps|3 months ago

I love postgres and have created my own "data lake" sorta systems -- what would this add to my workflows?

iamcreasy|3 months ago

If anyone from Supabase is reading, it would be awesome to have this extension!

logicartisan|3 months ago

It’s amazing to see Postgres growing into something this powerful

scirob|3 months ago

Crunchydata did it first :) but nice to get more options

mslot|3 months ago

It's the same team and same project :). Crunchy Data was acquired by Snowflake.

beoberha|3 months ago

Curious why pgduck_server is a totally separate process?

rmnclmnt|3 months ago

The README explains it:

> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.

pgguru|3 months ago

Beyond what has been pointed out from the README, also:

- Separation of concerns: with a single external process we can share object store caches without complicated locking dances between multiple processes.

- Memory limits are easier to reason about with a single external process.

- Postgres backends end up being more robust, as you can restart the pgduck_server process separately.

dkdcio|3 months ago

from the README:

> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.

CJlll|3 months ago

[deleted]

hexo|3 months ago

Oh, data lakes. The most ridiculous idea in data processing, right after data frames in Python.

We had this discussion like a week ago, about how stupid it is to use a filesystem for this kind of data storage, and here we go again. Actually, I had to implement this "idea" in practice. What nonsense.