item 42426602

Stream Processing with DuckDB/Polars?

4 points | Binomial-Dist | 1 year ago

I'm looking to do relatively simple streaming transformations on CDC data coming from Postgres. I'm dealing with a relatively small amount of data (thousands to tens of thousands of rows per minute, in most situations), which makes something like Flink + Kafka seem like overkill. I could engineer something custom that builds on Postgres logical replication, but I'd like something that plugs cleanly into DuckDB or Polars.

Doing some research, I'm a little surprised there isn't much out there aimed at these relatively simple single-node processing situations. There are some things (e.g. pg_replicate), but I'd like something more oriented around the Arrow data ecosystem. Curious whether anyone has managed to build anything custom here that worked well, or whether there are tools I'm missing.

3 comments


alclol | 1 year ago

Two ideas you might like:

Debezium + Arrow Flight: Use Debezium as a library to grab PostgreSQL CDC events and stream them into Arrow for super-fast, columnar processing. Works great with Polars or DuckDB.

RisingWave: This is a lightweight stream processor that connects directly to Postgres CDC, lets you write SQL for transformations, and keeps everything updated in real-time. No Kafka or heavy setups required.

gulcin_xata | 1 year ago

Have you seen pgstream? https://github.com/xataio/pgstream It is similar to pg_replicate and could be a good fit for streaming CDC data from Postgres. There is no built-in output plugin specifically for DuckDB, but it might help you build something lightweight and custom for your use case.