item 44894253

oconnore | 6 months ago

This is a lot of fuss when you can get a batch update to stay within a few minutes of latency. You only have this problem if you are very insistent on both (1) very near real-time, and (2) Iceberg. And you can't go down this path if you require transactional queries.

I think most people who need very near real-time queries also tend to need them to be transactional. The use case where you can accept inconsistent reads but something will break if you're 3 minutes out of date is very rare.
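A minimal sketch of what such a micro-batch sync might look like, assuming the source table carries an `updated_at` column (illustrative schema and names, not tied to any specific CDC tool; sqlite3 stands in for both ends):

```python
import sqlite3

def batch_sync(src: sqlite3.Connection, dst: sqlite3.Connection,
               watermark: int) -> int:
    """Copy rows modified since `watermark` to the sink; return new watermark."""
    # Pull only the rows changed since the last run.
    rows = src.execute(
        "SELECT id, val, updated_at FROM t WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # Apply them as upserts in one sink transaction.
    with dst:
        dst.executemany("INSERT OR REPLACE INTO t VALUES (?, ?, ?)", rows)
    # Advance the watermark to the newest change we saw.
    return max((r[2] for r in rows), default=watermark)
```

Note that a plain `updated_at` watermark never observes deletes, which is exactly the deletion-record complication raised elsewhere in the thread.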

amluto | 6 months ago

What do you mean “transactional”? Do you mean that a reader never sees a state in Iceberg that is not consistent with a state that could have been seen in a successful transaction in Postgres? If so, that seems fairly straightforward: have the CDC process read from Postgres in a transaction and write to Iceberg in a transaction. (And never do anything in the Postgres transaction that could cause it to fail to commit.)
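That read-in-a-transaction, write-in-a-transaction pattern could be sketched like this, with sqlite3 standing in for both ends (illustrative only; a real pipeline would read a Postgres snapshot and commit through Iceberg's table API):

```python
import sqlite3

def sync(src: sqlite3.Connection, dst: sqlite3.Connection) -> None:
    # Read the source inside one transaction, so the copied rows reflect
    # a single committed state rather than a half-applied write.
    with src:
        rows = src.execute("SELECT id, val FROM t ORDER BY id").fetchall()
    # Write the sink inside one transaction too: a reader of dst sees
    # either the old snapshot or the new one, never something in between.
    with dst:
        dst.execute("DELETE FROM t")
        dst.executemany("INSERT INTO t VALUES (?, ?)", rows)
```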

But the 3 minute thing seems somewhat immaterial to me. If I have a table with one billion rows, and I run an every-three-minute batch job that needs to sync an average of one modified row to Iceberg, that job still needs to write the correct deletion record to Iceberg. If there’s no index, then either the job writes a delete-by-key or it needs to scan 1B Iceberg rows. Sure, that’s doable in 3 minutes, but it’s far from free.
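The delete-by-key alternative is essentially what Iceberg's equality deletes do: record the deleted key in a small side file instead of scanning or rewriting the base data. A toy model of the idea (`Table` and `delete_by_key` are illustrative names, not pyiceberg's API):

```python
class Table:
    """Toy model of a table with Iceberg-style equality deletes."""

    def __init__(self, rows: dict):
        self.base = dict(rows)   # the immutable base data files
        self.deleted = set()     # an equality-delete "file": just keys

    def delete_by_key(self, key) -> None:
        # O(1): append the key to the delete set; no scan of base rows.
        self.deleted.add(key)

    def scan(self) -> dict:
        # Readers subtract the delete keys at query time.
        return {k: v for k, v in self.base.items() if k not in self.deleted}
```

The trade-off: writes stay cheap, but readers pay the subtraction cost on every scan until a compaction rewrites the base files.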

amluto | 6 months ago

> This is a lot of fuss when you can get a batch update to stay within a few minutes of latency.

Replying again to add: cost. Just because you can do a batch update every few minutes by doing a full scan of the primary key column of your Iceberg table and joining against your list of modified or deleted primary keys does not mean you should. That table scan costs actual money if the Iceberg table is hosted somewhere like AWS or uses a provider like Databricks, and running a full column scan every three minutes could be quite pricey.
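A rough back-of-envelope for that cost, where every figure is an assumption for illustration (1B rows, 8-byte keys, $5 per TB scanned — a typical order of magnitude for managed query engines; check your provider's actual pricing):

```python
rows = 1_000_000_000
bytes_per_key = 8                  # assumed key width
scans_per_day = 24 * 60 // 3       # one full key-column scan every 3 minutes

scan_bytes = rows * bytes_per_key  # ~8 GB scanned per run
daily_tb = scan_bytes * scans_per_day / 1e12
daily_cost = daily_tb * 5.0        # $/day at an assumed $5/TB scanned

print(f"{daily_tb:.2f} TB/day, ~${daily_cost:.0f}/day")
# prints: 3.84 TB/day, ~$19/day
```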