(no title)
oconnore | 6 months ago
I think most people who need very near real-time queries also tend to need them to be transactional. The use case where you can accept inconsistent reads but something will break if you're 3 minutes out of date is very rare.
amluto | 6 months ago
But the 3-minute thing seems somewhat immaterial to me. If I have a table with one billion rows, and I run an every-three-minute batch job that needs to sync an average of one modified row to Iceberg, that job still needs to write the correct deletion record to Iceberg. If there’s no index, then either the job writes a delete-by-key or the job needs to scan 1B Iceberg rows. Sure, that’s doable in 3 minutes, but it’s far from free.
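To make the two options concrete, here is a toy sketch. It is not Iceberg's actual API: the table is modeled as a plain dict of file name to primary keys, and `equality_delete` and `positional_delete` are hypothetical helpers named only for illustration. The point is just that a delete-by-key reads nothing, while finding a row's position without an index touches every key.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a table: each "data file" is just a list of primary keys.
# (Assumption for illustration; real Iceberg data files are Parquet/ORC/Avro.)
DataFiles = Dict[str, List[int]]


def equality_delete(modified_keys: List[int]) -> List[dict]:
    """Delete-by-key: record only the keys. No existing rows are read."""
    return [{"pk": key} for key in modified_keys]


def positional_delete(
    files: DataFiles, modified_keys: List[int]
) -> Tuple[List[Tuple[str, int]], int]:
    """Find each row's (file, position) with no index: scan every key.

    Returns the delete records and the number of rows scanned.
    """
    wanted = set(modified_keys)
    hits: List[Tuple[str, int]] = []
    rows_scanned = 0
    for path, keys in files.items():
        for pos, key in enumerate(keys):
            rows_scanned += 1
            if key in wanted:
                hits.append((path, pos))
    return hits, rows_scanned


if __name__ == "__main__":
    files = {"a.parquet": [1, 2, 3], "b.parquet": [4, 5, 6]}
    print(equality_delete([5]))
    print(positional_delete(files, [5]))
```

With one modified row, the by-key path writes one record and reads nothing; the positional path also writes one record, but only after scanning every row in the table.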
amluto | 6 months ago
Replying again to add: cost. Just because you can do a batch update every few minutes by doing a full scan of the primary key column of your Iceberg table and joining against your list of modified or deleted primary keys does not mean you should. That table scan costs actual money if the Iceberg table is hosted somewhere like AWS or queried through a provider like Databricks, and running a full column scan every three minutes could be quite pricey.
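A back-of-envelope sketch of that scan volume, under loudly stated assumptions: 8-byte primary keys, no compression, and a per-TB price that is a placeholder parameter rather than any provider's real rate. The function names are made up for this example.

```python
def scans_per_day(interval_minutes: float) -> float:
    """Number of batch runs per day at a fixed interval."""
    return 24 * 60 / interval_minutes


def daily_scan_bytes(rows: int, bytes_per_key: int, interval_minutes: float) -> float:
    """Bytes read per day if every run scans the whole key column.

    Assumes uncompressed, fixed-width keys (an assumption; real columnar
    files are compressed, so actual bytes read would differ).
    """
    return rows * bytes_per_key * scans_per_day(interval_minutes)


def daily_scan_cost(
    rows: int, bytes_per_key: int, interval_minutes: float, price_per_tb: float
) -> float:
    """Daily cost given a placeholder price per TB (10**12 bytes) scanned."""
    return daily_scan_bytes(rows, bytes_per_key, interval_minutes) / 1e12 * price_per_tb


if __name__ == "__main__":
    rows = 1_000_000_000
    print(scans_per_day(3))
    print(daily_scan_bytes(rows, 8, 3) / 1e12, "TB/day")
```

Under those assumptions, a three-minute interval means 480 scans a day of about 8 GB each, so a few TB of reads per day to propagate what might be a single changed row per run. Plug in whatever your provider actually charges for `price_per_tb`.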