top | item 44108363

data_ders | 9 months ago

The manifesto [1] is the most interesting thing here. I agree that DuckDB has the greatest potential to disrupt the current order around Iceberg.

However, this mostly reads to me as a thought experiment: what if the backend service of an Iceberg catalog were just a SQL database?
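To make the thought experiment concrete, here's a minimal sketch using SQLite as a stand-in catalog. The schema (`snapshot`, `data_file`) is purely illustrative, not DuckLake's actual layout: the point is that table metadata becomes ordinary relational rows rather than JSON/Avro manifest files on S3.

```python
import sqlite3

# Hypothetical catalog schema -- illustrative only, not DuckLake's real one.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE snapshot (
    snapshot_id  INTEGER PRIMARY KEY,
    committed_at TEXT NOT NULL
);
CREATE TABLE data_file (
    snapshot_id INTEGER REFERENCES snapshot(snapshot_id),
    path        TEXT NOT NULL   -- Parquet file on object storage
);
""")

# Committing a new table version is just inserting catalog rows.
con.execute("INSERT INTO snapshot VALUES (1, '2025-01-01T00:00:00Z')")
con.executemany("INSERT INTO data_file VALUES (1, ?)",
                [('s3://bucket/part-0.parquet',),
                 ('s3://bucket/part-1.parquet',)])

# The current state of the lake is a query away -- no S3 listing needed.
latest = con.execute("SELECT MAX(snapshot_id) FROM snapshot").fetchone()[0]
files = [p for (p,) in con.execute(
    "SELECT path FROM data_file WHERE snapshot_id = ? ORDER BY path",
    (latest,))]
print(latest, files)
```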

The manifesto argues that maintaining a data lake catalog this way is easier, which I agree with in theory. S3-files-as-information-schema presents real challenges!

But what I most want to know is: what's the end-user benefit?

What does someone get with this if they're already using Apache Polaris or Lakekeeper as their Iceberg REST catalog?

[1]: https://ducklake.select/manifesto/

peterboncz | 9 months ago

https://x.com/peterabcz/status/1927402100922683628

It adds the following features to a data lake for users:

- multi-statement & multi-table transactions
- SQL views
- delta queries
- encryption
- low latency: no S3 metadata
- inlining: store small inserts in-catalog

and more!
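The first item falls out almost for free once the catalog is a SQL database: a commit touching metadata for several lake tables is just one ordinary transaction. A hedged sketch (hypothetical schema again, with SQLite standing in for the catalog):

```python
import sqlite3

# Illustrative catalog table: one row per data file per lake table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_file (tbl TEXT, snapshot_id INT, path TEXT)")

try:
    with con:  # BEGIN ... COMMIT, or ROLLBACK if an exception escapes
        con.execute(
            "INSERT INTO data_file VALUES ('orders',    2, 's3://b/orders-2.parquet')")
        con.execute(
            "INSERT INTO data_file VALUES ('customers', 2, 's3://b/customers-2.parquet')")
        raise RuntimeError("simulated failure mid-commit")
except RuntimeError:
    pass

# The failed multi-table commit left no partial state behind.
print(con.execute("SELECT COUNT(*) FROM data_file").fetchone()[0])  # 0
```

Either all the metadata rows for the multi-table commit land, or none do — readers never see a half-committed state.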

tishj | 9 months ago

One thing to add to this: snapshots can be retained (though rewritten) even through compaction.

When compaction deletes the build-up of many small add/delete files, in a format like Iceberg you would lose the ability to time-travel to those earlier states.

With DuckLake's ability to refer to parts of Parquet files, we can preserve the ability to time-travel even after deleting the old Parquet files.
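One way to picture "referring to parts of Parquet files": the catalog's file entries can carry a row range, so after compaction an old snapshot is remapped onto slices of the compacted file. This is a hedged sketch with an invented `(path, row_start, row_end)` schema, not DuckLake's actual representation:

```python
import sqlite3

# Hypothetical schema: each catalog entry points at a row range of a file.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE data_file (
    snapshot_id INT,
    path        TEXT,
    row_start   INT,   -- first row of this snapshot's slice
    row_end     INT    -- one past the last row
)""")

# Snapshot 1 originally pointed at two small 100-row files. Compaction
# rewrote both into one big file and remapped snapshot 1 onto row ranges
# of it, so the small files can be deleted without losing time travel.
con.executemany("INSERT INTO data_file VALUES (?, ?, ?, ?)", [
    (1, 's3://b/compacted.parquet',   0, 100),  # was small-0.parquet
    (1, 's3://b/compacted.parquet', 100, 200),  # was small-1.parquet
    (2, 's3://b/compacted.parquet',   0, 200),  # current snapshot: whole file
])

# Time travel to snapshot 1 reads two slices of the compacted file.
slices = con.execute(
    "SELECT path, row_start, row_end FROM data_file "
    "WHERE snapshot_id = 1 ORDER BY row_start").fetchall()
print(slices)
```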

anentropic | 9 months ago

They say it's faster, for one thing: all metadata can be resolved in a single query instead of multiple HTTP requests.
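In an Iceberg-style layout, planning a scan means a chain of object-store reads (manifest list, then manifests, then file entries). With the catalog in SQL, the equivalent resolution — including pruning on column statistics — is one join. A sketch with an invented three-table schema:

```python
import sqlite3

# Hypothetical catalog: snapshots, their files, and per-file column stats.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE snapshot   (snapshot_id INT PRIMARY KEY);
CREATE TABLE data_file  (file_id INT PRIMARY KEY, snapshot_id INT, path TEXT);
CREATE TABLE file_stats (file_id INT, col TEXT, min_val INT, max_val INT);
""")
con.execute("INSERT INTO snapshot VALUES (1)")
con.executemany("INSERT INTO data_file VALUES (?, 1, ?)",
                [(1, 's3://b/a.parquet'), (2, 's3://b/b.parquet')])
con.executemany("INSERT INTO file_stats VALUES (?, 'x', ?, ?)",
                [(1, 0, 9), (2, 10, 19)])

# One round trip resolves the snapshot, its files, and prunes on stats
# (here: only files whose min/max range can contain x = 15).
rows = con.execute("""
    SELECT f.path
    FROM snapshot s
    JOIN data_file f  ON f.snapshot_id = s.snapshot_id
    JOIN file_stats t ON t.file_id = f.file_id
    WHERE t.col = 'x' AND 15 BETWEEN t.min_val AND t.max_val
""").fetchall()
print([p for (p,) in rows])  # only b.parquet survives pruning
```

The N sequential HTTP round trips become one query the catalog database can plan and index like any other.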