top | item 43405907

(no title)

pacbard | 11 months ago

Apache Iceberg builds an additional layer on top of Parquet files that let's you do ACID transactions, rollbacks, and schema evolution.

A Parquet file is a static file that has the whole data associated with a table. You can't insert, update, delete, etc. It's just it. It works ok if you have small tables, but it becomes unwieldy if you need to do whole-table replacements each time your data changes.

Apache Iceberg fixes this problem by adding a metadata layer on top of smaller Parquet files (at a 300,000 ft overview).

discuss

pgwhalen|11 months ago

I knot you’re not OP, but and while this explanation is good, it doesn’t make sense to frame all this as a “problem” for parquet. It’s just a file format, it isn’t intended to have this sort of scope.

hobs|11 months ago

The problem is that the "parquet is beautiful" is extended all the time to pointless things - pq doesn't support appending updates so let's merge thousands of files together to simulate a real table - totally good and fine.

inkyoto|11 months ago

Well… when Parquet came out, it was the first necessary evolutionary step required to solve the lack of the metadata problem in CSV extracts.

So, it is CSV++ so to speak, or CSV + metadata + compact data storage in a singular file, but not a database table gone astray to wander the world on its own as a file.

victor106|11 months ago

> Apache Iceberg builds an additional layer on top of Parquet files that let's you do ACID transactions, rollbacks, and schema evolution.

Delta format also supports this, correct?

orthoxerox|11 months ago

Correct. They have feature parity, basically.