If you're looking to give Iceberg a spin, here's how to get it running locally, on AWS[0] and on GCP[1]. The posts use DuckDB as the query engine, but you could swap in Trino (or even chDB / ClickHouse).
[0] https://www.definite.app/blog/cloud-iceberg-duckdb-aws
[1] https://www.definite.app/blog/cloud-iceberg-duckdb
I think Iceberg solves a lot of big-data problems around handling huge amounts of data on blob storage, including partitioning, compaction, and ACID semantics.
I really like the way the catalog standard can decouple underlying storage as well.
My biggest concern is how inaccessible the implementations are: Java/Spark has the only mature implementation right now, and even DuckDB doesn't support writing yet.
I built a tool to stream data to Iceberg using the Python Iceberg client: https://www.linkedin.com/pulse/streaming-iceberg-using-sqlfl...
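Roughly what the write path looks like with PyIceberg (a minimal sketch, not the tool itself; the catalog config, table name, and schema below are made up):

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    # load_catalog("default") picks up catalog settings from ~/.pyiceberg.yaml
    catalog = load_catalog("default")
    table = catalog.load_table("events.page_views")  # hypothetical table

    # each append commits a new snapshot, so micro-batches land atomically
    batch = pa.table({
        "user_id": pa.array([1, 2, 3], type=pa.int64()),
        "url": pa.array(["/a", "/b", "/c"]),
    })
    table.append(batch)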
Hidden partitioning is the most interesting Iceberg feature, because most very large datasets are time-series fact tables.
I don't remember seeing that in Delta Lake [1], which is probably because the industry-standard benchmarks use date as a column (TPC-H) or join date via a dimension table (TPC-DS), and query on dates rather than timestamp ranges.
[1] https://github.com/delta-io/delta/issues/490
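For anyone who hasn't seen it, hidden partitioning looks roughly like this in Spark SQL (a sketch; the catalog and table names are hypothetical):

    from pyspark.sql import SparkSession

    # assumes the Iceberg runtime jar and a catalog named "demo" are configured
    spark = SparkSession.builder.appName("hidden-partitioning").getOrCreate()

    # days(event_ts) is a partition transform: the table is laid out by day
    # without exposing a derived date column in the schema
    spark.sql("""
        CREATE TABLE demo.db.clicks (
            user_id BIGINT,
            event_ts TIMESTAMP
        ) USING iceberg
        PARTITIONED BY (days(event_ts))
    """)

    # a plain timestamp-range predicate is enough for partition pruning
    spark.sql("""
        SELECT count(*) FROM demo.db.clicks
        WHERE event_ts >= TIMESTAMP '2024-01-01'
          AND event_ts <  TIMESTAMP '2024-01-02'
    """).show()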
I think this mischaracterizes the state of the space. Iceberg won this competition as of a few months ago. All major vendors that didn't directly invent one of the other formats now support Iceberg or have announced plans to do so.
Building lakehouse products on any table format but Iceberg, starting now, seems to me like it must be a mistake.
The table on that page[1] makes it look like all three of these are very similar, with schema evolution and partition evolution being the key differences. Is that really it?
[1] Open Table Formats: https://www.starburst.io/data-glossary/open-table-formats/
I’d also love to see a good comparison between “regular” Iceberg and AWS’s new S3 Tables.
ClickHouse has a solid Iceberg integration. It has an Iceberg table function[0] and an Iceberg table engine[1] for interacting with Iceberg data stored in S3, GCS, Azure, Hadoop, etc.
[0] https://clickhouse.com/docs/en/sql-reference/table-functions...
[1] https://clickhouse.com/docs/en/engines/table-engines/integra...
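From Python that can look something like this (hedged sketch via the clickhouse-connect driver; the endpoint and bucket path are placeholders, and real use would also need bucket credentials):

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")
    # the iceberg() table function reads the table's metadata straight from
    # object storage; no catalog service is involved
    rows = client.query(
        "SELECT count(*) FROM iceberg('https://my-bucket.s3.amazonaws.com/warehouse/db/events')"
    ).result_rows
    print(rows)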
Right now, StarRocks or Trino are likely your best options, but all the major query engines (ClickHouse, Snowflake, Databricks, even DuckDB) are improving their support too.
What I like about Iceberg is that table partitions are not tightly coupled to the subfolder structure of the storage layer. At the end of the day the partitions are still subfolders with files, but the metadata is not tied to that layout, so you can change a table's partitioning going forward and still query a mix of old and new partition time ranges.
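Concretely, partition evolution is a metadata-only DDL (sketch; assumes a SparkSession `spark` with Iceberg's SQL extensions enabled, and all names are hypothetical):

    # old files keep the daily layout, new writes are laid out hourly, and
    # queries spanning both still plan correctly from the metadata
    spark.sql("ALTER TABLE demo.db.clicks ADD PARTITION FIELD hours(event_ts)")
    spark.sql("ALTER TABLE demo.db.clicks DROP PARTITION FIELD days(event_ts)")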
On the other hand, since one of the use cases it was created for at Netflix was consuming directly from real-time systems, managing file creation when the data gets updated is less trivial (the CoW vs. MoR problem and how to compact small files), and it becomes important on multi-petabyte tables with lots of users and frequent updates. This is something I assume not a lot of companies pay much attention to (heck, not even Netflix), and it has big performance and cost implications.
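For what it's worth, the CoW-vs-MoR choice is a per-table setting (hedged sketch; assumes a configured SparkSession `spark`, and the table name is made up):

    # merge-on-read makes updates/deletes cheap at write time at the cost of
    # read-time merging, which is exactly why compaction matters so much
    spark.sql("""
        ALTER TABLE demo.db.clicks SET TBLPROPERTIES (
            'write.update.mode' = 'merge-on-read',
            'write.delete.mode' = 'merge-on-read'
        )
    """)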
It's been on the up in recent years, though, as it appears to have won the format wars. Every vendor is rallying around it, and new open-source catalogs and AWS support arrived at the end of 2024.
I've been looking at Iceberg for a while, but in the end went with Delta Lake because it doesn't have a dependency on a catalog. It also has good support for reading and writing without needing Spark.
Does anyone know if Iceberg has plans to support similar use cases?
Iceberg has the Hadoop catalog, which also relies only on directories and files.
That said, a catalog (which Delta can also have) helps a lot to keep things tidy. For example, I can write a dataset with Spark, transform it with dbt and a query engine (such as Trino), and consume the resulting dataset with any client that supports Iceberg. With a catalog, all of that happens without having to register the dataset location in each of these components.
Why don't you want a catalog? The SQL or REST catalogs are pretty light to set up. I have my eye on lakekeeper[0], but Polaris (from Snowflake) is a good option too.
PyIceberg is likely the easiest way to write without Spark.
[0] https://github.com/lakekeeper/lakekeeper
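A rough sketch of that path (the endpoint, warehouse, and table names are placeholders; recent PyIceberg releases accept a pyarrow schema in create_table):

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(
        "rest",
        **{
            "type": "rest",
            "uri": "http://localhost:8181",  # e.g. a local lakekeeper instance
            "warehouse": "my-warehouse",
        },
    )

    df = pa.table({"id": pa.array([1, 2], type=pa.int64())})
    table = catalog.create_table("analytics.numbers", schema=df.schema)
    table.append(df)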
I'm doing data lake modernization for a medium-large enterprise and spent the last few months in sales calls for MS Fabric vs. Snowflake vs. Databricks. All fun, but now with managed Iceberg in AWS (S3 Tables) I tend to consider choosing none of them: plain Iceberg is good enough. Of course someone needs to write and read it, but there are so many good free options already that even building doesn't feel scary.
So I would be on the short side of Snowflake in the medium-to-long term (looking at their current value prop, at least). Databricks maybe has more of a future with its ML/AI-first approach. In the short term we might still start with Snowflake (and its Iceberg features), as the alternative future stack needs to mature and establish itself a bit.
Are there robust non-JVM implementations of Iceberg currently? Sorry to say, but recommending JVM ecosystems around large data just feels like professional malpractice at this point. Whether it's deployment complexity, resource overhead, tool sprawl, or operational complexity, the ecosystem seems to attract people who solve only 50% of the problem and add another tool to solve the rest, which in turn only solves 50%, etc., ad infinitum. The popularity of solutions like Snowflake, ClickHouse, or DuckDB is not an accident, and that is the direction everything should go. I hear Snowflake will adopt this in the future; that is good news.
To get good query performance from Iceberg, we have to run compaction frequently, and compaction turns out to be very expensive. Any tips for minimizing compaction while keeping queries fast?
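For reference, the usual tool here is Iceberg's rewrite_data_files procedure, which can be scoped so each run only touches recent partitions rather than the whole table (hedged sketch; catalog and table names are hypothetical):

    # assumes a SparkSession `spark` with an Iceberg catalog named "demo";
    # the where clause keeps a run from rewriting cold partitions, and a
    # bigger target file size means fewer runs are needed at all
    spark.sql("""
        CALL demo.system.rewrite_data_files(
            table => 'db.clicks',
            where => "event_ts >= TIMESTAMP '2024-01-01'",
            options => map('target-file-size-bytes', '536870912')
        )
    """)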
Delta Lake is the main competitor. There's a lot of convergence going on, because everyone wants a common format and it's pretty clear what the desirable features are. Ultimately it becomes just boring infrastructure IMO.
It allows you to be query-engine agnostic: query the same data via Spark, Snowflake, or Trino.
Granted, performance may suffer somewhat vs. Snowflake internal tables, due to certain performance optimizations not being there.
Writing to catalogs is still pretty new. Databricks has recently been pushing delta-kernel-rs, which DuckDB has a connector for, and there's support for writing via Python with the Polars package through delta-rs. As a small-time developer I've found this pretty helpful, and it was influential in picking Delta Lake over Iceberg.
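That Polars path is pleasantly small (sketch; the output path is made up, and write_delta needs the deltalake package installed):

    import polars as pl

    # write_delta goes through delta-rs under the hood
    df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    df.write_delta("/tmp/demo_delta", mode="append")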
Use it with Dropwizard/Spring Boot and you get to expose REST APIs too.
Does the query engine value-add justify Snowflake's valuation? Their data marketplace thing didn't seem to have actually worked.
This actually converges to 1:
1/2 + 1/4 + 1/8 + 1/16 + ... = 1
You just need 30kloc of Maven in your pom before you get there.