Aren't the alternatives you mentioned - icerberg and duckdb - both storage solutions while spark is a way to express distributed compute? I'm a bit out of touch with this space, is there a newer way to express distributed compute?
duckdb is primarily a query engine. It does have a storage format, but one of it's strengths is querying data where it already resides (e.g. a parquet file sitting in S3).
There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.
I think what many people are finding out is they don’t really need distributed processing. DuckDB on a single node can get you really far, and it’s much simpler.
DuckDB is not only a storage solution. It can directly query a variety of file formats at rest, without having to re-store anything. That's one of its selling points: you can query across archival/log data stored in S3 (or wherever) without needing to "ingest" anything or double-pay to duplicate the data you've already stored.
I’m just getting into DuckDB lately and finding this feature so exciting. It’s a totally new paradigm. Such a great tool for scientists, and probably many other people. I wish I took it seriously sooner.
Flink is designed around streaming first, while Spark is built around batch first and you're likely best off selecting accordingly. Though any streaming application likely needs batch processing to some degree. Latency vs throughput.
mritchie712|9 months ago
There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.
0 - https://www.definite.app/blog/smallpond
isignal|9 months ago
robertlacok|9 months ago
tomjakubowski|9 months ago
steve_adams_86|9 months ago
winwang|9 months ago
Nate75Sanders|9 months ago
mgfist|9 months ago
lamp_book|9 months ago