top | item 43985833

isignal | 9 months ago

Aren't the alternatives you mentioned, Iceberg and DuckDB, both storage solutions, while Spark is a way to express distributed compute? I'm a bit out of touch with this space; is there a newer way to express distributed compute?

mritchie712|9 months ago

DuckDB is primarily a query engine. It does have a storage format, but one of its strengths is querying data where it already resides (e.g. a Parquet file sitting in S3).

There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.

0 - https://www.definite.app/blog/smallpond

isignal|9 months ago

Thanks for the pointers!

robertlacok|9 months ago

I think what many people are finding out is they don’t really need distributed processing. DuckDB on a single node can get you really far, and it’s much simpler.

tomjakubowski|9 months ago

DuckDB is not only a storage solution. It can directly query a variety of file formats at rest, without having to re-store anything. That's one of its selling points: you can query across archival/log data stored in S3 (or wherever) without needing to "ingest" anything or double-pay to duplicate the data you've already stored.

steve_adams_86|9 months ago

I’m just getting into DuckDB lately and finding this feature so exciting. It’s a totally new paradigm. Such a great tool for scientists, and probably many other people. I wish I’d taken it seriously sooner.

Nate75Sanders|9 months ago

Flink. It has more momentum than Spark right now.

mgfist|9 months ago

"momentum" is a tricky word. Zig has more momentum than C++, but will it ever overtake the language? I'd bet not.

lamp_book|9 months ago

Flink is designed streaming-first, while Spark is built batch-first, so you're likely best off choosing accordingly, though any streaming application likely needs some batch processing too. It comes down to latency vs. throughput.
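A toy sketch of that trade-off in plain Python, not the Flink or Spark API: the function names and the running-total workload are invented for illustration. The streaming version emits a result after every event, while the batch version waits for a full batch and does fewer, larger units of work.

```python
from typing import Iterable, Iterator, List


def process_streaming(events: Iterable[int]) -> Iterator[int]:
    """Emit an updated running total after every single event (low latency)."""
    total = 0
    for value in events:
        total += value
        yield total


def process_batch(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Emit the running total once per batch (fewer emissions, higher latency)."""
    total = 0
    batch: List[int] = []
    for value in events:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)
            batch.clear()
            yield total
    if batch:  # flush the final partial batch
        total += sum(batch)
        yield total


events = [1, 2, 3, 4, 5]
streaming_results = list(process_streaming(events))
batch_results = list(process_batch(events, batch_size=2))

print(streaming_results)  # [1, 3, 6, 10, 15]
print(batch_results)      # [3, 10, 15]
```

Both end at the same total; the difference is how soon each intermediate answer is available and how many times work is triggered.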