faizshah | 3 months ago

I had to do something like this for a few TB of JSON recently. What made this workload unusual was that it consisted of a huge number of small 10-20 MB files.

I found that ClickHouse was the fastest, but DuckDB was the simplest to work with; it usually just works. DuckDB's performance was close enough to the maximum I got from ClickHouse.

I tried Flink and PySpark, but they were way slower (around 3-5x) than ClickHouse, and the code was kind of annoying. Dask and Ray were also way too slow, even though Dask's parallelism was easy to code. I also tried DataFusion and Polars, but ClickHouse ended up being faster.

These days I would recommend starting with DuckDB or ClickHouse for most workloads, simply because they are the easiest to work with and have good performance. Personally, I switched to DuckDB instead of Polars for most things where pandas is too slow.

sagarm | 3 months ago

Did you first ingest/convert this data to some other format, or did you operate directly on the JSON?