top | item 45695902

(no title)

djhworld | 4 months ago

Interesting and fun

> Workers download, decompress, and materialize their shards into DuckDB databases built from Parquet files.

I'm interested to know whether the 5s query time includes this materialization step of downloading the files etc, or is this result from workers that have been "pre-warmed". Also is the data in DuckDB in memory or on disk?

discuss

philbe77|4 months ago

hi djhworld. The 5s does not include the download/materialization step. That parts takes the worker about 1 to 2 minutes for this data set. I didn't know that this was going on HackerNews or would be this popular - I will try to get more solid stats on that part, and update the blog accordingly.

You can have GizmoEdge reference cloud (remote) data as well, but of course that would be slower than what I did for the challenge here...

The data is on disk - on locally mounted NVMe on each worker - in the form of a DuckDB database file (once the worker has converted it from parquet). I originally kept the data in parquet, but the duckdb format was about 10 to 15% faster - and since I was trying to squeeze every drop of performance - I went ahead and did that...

Thanks for the questions.

GizmoEdge is not production yet - this was just to demonstrate the art of the possible. I wanted to divide-and-conquer a huge dataset with a lot of power...

philbe77|4 months ago

I've since learned (from a DuckDB blog) - that DuckDB seems to do better when the XFS filesytem. I used ext4 for this, so I may be able to get another 10 to 15% (maybe!).

DuckDB blog: https://duckdb.org/2025/10/09/benchmark-results-14-lts