sterlinm | 2 years ago
It would be great if DuckDB handled this itself, but it seems that to be competitive with Athena on really massive datasets, you need a metadata layer that figures out which parquet files in S3 DuckDB actually needs to query, and then potentially runs those queries in parallel. This seems to be the architecture of Puffin (which I haven't personally tried yet).
[1] https://www.boilingdata.com/ [2] https://boilingdata.medium.com/lightning-fast-aggregations-b... [3] https://github.com/sutoiku/puffin
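A minimal sketch of the metadata-layer idea described above: keep per-file min/max column statistics (as a metadata layer might harvest from parquet footers), prune the file list against the query's predicate, and only then hand the surviving files to DuckDB. The stats dictionary, column name, and S3 paths here are invented for illustration; this is not how any of the linked projects actually implement it.

```python
# Hypothetical sketch: prune S3 parquet files with per-file min/max stats
# before building a DuckDB query over just the surviving files.

def prune_files(stats, column, lo, hi):
    """Keep only files whose [min, max] range for `column` overlaps [lo, hi]."""
    keep = []
    for path, col_stats in stats.items():
        cmin, cmax = col_stats[column]
        if cmax >= lo and cmin <= hi:  # ranges overlap
            keep.append(path)
    return sorted(keep)

# Invented per-file stats a metadata layer might maintain.
stats = {
    "s3://bucket/data/part-000.parquet": {"event_date": ("2021-01-01", "2021-03-31")},
    "s3://bucket/data/part-001.parquet": {"event_date": ("2021-04-01", "2021-06-30")},
    "s3://bucket/data/part-002.parquet": {"event_date": ("2021-07-01", "2021-09-30")},
}

files = prune_files(stats, "event_date", "2021-04-15", "2021-05-15")
# Only one of the three files overlaps the predicate, so DuckDB never
# touches the other two; the pruned list feeds read_parquet() directly.
sql = f"SELECT count(*) FROM read_parquet({files!r})"
```

With the pruned list in hand, running each surviving file's scan in parallel (the second half of the architecture) is a separate scheduling concern on top of this.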
One possible thing to look into would be whether this dataset is partitioned too finely. My understanding is that the recommended size for individual parquet files is 512 MB to 1 GB, whereas here they are 50 MB. It would be interesting to see the impact of the partitioning strategy on these benchmarks.
[4] https://parquet.apache.org/docs/file-format/configurations/ [5] https://www.dremio.com/blog/tuning-parquet/
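To make the file-count difference concrete, here is a back-of-the-envelope calculation comparing 50 MB files against the 512 MB recommendation. The 100 GB total dataset size is an assumption picked purely for illustration, not a figure from the benchmark.

```python
# Rough arithmetic: how many parquet files a dataset needs at a given
# target file size. Total size of 100 GB is an assumed example value.

def file_count(total_bytes, file_bytes):
    # Ceiling division: round up so the last partial file is counted.
    return -(-total_bytes // file_bytes)

GB = 1024**3
MB = 1024**2
total = 100 * GB

current = file_count(total, 50 * MB)       # 2048 files at 50 MB
recommended = file_count(total, 512 * MB)  # 200 files at 512 MB
```

Each extra file is an extra S3 request and footer read before any scanning starts, which is one reason the larger recommended file sizes can matter for query latency.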