To my understanding, it does not cache (but I haven't reviewed the code). "Expensive" here really just means expensive in time. If you're running DuckDB on an EC2 instance to query S3 on the same account, it's practically free of any other cost. I wouldn't bother doing it outside AWS unless it's a one-time thing.
Running a simple analytics query on ~4B rows across 6.6K parquet files in S3 on an m6a.xl takes around 7 minutes. And you can "index" these queries somewhat by adding dimensions in the path (s3://my-data/category=transactions/month=2024-05/rows1.parquet) which DuckDB will happily query on. So yeah, fairly expensive in time (but cheap for storage!). If you're just firehosing data into S3 and can add somewhat descriptive dimensions to your paths, you can optimize it a bit.
If the parquet file includes any row group stats, then I imagine DuckDB might be able to use those to avoid scanning the entire file. It's definitely possible to request specific sections of a blob stored in S3. But I'm not familiar enough with DuckDB to know whether or not it does this.
I asked this some time ago on their Discord in relation to AWS lambda and the Python client and the answer was that you need to handle caching on your own but it is easy to do with fsspec. I haven’t tried it yet though.
Duckdb over vanilla S3 has latency issues because S3 is optimized for bulk transfers, not random reads. The new AWS S3 Express Zone supports low-latency but there's a cost.
Caching Parquet reads from vanilla S3 sounds like a good intermediate solution. Most of the time, Parquet files are Hive-partitioned, so it would only entail caching several smaller Parquet files on-demand and not the entire dataset.
orthecreedence|1 year ago
Running a simple analytics query on ~4B rows across 6.6K parquet files in S3 on an m6a.xl takes around 7 minutes. And you can "index" these queries somewhat by adding dimensions in the path (s3://my-data/category=transactions/month=2024-05/rows1.parquet) which DuckDB will happily query on. So yeah, fairly expensive in time (but cheap for storage!). If you're just firehosing data into S3 and can add somewhat descriptive dimensions to your paths, you can optimize it a bit.
davesque|1 year ago
chrisjc|1 year ago
Parquet pushdowns combined with Hive structuring is a pretty good combination.
There are some HTTP and Metadata caching options in DuckDB, but I haven't really figured out how and when they really making a difference.
akdor1154|1 year ago
ayhanfuat|1 year ago
wenc|1 year ago
Duckdb over vanilla S3 has latency issues because S3 is optimized for bulk transfers, not random reads. The new AWS S3 Express Zone supports low-latency but there's a cost.
Caching Parquet reads from vanilla S3 sounds like a good intermediate solution. Most of the time, Parquet files are Hive-partitioned, so it would only entail caching several smaller Parquet files on-demand and not the entire dataset.
elchief|1 year ago