top | item 40516990

(no title)

Does DuckDB cache the S3 downloads? Otherwise it could get pretty expensive, no?

discuss

To my understanding, it does not cache (but I haven't reviewed the code). "Expensive" here really just means expensive in time. If you're running DuckDB on an EC2 instance to query S3 on the same account, it's practically free of any other cost. I wouldn't bother doing it outside AWS unless it's a one-time thing.

Running a simple analytics query on ~4B rows across 6.6K parquet files in S3 on an m6a.xl takes around 7 minutes. And you can "index" these queries somewhat by adding dimensions in the path (s3://my-data/category=transactions/month=2024-05/rows1.parquet) which DuckDB will happily query on. So yeah, fairly expensive in time (but cheap for storage!). If you're just firehosing data into S3 and can add somewhat descriptive dimensions to your paths, you can optimize it a bit.

davesque|1 year ago

If the parquet file includes any row group stats, then I imagine DuckDB might be able to use those to avoid scanning the entire file. It's definitely possible to request specific sections of a blob stored in S3. But I'm not familiar enough with DuckDB to know whether or not it does this.

chrisjc|1 year ago

DuckDB can do some pushdowns to certain file formats like parquet, but every release seems to be getting better and better at doing it.

Parquet pushdowns combined with Hive structuring is a pretty good combination.

There are some HTTP and Metadata caching options in DuckDB, but I haven't really figured out how and when they really making a difference.

akdor1154|1 year ago

It does do that. I can't answer OP's qn about caching though.

ayhanfuat|1 year ago

I asked this some time ago on their Discord in relation to AWS lambda and the Python client and the answer was that you need to handle caching on your own but it is easy to do with fsspec. I haven’t tried it yet though.

wenc|1 year ago

Do you have any details on this?

Duckdb over vanilla S3 has latency issues because S3 is optimized for bulk transfers, not random reads. The new AWS S3 Express Zone supports low-latency but there's a cost.

Caching Parquet reads from vanilla S3 sounds like a good intermediate solution. Most of the time, Parquet files are Hive-partitioned, so it would only entail caching several smaller Parquet files on-demand and not the entire dataset.

elchief|1 year ago

it doesn't cache, nor does it respect HTTP_PROXY which is kind of annoying