bitbang | 9 months ago

Why is the footer metadata not sufficient for this need? The metadata should contain the min and max timestamp values from the column of interest, so that when executing a query, the query tool can read the metadata and decide whether a given Parquet file needs to be read at all for the time range in the query.
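
A rough sketch of that footer-statistics check, using PyArrow (the file and column names here are made up for illustration). Note that even this check requires opening each file and fetching its footer:

```python
import pyarrow.parquet as pq

def file_ts_range(path, column="ts"):
    """Return the (min, max) timestamp range recorded in the footer."""
    pf = pq.ParquetFile(path)  # this read already touches the file
    idx = pf.schema_arrow.get_field_index(column)
    stats = [pf.metadata.row_group(rg).column(idx).statistics
             for rg in range(pf.metadata.num_row_groups)]
    return (min(s.min for s in stats if s is not None and s.has_min_max),
            max(s.max for s in stats if s is not None and s.has_min_max))

lo, hi = file_ts_range("events-2021-03-01.parquet")  # hypothetical file
# A planner would skip this file if (lo, hi) doesn't overlap the query range.
```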

amluto | 9 months ago

Because the footer metadata is inside the Parquet file itself, which is already far too late for an efficient query.

If I have an S3 bucket containing five years' worth of Parquet files, each covering a few days' worth of rows, and I tell my favorite query tool (DuckDB, etc.) about that bucket, then the tool will need to do a partial read (which is multiple operations, I think, since it will need to find the footer and then read it) of ~500 files just to find out which ones contain the data of interest. A good query plan would be a single list operation on the bucket to find the file names, followed by reading only the file or files needed to answer my query.

Iceberg and Delta Lake (I think -- I haven't actually tried them) can do this, but plain Parquet plus Hive partitioning can't, and I'm not aware of any other lightweight, well-supported scheme that can. My personal little query tool (which predates Parquet) does it just fine by the simple expedient of reading directory names.
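
A minimal sketch of that expedient, assuming a made-up naming convention that encodes each file's time range in its name; a single listing is enough to prune:

```python
from datetime import date

# In practice this list would come from one LIST call on the bucket.
files = [
    "events/2021-03-01_2021-03-04.parquet",
    "events/2021-03-04_2021-03-07.parquet",
    "events/2021-03-07_2021-03-10.parquet",
]

def overlaps(name: str, lo: date, hi: date) -> bool:
    """Parse 'start_end' out of the file name and test for range overlap."""
    stem = name.rsplit("/", 1)[-1].removesuffix(".parquet")
    start, end = (date.fromisoformat(part) for part in stem.split("_"))
    return start <= hi and end >= lo

to_read = [f for f in files if overlaps(f, date(2021, 3, 5), date(2021, 3, 6))]
# Only the middle file is ever opened.
```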

Jarwain | 9 months ago

Maybe I'm misunderstanding something about how DuckLake works, but isn't that the purpose of the 'catalog database'? To store the metadata about all the files so the query can be optimized?

In theory, going off of the schema diagram they have, all your files are listed in `data_file`, the timestamp range for each file would be in `file_column_stats`, and that information could be used to decide which files to _actually_ read based on your query.

Whether DuckDB's query engine takes advantage of this is a different story, but even if it doesn't yet, it should be possible to do so eventually.
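
A hedged sketch of the kind of pruning query that catalog would enable. The table and column names below come from the schema diagram as described above and may not match DuckLake's actual metadata schema exactly:

```python
import duckdb

con = duckdb.connect("catalog.db")  # hypothetical catalog database
files_to_read = con.execute(
    """
    SELECT f.path
    FROM data_file AS f
    JOIN file_column_stats AS s ON s.data_file_id = f.data_file_id
    WHERE s.column_name = 'ts'   -- hypothetical stats layout
      AND s.min_value <= ?       -- file starts at or before the query end
      AND s.max_value >= ?       -- file ends at or after the query start
    """,
    ["2021-03-06", "2021-03-05"],
).fetchall()
# One query against the catalog replaces ~500 footer reads.
```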

dugmartin | 9 months ago

This can also be done using row group metadata within the Parquet file. The row group metadata can include min/max values for a column, so you can "partition" on timestamps within a single file without needing a file per time range.
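
A minimal sketch of that row-group pruning, again with PyArrow and made-up names; only the row groups whose recorded range overlaps the query get read:

```python
from datetime import datetime
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")  # hypothetical file
idx = pf.schema_arrow.get_field_index("ts")
lo, hi = datetime(2021, 3, 5), datetime(2021, 3, 6)

keep = []
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(idx).statistics
    # Keep the group if stats are missing (can't prune) or ranges overlap.
    if stats is None or not stats.has_min_max or (stats.min <= hi and stats.max >= lo):
        keep.append(rg)

table = pf.read_row_groups(keep)  # reads only the matching row groups
```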

amluto | 9 months ago

But I want a file per range! I’m already writing out an entire chunk of rows, and that chunk is a good size for a Parquet file, and that chunk doesn’t overlap the previous chunk.

Sure, metadata in the Parquet file handles this, but a query planner has to read that metadata, whereas a sensible scheme for putting the metadata in the file path would let the planner avoid reading the file at all.

simlevesque | 9 months ago

I wish we had more control over the row group metadata when writing Parquet files with DuckDB.
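
For what it's worth, DuckDB's Parquet writer does expose a target row-group size through COPY options, though that only controls the row count per group, not which rows land in which group:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT range AS ts FROM range(1000000)")
# ROW_GROUP_SIZE sets the target number of rows per row group.
con.execute("COPY t TO 'out.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 100000)")
```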