Cool, but this is very specific to DataFusion, no? Is there any chance this would be standardized so other Parquet readers could leverage the same technique?
The technique can be applied by any engine, not just DataFusion. Each engine would have to know about the indexes in order to make use of them, but the fallback to parquet standard defaults means that the data is still readable by all.
But does data fusion publish a specification of how this metadata can be read, along with a test suite for verifying implementations? Because if they don't, this cannot be reliably used by any other impl
The Arrow/Parquet community is already discussing standardization via the Parquet format GitHub - this approach intentionally uses existing extension points in the format specification to remain compatible while the standardization discussions progress.
gdubya|7 months ago
aerzen|7 months ago
ethan_smith|7 months ago