I agree. The main reason I shared it is because I find it interesting as a library. I actually use it behind the scenes to build https://telemetry.sh. Essentially, I ingest JSON, infer a Parquet schema, store the data in S3 with a lookaside cache on disk, and then use DataFusion for querying.
Hugsun|1 year ago
threeseed|1 year ago
For example you can go through say 1% of your data and for each column see if you can coerce all of the values to a float, int, date, string etc. And then from there you can set the Parquet schema with proper types.