(no title)
ignoreusernames | 6 months ago
I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized, service-like process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years, and Java was (unfortunately) the standard for big-data stuff back then, so they simply stuck with it.
> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length
I think they did this to avoid dynamic dispatch in Java. In C++ or Rust something very similar would happen, but at the compiler level (templates or monomorphized generics), which is a much saner way of doing this kind of thing.
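To illustrate the compiler-level alternative: this is a hedged sketch of bit-unpacking with a Rust const-generic width, where the compiler monomorphizes one specialized routine per bit width actually used, instead of the source tree carrying generated code for every combination. The function name and LSB-first layout here are illustrative, not the actual code of any Parquet implementation.

```rust
/// Sketch only: unpack `count` bit-packed values of width `BITS`,
/// assuming values are packed LSB-first within each byte.
/// The compiler emits a specialized copy of this function for each
/// `BITS` the program uses (monomorphization), so no hand-generated
/// per-width source is needed.
fn unpack<const BITS: usize>(packed: &[u8], count: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(count);
    let mut pos = 0usize; // absolute bit position in `packed`
    for _ in 0..count {
        let mut v = 0u32;
        for i in 0..BITS {
            // Pull one bit out of the packed buffer.
            let bit = (packed[(pos + i) / 8] >> ((pos + i) % 8)) & 1;
            v |= (bit as u32) << i;
        }
        out.push(v);
        pos += BITS;
    }
    out
}

fn main() {
    // 3-bit values 1, 2, 3 packed LSB-first occupy bytes [0xD1, 0x00].
    println!("{:?}", unpack::<3>(&[0xD1, 0x00], 3)); // prints [1, 2, 3]
}
```

Each `unpack::<N>` instantiation is as specialized as the generated Java, but the duplication lives in the compiler's output rather than in the repository.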
willtemperley | 6 months ago
I've just had so many issues with the total lack of clarity in this format. They tell you a total_compressed_size for a page, then it turns out the _uncompressed_ page header is included in this, but the documentation barely gives any clues to the layout [1].
The reality:
Each column chunk contains a list of pages written back-to-back, with an optional dictionary page first. Each of these, including the dictionary page, is preceded by an uncompressed PageHeader in Thrift format.
It wasn't too hard to write a paragraph about it. It was quite hard looking for magic compression bytes in hex dumps.
Maybe there should be a "minimum workable reference implementation" or something that is slow but easy to understand.
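In that spirit, here is a hedged sketch of the layout just described: given per-page sizes, it computes where each page starts within the chunk and what the chunk-level compressed size works out to once the uncompressed Thrift header bytes are counted in. The `Page` struct and field names are illustrative, not the real Thrift definitions.

```rust
/// Illustrative stand-in for a page's on-disk footprint; the real
/// metadata lives in Thrift structs in the Parquet footer.
struct Page {
    header_len: u64,          // serialized (uncompressed) PageHeader bytes
    compressed_body_len: u64, // compressed page data following the header
}

/// Walk a column chunk's pages back-to-back (dictionary page, if any,
/// would simply be first in the slice). Returns each page's starting
/// offset within the chunk and the chunk's total size, which includes
/// the uncompressed headers, matching the surprise described above.
fn chunk_layout(pages: &[Page]) -> (Vec<u64>, u64) {
    let mut offsets = Vec::new();
    let mut pos = 0u64;
    for p in pages {
        offsets.push(pos);
        pos += p.header_len + p.compressed_body_len;
    }
    (offsets, pos)
}

fn main() {
    let pages = [
        Page { header_len: 20, compressed_body_len: 100 }, // e.g. dictionary page
        Page { header_len: 18, compressed_body_len: 80 },  // data page
    ];
    let (offsets, total) = chunk_layout(&pages);
    println!("page offsets: {:?}, total: {}", offsets, total);
    // prints: page offsets: [0, 120], total: 218
}
```

The point of the sketch: to find page two you must parse page one's header to learn its length; nothing in the chunk is self-delimiting, which is why hunting for compression magic bytes in hex dumps doesn't work.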
[1] https://parquet.apache.org/docs/file-format/data-pages/colum...
quotemstr | 6 months ago