top | item 44085682

(no title)

adelpozo | 9 months ago

I would add to make it streamable or at least allow to be read remotely efficiently.

discuss

order

flowerthoughts|9 months ago

Agreed on that one. With a nice file format, streamable is hopefully just a matter of ordering things appropriately once you know the sizes of the individual chunks. You want to write the index last, but you want to read it first. Perhaps you want the most influential values first if you're building something progressive (level-of-detail split.)

Similar is the discussion of delimited fields vs. length prefix. Delimited fields are nicer to write, but length prefixed fields are nicer to read. I think most new formats use length prefixes, so I'd start there. I wrote a blog post about combining the value and length into a VLI that also handles floating point and bit/byte strings: https://tommie.github.io/a/2024/06/small-encoding

lifthrasiir|9 months ago

I don't think a single encoding is generally useful. A good encoding for given application would depend on the value distribution and neighboring data. For example any variable-length scalar encoding would make vectorization much harder.