terrelln | 4 months ago

You'd have to tell OpenZL what your format looks like by writing a tokenizer for it and annotating which parts are which. We aim to make this easier with SDDL [0], but today it is not powerful enough to parse JSON. However, you can write that parser in C++ or Python.

Additionally, OpenZL works well on numeric data in native format, but JSON stores numbers as ASCII. We can transform ASCII integers into int64 data losslessly, but it is very hard to transform ASCII floats into doubles losslessly and reliably.
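To illustrate why integers are the easy case and floats are not, here is a minimal Python sketch. It is not OpenZL code: the function names `int_roundtrips` and `float_roundtrips` are hypothetical, and comparing against Python's shortest `repr` is only an assumed stand-in for whatever formatting rule a real transform would use.

```python
import re
import struct

# Canonical JSON-style integer: optional minus, no leading zeros, no "-0".
_CANONICAL_INT = re.compile(r"(0|-?[1-9][0-9]*)\Z")
_INT64_MIN, _INT64_MAX = -(2**63), 2**63 - 1


def int_roundtrips(text: str) -> bool:
    """True if text -> int64 -> text reproduces the original characters."""
    if not _CANONICAL_INT.match(text):
        return False
    value = int(text)
    if not _INT64_MIN <= value <= _INT64_MAX:
        return False
    packed = struct.pack("<q", value)  # 8-byte little-endian int64
    return str(struct.unpack("<q", packed)[0]) == text


def float_roundtrips(text: str) -> bool:
    """True if text -> double -> shortest repr reproduces the original text.

    Assumption: the writer used Python-style shortest formatting. Many valid
    JSON spellings of the same double ("1.5", "1.50", "15e-1") fail this.
    """
    try:
        value = float(text)
    except ValueError:
        return False
    return repr(value) == text
```

Under these assumptions, `int_roundtrips("12345")` holds, while `float_roundtrips("1.50")` and `float_roundtrips("1e3")` do not, because the double alone does not remember which spelling the writer chose.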

However, once you've done the work to parse the data (and/or massage it into a friendlier format), I would expect OpenZL to work very well. Highly repetitive, numeric data with a lot of structure is where OpenZL excels.
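As a rough idea of what "massaging it to a more friendly format" could look like, here is a sketch that splits flat JSON records into one stream per field and packs an integer field as native int64. This does not use the OpenZL API. The helper names `split_into_columns` and `pack_int64_column` are hypothetical, and the sketch assumes flat records that all share the same keys.

```python
import json
import struct
from collections import defaultdict


def split_into_columns(records):
    """Group values by field name so each stream holds one kind of data."""
    columns = defaultdict(list)
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return dict(columns)


def pack_int64_column(values):
    """Pack a list of Python ints as consecutive little-endian int64 values."""
    return struct.pack(f"<{len(values)}q", *values)


records = json.loads('[{"id": 1, "temp": 20}, {"id": 2, "temp": 21}]')
columns = split_into_columns(records)
id_stream = pack_int64_column(columns["id"])
```

Here `columns` ends up with one list per field, and `id_stream` is 16 bytes of native integers rather than ASCII digits interleaved with keys and punctuation.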

[0] https://openzl.org/api/c/graphs/sddl/

kstenerud | 4 months ago

I've done a binary representation of JSON-structured data that uses unary coding for variable-length length fields: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

This tends to confuse generic compressors, even though the sub-byte data itself usually clusters around the smaller lengths for most data and thus can be quite repetitive (plus it's super efficient to encode/decode). Could this be described such that OpenZL can capitalize on it?
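For readers unfamiliar with the idea, a unary-coded length prefix can be sketched roughly as follows. This is a generic illustration, not the exact BONJSON layout: the bit order, the 56-bit limit, and the function names `encode_length` and `decode_length` are all assumptions made for the example.

```python
def encode_length(n: int) -> bytes:
    """Encode n with a unary prefix in the low bits of the first byte.

    k - 1 zero bits followed by a 1 bit mean "k bytes in total"; the
    remaining 7 * k bits carry n, little-endian.
    """
    if n < 0:
        raise ValueError("length must be non-negative")
    for k in range(1, 9):
        if n < 1 << (7 * k):
            value = ((n << 1) | 1) << (k - 1)
            return value.to_bytes(k, "little")
    raise ValueError("length too large for this sketch (max 56 bits)")


def decode_length(buf: bytes) -> tuple[int, int]:
    """Return (length, bytes_consumed) for a prefix made by encode_length."""
    first = buf[0]
    if first == 0:
        raise ValueError("prefixes longer than 8 bytes are not supported here")
    # Count trailing zero bits of the first byte to get the total size.
    k = (first & -first).bit_length()
    if len(buf) < k:
        raise ValueError("truncated length field")
    return int.from_bytes(buf[:k], "little") >> k, k
```

With this layout, any length below 128 fits in one byte whose lowest bit is 1, so data dominated by short strings produces first bytes that differ only in their upper bits. The payload is shifted off byte boundaries, which may be part of why byte-oriented compressors struggle with it.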