mxmlnkn | 6 months ago

I concur with most of these arguments, especially about longevity. But this only applies to smallish files like configurations, because I don't agree with the last paragraph regarding efficiency.

I have had to work with large 1GB+ JSON files, and it is not fun. Amazing projects such as jsoncons, for streaming JSON, and simdjson, for parsing JSON with SIMD, exist, but as far as I know, the latter still does not support streaming and even has an open issue for files larger than 4 GiB. So you cannot have streaming for memory efficiency and SIMD parsing for computational efficiency at the same time. You want streaming because holding the whole JSON in memory is wasteful and sometimes not even possible. JSONL tries to change the format to fix that, but now you have another format that you need to support.
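The reason JSONL streams so easily is that each record is a complete JSON value on its own line, so a reader never needs to hold more than one record in memory. A minimal sketch using only the standard library (the sample data is made up for illustration):

```python
import io
import json

# JSONL: one JSON value per line. A file-like object can be consumed
# line by line, so memory use stays bounded by the largest record.
jsonl = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')

total = 0
for line in jsonl:  # iterating a file yields one line (one record) at a time
    record = json.loads(line)
    total += record["id"]

print(total)  # 6
```

With plain JSON (one giant array) there is no such line boundary to exploit, which is exactly the trade-off described above.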

I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful. Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that without parsing everything in search of the closing bracket or quotes, accounting for escaped brackets and quotes, and nesting.

jerf|6 months ago

My rule of thumb that has been surprisingly robust over several uses of it is that if you gzip a JSON format you can expect it to shrink by a factor of about 15.

That is not the hallmark of a space-efficient file format.

Between repeated string keys and frequently repeated string values, which are often quite large due to being "human readable", it adds up fast.
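This is easy to demonstrate: a list of records with repeated keys and repeated string values compresses dramatically. The exact ratio depends entirely on the data, so the synthetic records below are only illustrative:

```python
import gzip
import json

# Synthetic records with repeated keys and repeated string values --
# the pattern described above.
records = [{"status": "active", "kind": "user", "index": i} for i in range(10_000)]
raw = json.dumps(records).encode()
packed = gzip.compress(raw)

ratio = len(raw) / len(packed)
print(len(raw), len(packed), round(ratio, 1))
```

Real-world JSON with more varied values will compress less, but the repeated-key overhead alone usually buys a large factor.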

"I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data."

One trick you can use is to prefix a file with some JSON or other readable value, then dump the binary afterwards. The JSON can have offsets into the binary as necessary for identifying things or labeling whether or not it is compressed or whatever. This often largely mitigates the inefficiency concerns because if you've got a big pile of binary data the JSON bloat by percent tends to be much smaller than the payload; if it isn't, then of course I don't recommend this.
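A minimal sketch of that trick, with hypothetical `pack`/`unpack` helpers (the 4-byte length prefix and the header layout are my own assumptions, not a standard format):

```python
import json
import struct

def pack(meta: dict, blobs: list) -> bytes:
    """Prefix a JSON header (carrying offsets into the payload) to raw binary blobs."""
    header = dict(meta)
    entries, offset = [], 0
    for blob in blobs:
        entries.append({"offset": offset, "size": len(blob)})
        offset += len(blob)
    header["entries"] = entries
    hdr = json.dumps(header).encode()
    # 4-byte little-endian header length, then the JSON header, then the payloads.
    return struct.pack("<I", len(hdr)) + hdr + b"".join(blobs)

def unpack(data: bytes):
    """Read the header length, parse the JSON header, slice out each blob."""
    (hlen,) = struct.unpack_from("<I", data, 0)
    header = json.loads(data[4 : 4 + hlen])
    base = 4 + hlen
    blobs = [data[base + e["offset"] : base + e["offset"] + e["size"]]
             for e in header["entries"]]
    return header, blobs
```

Because the header records offsets, a reader can seek straight to one blob without touching the others, and the JSON overhead stays a small fraction of a large binary payload.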

mxmlnkn|6 months ago

I can confirm usual compression ratios of 10-20 for JSON. For example, wikidata-20220103.json.gz is quite fun to work with. It is 109 GB, which decompresses to 1.4 TB, and even the non-compressed index for random access with indexed_gzip is 11 GiB. The compressed random access index format, which gztool supports, would be 1.4 GB (compression ratio 8). And rapidgzip even supports the compressed gztool format with further file size reduction by doing a sparsity analysis of required seek point data and setting all unnecessary bytes to 0 to increase compressibility. The resulting index is only 536 MiB.

The trick for the mix of JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes. Although the gztool index format seems to suffice for now.

jauntywundrkind|6 months ago

I see so many comments on this submission talking about large files. It feels like a massively overrepresented concern to me.

On Linux, a good number of filesystems have built-in compression. My JSON all gets hit with lz4 compression automatically.

It is indeed annoying having to compress and decompress files before sending. It'd be lovely if file transfer tools (including messaging apps) were a bit better at auto-compressing. I think btrfs tests for compressibility too, and will give up on trying to compress at some point: a similar effort ought to be applied here.

The large-file and efficiency questions feel like they're dominating this discussion, and they just don't seem like a particularly interesting or fruitful concern to me. It shouldn't matter much. The computer can and should generally be able to eliminate most of the downsides relatively effectively.

zzo38computer|6 months ago

> I have had to work with large 1GB+ JSON files, and it is not fun.

I have also had to work with large JSON files, even though I would prefer other formats. I wrote some C code to split them into records, which works by keeping track of the nesting level, of whether the parser is inside a string, and of escaping within strings (so that escaped quotation marks are handled properly). It is not too difficult.
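The same bookkeeping is easy to sketch in Python (this is my own illustration of the technique, not the C code mentioned above; it assumes the input is a top-level JSON array of objects or arrays):

```python
import json

def split_records(s: str):
    """Yield each element of a top-level JSON array as a substring,
    tracking nesting depth, in-string state, and escape state so that
    brackets and quotes inside strings are ignored."""
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(s):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            if depth == 2:
                start = i  # a record begins just inside the outer array
        elif ch in "]}":
            if depth == 2:
                yield s[start : i + 1]
            depth -= 1

# A record whose string value contains a brace and an escaped quote.
data = '[{"a": "}\\""}, {"b": [1, 2]}]'
print(list(split_records(data)))
```

Each yielded substring is itself valid JSON, so downstream workers can parse records independently.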

> I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful.

I agree, which is one reason I do not like JSON (I prefer DER). In addition to that, there is escaping text.

> Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that

With DER you can easily skip over any data.
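That works because DER is tag-length-value: every value announces its own length up front, so a reader can jump past it without decoding the contents. A minimal sketch of the length decoding (short form vs. long form per X.690; the single-byte-tag assumption is a simplification for illustration):

```python
def skip_der_value(data: bytes, pos: int) -> int:
    """Return the offset just past the DER value starting at pos."""
    pos += 1                      # tag octet (assume single-byte tags here)
    first = data[pos]
    pos += 1
    if first < 0x80:              # short form: length fits in 7 bits
        length = first
    else:                         # long form: next (first & 0x7F) octets hold the length
        n = first & 0x7F
        length = int.from_bytes(data[pos : pos + n], "big")
        pos += n
    return pos + length

# A DER SEQUENCE containing two INTEGERs: 0x05 and 0x1234.
der = bytes([0x30, 0x07, 0x02, 0x01, 0x05, 0x02, 0x02, 0x12, 0x34])
inner = 2                         # offset of the first element inside the SEQUENCE
after_first = skip_der_value(der, inner)
print(after_first)  # 5 -- jumps past INTEGER 0x05 without decoding it
```

Compare this with JSON, where finding the end of a value requires scanning every byte for the matching bracket or quote.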

However, I think the formats with type/length/value (such as DER) do not work as well for streaming, and vice-versa.

andreypopp|6 months ago

Try clickhouse-local; it's amazing how it can crunch JSON/TSV or whatever at great speed.