top | item 43982617

(no title)

victorNicollet | 9 months ago

We are using binary formats in production, also for data visualization and analysis. We went for a simple custom format: in addition to the usual JSON types (string, number, array, boolean, object), the serialized value can contain the standard JSON types, but can also contain JavaScript typed arrays (Uint8Array, Float32Array, etc). The serialized data contains the raw data of all the typed arrays, followed by a single JSON-serialized block of the original value with the typed arrays replaced by reference objects pointing to the appropriate parts of the raw data region.

For most data visualization tasks, the dataset will be composed of 5% of JSON data and 95% of a small number of large arrays (usually Float32Array) representing data table columns. The JSON takes time to parse, but it is small, and the large arrays can be created in constant time from the ArrayBuffer of the HTTP response (on big-endian machines, this will be linear time for all except Uint8Array).

For situations where hundreds of thousands of complex objects must be transferred, we will usually pack those objects as several large arrays instead. For example, using a struct-of-arrays instead of an array-of-structs, and/or by having an Uint8Array contain the result of a binary serialization of each object, with an Uint32Array containing the bounds of each object. The objective is to have the initial parsing be nearly instantaneous, and then to extract the individual objects on demand: this minimizes the total memory usage in the browser, and in the (typical) case where only a small subset of objects is being displayed or manipulate, the parsing time is reduced to only that subset instead of the entire response.

The main difficulty is that the browser developer tools "network" tab does not properly display non-JSON values, so investigating an issue requires placing a breakpoint or console.log right after the parsing of a response...

discuss

order

pjmlp|9 months ago

Usually for such scenarios, in the context of distributed systems, I tend to rely on general purpose monitoring tools, or a quick and dirty mini gateway if I can't reach for them.