marknadal | 6 years ago

Holy Cow! 2.5GB/s that is amazing.

Meanwhile I can barely get Chrome/NodeJS to parse 20MB in less than 100ms :(.

How useful (or useless) would Simdjson as a Native Addon to V8 be? I assume transferring the object into JS land would kill all the speed gains?

I wrote my own JSON parser just last week, to see if I could improve the NodeJS situation. Discovered some really interesting factoids:

(A) JSON parse is CPU-blocking, so if you get a large object, your server cannot handle any other web request until it finishes parsing, this sucks.

(B) At first I fixed this by using setImmediate/shim, but discovered two annoying issues:

(1) Scheduling too many setImmediates will cause the event loop to block at the "check" cycle, you actually have to load balance across turns in the event loop like so (https://twitter.com/marknadal/status/1242476619752591360)

(2) Doing the above will cause your code to be way slow, so a trick instead is to skip setImmediate and invoke your code 3333 times (some divisor of NodeJS's ~11K stack depth limit) or for 1ms before doing a real setImmediate.

(C) Now that we can parse without blocking, our parser's while loop (https://github.com/amark/gun/blob/master/lib/yson.js) marches X byte increments at a time (I found 32KB to be a sweet spot, not sure why).

(D) I'm seeing this pure JS parser be ~2.5X slower than native for big complex JSON objects (20MB).

(E) Interestingly enough, I'm seeing 10X~20X faster than native, for parsing JSON records that have large values (ex, embedded image, etc.).

(F) Why? This happened when I switched my parser to skip per-byte checks when encountering `"` to next indexOf. So it would seem V8's built-in JSON parser is still checking every character for a token, which slows it down?

(G) I hate switch statements, but woah, I got a minor but noticeable speed boost going from if/else token checks to a switch statement.
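
A minimal sketch of the time-budget trick from (B)(2), not the code from the linked yson.js. The names `runSliced`, `step`, `maxIterations` and `maxMillis` are hypothetical, and the sketch uses a plain loop rather than nested calls, so the 3333 figure is only carried over as a default and has nothing to do with stack depth here. The idea: keep calling the work function synchronously, and only fall back to a real setImmediate once an iteration count or a ~1ms budget is used up.

```javascript
// Hypothetical helper (not from yson.js): call `step` until it returns true,
// yielding to the event loop only when the iteration or time budget is spent.
function runSliced(step, { maxIterations = 3333, maxMillis = 1 } = {}) {
  return new Promise((resolve, reject) => {
    let yields = 0; // number of times control was handed back to the event loop

    function slice() {
      const start = Date.now();
      let iterations = 0;
      try {
        while (true) {
          if (step()) return resolve(yields); // work finished
          iterations++;
          // Budget spent: stop this slice and schedule the next one.
          if (iterations >= maxIterations || Date.now() - start >= maxMillis) break;
        }
      } catch (err) {
        return reject(err);
      }
      yields++;
      setImmediate(slice); // one real setImmediate per slice, not per step
    }

    slice();
  });
}

// Usage: sum 10,000 numbers one element per step without hogging the loop.
let i = 0;
let total = 0;
runSliced(() => { total += i; i++; return i >= 10000; })
  .then(() => console.log('sum:', total));
```

In a parser, `step` would presumably consume the next chunk of input (the 32KB increments from (C)) rather than a single number.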

Happy to answer any other Qs!

But compared to OP's 2.5GB/s parsing?! Ha, mine is a joke.

bjoli|6 years ago

I did a small benchmark on my machine last time simdjson was up for discussion, and back then it was faster than /bin/cat.

mianos|6 years ago

This comment was right at the bottom. It was so funny I just spit my coffee.

huhnmonster|6 years ago

I've also written and tried to optimize a hand-rolled JSON parser for exchange messages, just to see how fast pure JS could go. I tried many different things, but I only ever got near to the native implementation once I started assuming certain offsets in the buffer or optimistically parsing whole keys which were highly unsafe. My verdict was that you will never really get close to native, let alone close to hand-optimized C/C++.

wingi|6 years ago

The native parser is C++.

zbjornson|6 years ago

The interchange into v8 is indeed an issue, see another comment: https://news.ycombinator.com/item?id=22745941.

> JSON parse is CPU-blocking, so if you get a large object, your server cannot handle any other web request until it finishes parsing

Well, your CPU core is busy on one request or another, so I don't understand why this is an issue as long as you're guarding against maliciously large bodies. Blocking I/O is different because your core is partially idle while other hardware is doing async work. Using Node.js' cluster module lets you keep more cores busy. Chunking CPU-limited work increases total CPU time and memory required. (This is a pet peeve of mine and a hill I'm willing to die on :-) .)
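
As a rough sketch of the "guarding against maliciously large bodies" part: the function below stops reading as soon as a byte limit is exceeded, so JSON.parse never sees an oversized input. `readJsonBody` and `maxBytes` are hypothetical names, and the 1 MB default is an arbitrary assumption, not a recommendation from this thread.

```javascript
const { Readable } = require('node:stream');

// Hypothetical helper: collect a readable stream (e.g. an HTTP request body),
// rejecting once it grows past `maxBytes`, then parse it with the native parser.
async function readJsonBody(stream, maxBytes = 1000000) {
  const chunks = [];
  let size = 0;
  for await (const chunk of stream) {
    const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk);
    size += buf.length;
    if (size > maxBytes) {
      // Throwing inside for-await also destroys the underlying stream.
      throw new RangeError(`body exceeds ${maxBytes} bytes`);
    }
    chunks.push(buf);
  }
  return JSON.parse(Buffer.concat(chunks).toString('utf8'));
}

// Usage with an in-memory stream standing in for a request:
readJsonBody(Readable.from(['{"ok":', 'true}']), 1024)
  .then((body) => console.log(body.ok));
```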

marknadal|6 years ago

I think that is a good hill to die on, tho I would rather prioritize UX (browser not freezing) and server responsiveness. Ideally we'd have no CPU chunking & good UX, but if we have to choose one, which would you sacrifice?

imtringued|6 years ago

There are third-party bindings for Node.js: https://github.com/luizperes/simdjson_nodejs. As you suspected, converting the entire document to a JS object is not recommended. [0] There is an additional API that allows you to query keys without conversion.

[0] https://github.com/luizperes/simdjson_nodejs/issues/5

luizperes|6 years ago

Yes, that is correct. I spent a lot of time on issue #5 to make it as user-friendly as I could, but the only way I found to avoid all the C++/JS conversion overhead was to keep the pointer to the external C++-parsed object. There might be other options that I haven't thought of, so if anyone knows of a better approach, let me know.

ksherlock|6 years ago

> ... Why? This happened when I switched my parser to skip per-byte checks when encountering `"` to next indexOf.

Q: What happens when you parse "\\" ?

marknadal|6 years ago

If string[index-1] === `\\` Then skipAgain
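
For illustration, a hedged sketch of the indexOf skip with escape handling. This is not the code from yson.js, and `findClosingQuote` is a hypothetical name. Taken literally, checking only the single preceding character would misread the JSON text `"\\"`, where the closing quote follows a backslash that is itself escaped. Counting the whole run of backslashes before the quote avoids that: an even count means the quote is real, an odd count means it is escaped.

```javascript
// Hypothetical helper: given `text` and the index of an opening quote,
// return the index of the matching closing quote, or -1 if unterminated.
function findClosingQuote(text, openIndex) {
  let from = openIndex + 1;
  while (true) {
    const q = text.indexOf('"', from); // jump straight to the next quote
    if (q === -1) return -1;

    // Count the run of backslashes immediately before the candidate quote.
    let backslashes = 0;
    for (let k = q - 1; k > openIndex && text[k] === '\\'; k--) backslashes++;

    // Even count: the backslashes escape each other, so this quote closes the string.
    if (backslashes % 2 === 0) return q;

    from = q + 1; // odd count: the quote is escaped, keep searching
  }
}

// Usage: the JSON text "\\" is four characters: quote, backslash, backslash, quote.
const tricky = '"\\\\"';
console.log(findClosingQuote(tricky, 0)); // 3
```

The backward scan only runs when a quote is found, so long values with no quotes or backslashes are still skipped in a single indexOf call.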