glangdale | 4 years ago

I started to build "simdcsv" - based on similar principles to "simdjson", but succumbed to lassitude. There's still a fork kicking around.

Doing the CSV version of the quote convention is actually wildly easier than the JSON version, as you can just treat a record like "foo""bar",hi,1,"hatstand,teakettle" as having "left the quotes and re-entered the quotes" at the double-quote spot when you're busy looking for ','. This isn't, of course, much help for normalization, but for the bit where you're hoping to simply find which ',' characters are separators, it's fine to pretend that there's a gap in your "quoted stuff bitfield" that happens at the "" in "foo""bar", as of course that gap isn't going to land on a ',' anyhow. So it's much cheaper than doing a tedious shenanigan to handle some crafty user who has hit you with 100 \ characters in a row (as opposed to 99 or 101), a la JSON.
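
To make the "gap in the bitfield" idea concrete, here is a rough sketch in plain Python. It is not the simdcsv code; the function names are made up for illustration, it only handles a single block of up to 64 bytes with no quote state carried in from a previous block, and the shift-and-XOR loop is a portable stand-in for what a carry-less multiply by all-ones would compute in one instruction.

```python
MASK64 = (1 << 64) - 1


def prefix_xor(bits: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of the input.

    Portable stand-in for a carry-less multiply by an all-ones word.
    """
    bits &= MASK64
    shift = 1
    while shift < 64:
        bits ^= (bits << shift) & MASK64
        shift <<= 1
    return bits


def char_mask(block: bytes, ch: int) -> int:
    """Bitmask with bit i set where block[i] == ch (hypothetical helper)."""
    mask = 0
    for i, b in enumerate(block):
        if b == ch:
            mask |= 1 << i
    return mask


def separator_positions(block: bytes) -> list[int]:
    """Indices of ',' characters that sit outside quotes.

    Assumes one block of at most 64 bytes that starts outside any quoted field.
    """
    assert len(block) <= 64
    quotes = char_mask(block, ord('"'))
    commas = char_mask(block, ord(','))
    # Set from each opening quote up to (not including) its closing quote.
    # An escaped "" just toggles the state off and back on: a one-bit gap
    # that sits on a quote character, never on a comma.
    inside = prefix_xor(quotes)
    seps = commas & ~inside & MASK64
    return [i for i in range(len(block)) if (seps >> i) & 1]


record = b'"foo""bar",hi,1,"hatstand,teakettle"'
print(separator_positions(record))
```

On the example record this reports the three commas after "foo""bar", hi and 1, and skips the one inside "hatstand,teakettle". A block with no quotes at all goes through the same path: the quote mask is zero, so every comma is a separator.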

IMO the cost of doing the CLMUL all the time vs taking a conditional to handle the 'I have no quotes today' case is pretty low. It makes for shorter and more easily understood (in performance terms) code. I wasn't allowed to take this to extremes in simdjson - we did handle the "common case" for UTF-8 validation (although I wonder if that's a very culturally determined notion of "common case").
