Number Parsing at a Gigabyte per Second

[+] MaxBarraclough|5 years ago|reply

Somewhat related, from the same author: Base64 encoding and decoding at almost the speed of a memory copy. (2019) [0]

Also this discussion of algorithms for the inverse problem, converting floats to strings. [1]

[0] https://news.ycombinator.com/item?id=21459839

[1] https://news.ycombinator.com/item?id=24939411

[+] di4na|5 years ago|reply

Reading it, i get a strange deja-vu. Having been spending the past 3 months on implementing Ryu to do the reverse operation, this looks a lot like what Ulf Adams have been working on to do the reverse based on Ryu. The lookup table are really close and the map look the same.

While Ryu and Ulf work is indeed cited for the reverse, i do not see any acknowledgement for his work on the string to float version.... despite really close similitude.

I suppose sometimes it is too obvious...

[+] FrozenVoid|5 years ago|reply

These are really nice libraries, but they indicate there is a problem with input data having "gigabytes of ASCII floats", where you store floats as strings instead of binary formats(which doesn't require parsing anything). for scripts and user-input, i'd rather use the standard stuff that has been tried and tested(strtod/strtold).

[+] MaxBarraclough|5 years ago|reply

If it's so cheap to decode that it's barely slower than copying the bytes, the problem largely goes away, no?

No doubt you're right that it would be more efficient to store the binary representation natively, especially assuming portability isn't an issue, and especially if a zero-copy solution were used, but many real-world systems have to cope with data formats that aren't the most efficient. Huge amounts of data are shipped as XML and JSON.

[+] hacker_9|5 years ago|reply

Sometimes third parties give you json and that's it, in these cases these libraries are useful if being fastest to react is a constraint.

[+] asicsp|5 years ago|reply

>On my Apple M1 MacBook, using a realistic data file (canada), we get that fast_float can far exceeds a gigabyte per second, and get close to 2 GB/s. The conventional C function (strtod) provided by the default Apple standard library does quite poorly on this benchmark.

[+] Out_of_Characte|5 years ago|reply

This seems very usefull but only when you have gigabytes of compliant ascii text of floating point numbers. I do wonder how well this performs on non-compliant text and wether there are systems out there that are limited by the 130MB transfer speeds of standard libraries.

Now that I think about it, Excel spreadsheets, Json, XML,textfiles are all mayor contributors to sometimes very flawed ascii-based workloads that should have a complementary binary backing.

[+] the8472|5 years ago|reply

Not all optimization work consists of attacking the dominating function. Sometimes most low-hanging fruit already are plucked and you'll have to speed up dozens of smaller things and float parsing can be one of those.

[+] maccard|5 years ago|reply

Just because something isn't the bottleneck doesn't mean it's not worth optimising.if you spend 10% decoding, 80% working and 10% saving the results, saving that first or last 10% is definitely worthwhile.

My $400 consumer motherboard has a 10Gb network card, and my SSD reads at over 50Gb/s. Anything that brings IO closer to those speeds is welcome.

[+] jhokanson|5 years ago|reply

I'm curious as to what the biggest win in terms of speed was here (in terms of an approach, good lookup tables?). Also I'm curious how this compares to the many (?) JSON parsers that have rolled their own number parser because everyone knows the standard library is so slow ... (just more accurate?, faster?). Regardless, kudos to the authors on their work!

[+] SloopJon|5 years ago|reply

He touched on JSON parsers in a previous post about fast_double_parser: "People who write fast parsers tend to roll their own number parsers (e.g., RapidJSON, sajson), and so we did. However, we sacrifice some standard compliance." (The "we" in this context refers to simdjson.)

https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...

He followed up in a comment: "RapidJSON has at least two fast-parsing mode. The fast mode, which I think is what you refer to, is indeed quite fast, but it can be off by one ULP, so it is not standard compliant."

The Github README for this new project says, "The fast_float library provides a performance similar to that of the fast_double_parser library."

https://github.com/fastfloat/fast_float

However, the benchmarks show a significant improvement relative to those in the fast_double_parser README:

https://github.com/lemire/fast_double_parser

I tried to run the benchmarks, but my CMake is apparently too old, and Homebrew barfed all over the living room rug when I tried to update it.

[+] lenkite|5 years ago|reply

This referenced paper is dense yet surprisingly readable. Looking forward to all all C++ implementations implementing the std::from_chars function.

[+] sjdhasjhd|5 years ago|reply

[deleted]

19 comments