The article mentions in an addendum (and BeeOnRope also pointed it out in the HN thread) a nice CLMUL trick for dealing with quotes originally discovered by Geoff Langdale. That should work here for a nice speedup.
But without the CLMUL trick, I'd guess that the unaligned loads that generally occur after a vector containing both quotes and newlines in this version (the "else" case on lines 34-40) would hamper the performance somewhat, since it would eat up twice as much L1 cache bandwidth. I'd suggest dealing with the masks using bitwise operations in a loop, and letting i stay divisible by 16. Or just use CLMUL :)
Thanks for pointing us to CLMUL, I'm not familiar with these kind of multiplications, but, converting the quote bitmask to a quoted bitmask would certainly make it faster. With this new bitmask, we could negate it and AND it with the newline mask, generating a mask of newlines that are not inside quotes. Getting the last newline then would be a simple CLZ of that mask. And there wouldn't be a need to resort to byte to byte processing.
In our tests, going byte to byte for more iterations to keep the alignment when hitting the "else case" performed worse than making the unaligned loads, but as you say "just use CLMUL" (as all loads will be aligned) :D
Even with the CLMUL trick, CSV parsing does not play nice. It can be made to work for JSON parsing because you can make more assumptions. With CSV, it only works smoothly if you are willing to accept a subset of what most spreadsheet programs accept i.e. to assume your CSV is "well-formed". Considering for example that the following three cells:
AA"A,"AA""A","A"A"A
when opened in Excel will all give you the same value, using CLMUL to normalize will require many repeated additional SIMD operations-- probably at least 8 if not more. At some vector size it will be worth it, but not clear at 256. The irony is, if you are stuck with CSV input, then the fact that you couldn't get a better format/encoding also suggests that you can't assume your CSV is "well-formed"
The carryless multiplication instructions are amazing and people should use them more often. They are just so poorly explained that they feel like magic.
Staying with Physics, "Gb/S" is Gigabarns per Siemens. Some relation of electrical conductance with cross-sectional area.
The barn is a unit of cross-sectional area, based on the Uranium nucleus (area 1 barn). Uranium is pretty large in atomic terms; the name is from the idiom "couldn't hit the broad side of a barn".
Stay tuned for a SIMD powered CSV parser library and standalone utility about to drop this weekend. Alpha, but test showing it to be faster than anything else we could get our hands on
Splitting CSV file into chunks and process them independently won't necessarily be wrong (although there are implementations out there that I won't name would, because they do guess). The trick however requires to scan twice: https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...
If you're doing user-supplied CSVs, definitely... but if you are ingesting CSVs from a known source with known format (<insert audible sigh here>) it can definitely make sense to use a high-speed optimized ingester.
One might wonder if it might be worth the time to look into optimising the runtimes of various languages. I took a look, all operate on naive byte-by-byte scanning, and all sans PHP are written in the respective language which means any form of SIMD optimization is right off the table (okay, maybe something could be done in Java, but it seems incredibly complex, see https://www.morling.dev/blog/fizzbuzz-simd-style/):
[+] [-] zwegner|4 years ago|reply
Discussion: https://news.ycombinator.com/item?id=29439403
The article mentions in an addendum (and BeeOnRope also pointed it out in the HN thread) a nice CLMUL trick for dealing with quotes originally discovered by Geoff Langdale. That should work here for a nice speedup.
But without the CLMUL trick, I'd guess that the unaligned loads that generally occur after a vector containing both quotes and newlines in this version (the "else" case on lines 34-40) would hamper the performance somewhat, since it would eat up twice as much L1 cache bandwidth. I'd suggest dealing with the masks using bitwise operations in a loop, and letting i stay divisible by 16. Or just use CLMUL :)
[+] [-] davidm1729|4 years ago|reply
Thanks for pointing us to CLMUL, I'm not familiar with these kind of multiplications, but, converting the quote bitmask to a quoted bitmask would certainly make it faster. With this new bitmask, we could negate it and AND it with the newline mask, generating a mask of newlines that are not inside quotes. Getting the last newline then would be a simple CLZ of that mask. And there wouldn't be a need to resort to byte to byte processing.
In our tests, going byte to byte for more iterations to keep the alignment when hitting the "else case" performed worse than making the unaligned loads, but as you say "just use CLMUL" (as all loads will be aligned) :D
[+] [-] mattewong|4 years ago|reply
AA"A,"AA""A","A"A"A
when opened in Excel will all give you the same value, using CLMUL to normalize will require many repeated additional SIMD operations-- probably at least 8 if not more. At some vector size it will be worth it, but not clear at 256. The irony is, if you are stuck with CSV input, then the fact that you couldn't get a better format/encoding also suggests that you can't assume your CSV is "well-formed"
[+] [-] pclmulqdq|4 years ago|reply
[+] [-] jagrsw|4 years ago|reply
gigabytes per second
to
gigabits per siemens
:)
[+] [-] HPsquared|4 years ago|reply
The barn is a unit of cross-sectional area, based on the Uranium nucleus (area 1 barn). Uranium is pretty large in atomic terms; the name is from the idiom "couldn't hit the broad side of a barn".
[+] [-] __exit__|4 years ago|reply
Whoever fixed the title, thank you :D
[+] [-] jeltz|4 years ago|reply
[+] [-] Sebb767|4 years ago|reply
[+] [-] mattewong|4 years ago|reply
[+] [-] mattewong|4 years ago|reply
[+] [-] liuliu|4 years ago|reply
Nice article otherwise!
[+] [-] michaelg7x|4 years ago|reply
[+] [-] michaelg7x|4 years ago|reply
[0] https://github.com/simdjson/simdjson
[+] [-] Tuna-Fish|4 years ago|reply
[+] [-] rwmj|4 years ago|reply
[+] [-] mschuster91|4 years ago|reply
One might wonder if it might be worth the time to look into optimising the runtimes of various languages. I took a look, all operate on naive byte-by-byte scanning, and all sans PHP are written in the respective language which means any form of SIMD optimization is right off the table (okay, maybe something could be done in Java, but it seems incredibly complex, see https://www.morling.dev/blog/fizzbuzz-simd-style/):
- PHP isn't optimized anywhere, but at least it's C: https://github.com/php/php-src/blob/1c0e613cf1a24cdc159861e4...
- Python's C implementation is the same: https://github.com/python/cpython/blob/main/Modules/_csv.c
- Java doesn't have a "standard" way at all (https://www.baeldung.com/java-csv-file-array), and OpenCSV seems the usual object-oriented hell (https://sourceforge.net/p/opencsv/source/ci/master/tree/src/...).
- Ruby's CSV is native Ruby: https://github.com/ruby/ruby/blob/bd65757f394255ceeb2c958e87...