It feels crazy to me that Intel spent years dedicating die space on consumer SKUs to "make fetch happen" with AVX-512, and now that more and more libraries are finally using it, now that Intel's goal has been achieved, they have removed AVX-512 from their consumer SKUs.
It isn't that AMD has better AVX-512 support, which would be an impressive upset on its own. It is simply that AMD has AVX-512 on consumer CPUs at all, because Intel walked away from their own investment.
That is what Intel does: they build up a market (Optane) and then do a rug pull (depth cameras). They keep doing this thing where they make a huge push into a new technology, don't see the uptake, and let it die, instead of building slowly and then making the big push at the right time. Optane support was just getting mature in the Linux kernel when they pulled it. And they focused on some weird cost-cutting move when marketing it, as a RAM replacement for semi-idle VMs. Ok.
This is an AMD CPU, but it's clear that the AVX512 benefits are marginal over the AVX2 version. Note that Intel's consumer chips do support AVX2, even on the E-cores.
But there's more to the story: This is a single-threaded benchmark. Intel gave up AVX512 to free up die space for more cores. Intel's top-of-the-line consumer part has 24 cores as a result, whereas AMD's top consumer part has 16. We'd have to look at actual Intel benchmarks to see, but if the AVX2 to AVX512 improvements are marginal, a multithreaded AVX2 version across more cores would likely outperform a multithreaded AVX512 version across fewer cores. Note that Intel's E-cores run AVX2 instructions slower than the P-cores, but again the AVX boost is marginal in this benchmark anyway.
I know people like to get angry at Intel for taking a feature away, but the real-world benefit of having AVX512 instead of only AVX2 is very minimal. In most cases, it's probably offset by having extra cores working on the problem. There are very specific workloads, often single-threaded, that benefit from AVX-512, but on a blended mix of applications and benchmarks I suspect Intel made an informed decision to do what they did.
Isn't AVX-10 on the horizon, which will have most of the goodies that AVX-512 had? (I'm actually not even sure what the difference is supposed to be between them.)
The craziest thing about Intel's AVX-512 story is that from 2015 through 2020 they were shipping consumer CPUs with die space reserved for the AVX-512 register file, but no actual AVX-512 capability. Then they shipped one short-lived generation of desktop processors and two generations of laptop processors that actually had AVX-512 functionality before disabling it.
AMD has already been shipping AVX-512 in their consumer processors for longer than Intel did.
I mean, the most interesting part of the article for me:
> A bit surprisingly the AVX2 parser on 9950X hit ~20GB/s! That is, it was better than the AVX-512 based parser by ~10%, which is pretty significant for Sep.
They fixed it, that's the whole point, but I think there's evidence that AVX-512 doesn't actually benefit consumers that much. I would be willing to settle for a laptop that can only parse 20GB/s and not 21GB/s of CSV. I think vector assembly nerds care about support much more than users.
Intel is horrible with software. My laptop has a pretty good iGPU, but it's not properly supported by PyTorch or most other software. Vulkan inference with llama.cpp does wonders, and it makes me sad that most software other than llama.cpp does not take advantage of it.
Instead of doing 4 comparisons, one against each of the characters `\n`, `\r`, `;` and `"`, followed by 3 OR operations, a common trick is to do 1 shuffle, 1 comparison, and 0 OR operations. I blogged about this trick: https://stoppels.ch/2022/11/30/io-is-no-longer-the-bottlenec... (Trick 2)
Edit: they do make use of ternary logic to avoid one OR operation, which is nice. Basically (a | b | c) | d is computed using one `vpternlogd` (for a | b | c) and one `vpor` (for the final OR).
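For readers who want to see the mechanics, here is a portable scalar model of the trick (an illustrative sketch, not code from sep or the linked post; all names are mine). `vpshufb` indexes a 16-byte table by each input byte's low nibble, and a single equality compare against the original byte then flags exactly the four delimiters, because `\n` (0x0A), `\r` (0x0D), `;` (0x3B) and `"` (0x22) happen to have distinct low nibbles:

```c
#include <assert.h>
#include <stdint.h>

/* 16-entry table indexed by the low nibble of each input byte.
   Slot 0x2 -> '"' (0x22), slot 0xA -> '\n' (0x0A), slot 0xB -> ';' (0x3B),
   slot 0xD -> '\r' (0x0D). Filler bytes are chosen so that their own low
   nibble differs from their slot index, so no input can ever equal them. */
static const uint8_t delim_table[16] = {
    0xFF, 0xFF, '"',  0xFF,
    0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, '\n', ';',
    0xFF, '\r', 0xFF, 0x00,
};

/* Scalar model of one vpshufb lane followed by one vpcmpeqb: returns
   nonzero exactly for the four delimiter characters. vpshufb zeroes the
   lookup when the input byte's top bit is set, which this models too. */
int is_delim(uint8_t c) {
    uint8_t looked_up = (c & 0x80) ? 0 : delim_table[c & 0x0F];
    return looked_up == c;
}
```

In the vectorized version the table lookup is a single `vpshufb` over 16/32 bytes at once and the compare a single `vpcmpeqb`, replacing the 4-compare, 3-OR sequence.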
This is true if you have a fixed set of characters with distinct low nibbles, but it is not necessarily faster, as the shuffle requires an extra register. I tried it and it wasn't faster here.
Take that, Intel, and your "let's remove AVX-512 from every consumer CPU because we want to put slow cores on every single one of them, and also not consider multi-pumping it"
A lot of this stems from the 10nm hole they had to dig themselves out of. Yields are bad, so costs are high, so: cut the die as much as possible, ship Atom-derived cores, and market it as an energy-saving measure. The expensive parts can be bigger, and we'll cut the margins on those to retain the server/cloud sector. Also our earnings go into the shitter and we lose market share anyway, but at least we tried.
They claim a 3GB/s improvement versus the previous version of sep on equal hardware, and unlike "marketing" benchmarks, they include the actual speed achieved and the hardware used.
Perhaps, but I think we are well past the Moore's law era where a 3x speedup is to be expected just from hardware. It's still a pretty impressive feat in the modern era.
> You can't claim this when you also do a huge hardware jump
Well, they did. Personally, I find it an interesting way of looking at it, it's a lens for the "real performance" one could get using this software year over year. (Not saying it isn't a misleading or fallacious claim though.)
If we are lucky we will see Arthur Whitney get triggered and post either a one liner beating this or a shakti engine update and a one liner beating this. Progress!
I have. I think it's a pretty easy situation for certain kinds of startups to find themselves in:
- Someone decides on CSV because it's easy to produce and you don't have that much data. Plus it's easier for the <non-software people> to read so they quit asking you to give them Excel sheets. Here <non-software people> is anyone who has a legit need to see your data and knows Excel really well. It can range from business types to lab scientists.
- Your internal processes start to consume CSV because it's what you produce. You build out key pipelines where one or more steps consume CSV.
- Suddenly your data increases by 10x or 100x or more because something started working: you got some customers, your sensor throughput improved, the science part started working, etc.
Then it starts to make sense to optimize ingesting millions or billions of lines of CSV. It buys you time so you can start moving your internal processes (and maybe some other teams' stuff) to a format more suited for this kind of data.
It's become a very common interchange format, even internally; it's also easy to deflate. I have had to work on codebases where CSV was being pumped out at basically the speed of the NIC (its origin was NetFlow, then aggregated and otherwise processed, and the results sent via CSV to a master for further aggregation and analysis).
I really don't get, though, why people can't just use protocol buffers instead. Is protobuf really that hard?
I shudder to think of what it means to be storing the _results_ of processing 21 GB/s of CSV. Hopefully some useful kind of aggregation, but if this was powering some kind of search over structured data then it has to be stored somewhere...
There’s a calculation for ns/row in the article that is never translated into rows per second, but it's about 27 ns/row, which is about 37 million rows per second. Which means these rows are about 570 bytes apiece if that's 21 GB/s. It would be nice if the benchmark stated the row size directly.
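For the record, the unit conversion goes like this (a quick check using the ~27 ns/row figure and the 21 GB/s headline; both inputs come from the discussion above, the function names are mine):

```c
#include <assert.h>

/* 27 ns/row inverts to rows per second; dividing the byte throughput by
   that gives the average row size in bytes. */
double rows_per_second(double ns_per_row) {
    return 1e9 / ns_per_row;  /* 27 ns/row -> ~37.0 million rows/s */
}

double bytes_per_row(double gbytes_per_s, double ns_per_row) {
    /* 21 GB/s at 27 ns/row -> 21e9 * 27e-9 = 567 bytes per row */
    return gbytes_per_s * 1e9 / rows_per_second(ns_per_row);
}
```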
Considering the non-standard nature of CSV, quoting throughput numbers in bytes is meaningless. It makes sense for JSON, since you know what the output is going to be (e.g. floats, integers, strings, hashmaps, etc).
With CSV you only get strings for each column, so 21 GB/s of comma splitting would be the pinnacle of meaninglessness. Like, okay, but I still have to parse the stringy data, so what gives? Yeah, the blog post does reference float parsing, but a single float per line would count as "CSV".
Now someone might counter and say that I should just read the README.md, but then that suspicion simply turns out to be true: they don't actually do any escaping or quoting by default, which makes the quoted numbers an example of heavily misleading advertising.
CSV is standardized in RFC 4180 (well, as standardized as most of what we consider an internet "standard").
Otherwise I agree: if you don't do escaping (a.k.a. "quoting", the same thing for CSV), you are not implementing it correctly. For example, in RFC 4180 a quoted line break is part of the quoted string. If you don't need to handle that, you can implement CSV parsing much faster: properly handling line breaks in quoted strings requires a 2-pass approach (if you are going to use many cores), while not handling them at all can be done in 1 pass. I discussed this detail in https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...
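The state a parser has to carry is easy to see in a minimal quote-toggling row counter (an illustrative sketch of the RFC 4180 behavior, not code from the linked post). A naive newline count gets quoted line breaks wrong; tracking whether we are inside quotes fixes it, but that state is exactly what makes naive data-parallel chunking hard:

```c
#include <assert.h>
#include <stddef.h>

/* Count logical CSV rows per RFC 4180: a '\n' terminates a row only when
   we are outside a quoted field. An escaped quote "" toggles the state
   twice, leaving it unchanged, which is all this counter needs. */
size_t count_rows(const char *s) {
    size_t rows = 0;
    int in_quotes = 0;
    for (; *s; ++s) {
        if (*s == '"')
            in_quotes = !in_quotes;      /* enter/leave a quoted field */
        else if (*s == '\n' && !in_quotes)
            rows++;                      /* row terminator only outside quotes */
    }
    return rows;
}
```

A chunked parser can't start in the middle of the file without knowing `in_quotes` at that offset, which is where the 2-pass approach comes from.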
The lack of concurrent access support in the official HDF5 library (the only implementation with full format support) can be a major drawback. There is ongoing work on that front [1] though it's unclear when it will land.
In my experience I've found it difficult to get substantial gains with custom SIMD code compared to modern compiler auto-vectorization, but to be fair that was with more vector-friendly code than JSON parsing.
chao- | 10 months ago
sitkack | 10 months ago
They keep repeating the same mistakes all the way back to https://en.wikipedia.org/wiki/Intel_iAPX_432
Aurornis | 10 months ago
Original: 18 GB/s
AVX2: 20 GB/s
AVX512: 21 GB/s
ChadNauseam | 10 months ago
wtallis | 10 months ago
tedunangst | 10 months ago
buyucu | 10 months ago
MortyWaves | 10 months ago
stabbles | 10 months ago
nietras1 | 10 months ago
justinhj | 10 months ago
Aardwolf | 10 months ago
tadfisher | 10 months ago
winterbloom | 10 months ago
You can't claim this when you also do a huge hardware jump
jbverschoor | 10 months ago
Then if we take 0.9.0 on previous hardware (13088) and add the 17%, it's about 15313. Version 0.1.0 was 7335.
So... 15313/7335 -> a staggering 2.1x improvement in just under 2 years
freeone3000 | 10 months ago
bawolff | 10 months ago
perching_aix | 10 months ago
WD-42 | 10 months ago
Straight to the trash with this post.
vessenes | 10 months ago
voidUpdate | 10 months ago
moregrist | 10 months ago
trollbridge | 10 months ago
sunrunner | 10 months ago
hermitcrab | 10 months ago
segmondy | 10 months ago
constantcrying | 10 months ago
I do not think there is an actual explanation besides ignorance, laziness or "it works".
ourmandave | 10 months ago
pak9rabid | 10 months ago
criddell | 10 months ago
Nice work!
haberman | 10 months ago
- What format exactly is it parsing? (e.g. does the dialect of CSV support quoted commas, or is the parser merely looking for commas and newlines?)
- What is the parser doing with the result (i.e. populating a data structure, etc.)?
hinkley | 10 months ago
imtringued | 10 months ago
liuliu | 10 months ago
constantcrying | 10 months ago
The HDF5 format is very good and allows far more structure in your files, as well as metadata and different types of lossless and lossy compression.
jsploit | 10 months ago
[1] https://github.com/LifeboatLLC/MT-HDF5
chpatrick | 10 months ago
auxten | 10 months ago
theropost | 10 months ago
constantcrying | 10 months ago
HDF5 gives you a great way to store such data.
anthk | 10 months ago
heh, do it again with mawk.
gitroom | 10 months ago
h4ck_th3_pl4n3t | 10 months ago
zeristor | 10 months ago
constantcrying | 10 months ago
mcraiha | 10 months ago
_ea1k | 10 months ago
It is an interesting benchmark anyway.