It’s a good example to illustrate how to get more simd from the compiler
But the overly specific constraint means this is not a general count_if algorithm.
For this to be useful I have to:
- know there are only 255 true values
- but have a large dataset so it’s worth optimizing
- not want to stop early when some threshold is met
This is so specialized it’s not even worth having a generic predicate argument for.
Seems like you can do this sort of speed up even without the 256 constraint. Just run this sped up version, but after each set of 256 iterations, dump all the 8-bit counters to the 64-bit final result.
Some people already mentioned this in the r/cpp discussion. Small correction: 256 is not the correct number of iterations, since if all elements in that slice are even, then your 8-bit counter will wrap-around to zero, which can lead to a wrong answer. What you want is 255 iterations.
I've looked at the generated assembly for such a solution and it doesn't look great. I'm expecting a significant speed penalty, but I haven't had the time to test it today. Will probably do so tomorrow.
Like @wffurr mentioned, this is indeed discussed in a footnote. I just added another remark to the same footnote:
"It's also debatable whether or not Clang's 'optimization' results in better codegen in most cases that you care about. The same optimization pass can backfire pretty easily, because it can go the other way around too. For example, if you assigned the `std::count_if()` result to a local `uint8_t` value, but then returned that value as a `uint64_t` from the function, then Clang will assume that you wanted a `uint64_t` accumulator all along, and thus generates the poor vectorization, not the efficient one."
I was hoping you could just provide an iterator_traits with a uint8_t difference type, but this is tied to the iterator type rather than specified separately, so you'd need some kind of iterator wrapper to do this.
Why did the compiler even chose to fetch DWORDs only in the first place? It's unclear to me why the accumulator (apparently) determines the vectorization width?
I'd normally agree with you, but in this case this function is meant to be used for vectorizable input so it doesn't really matter since it's using a random-access iterator, otherwise you should go with the usual std::count_if().
Meta-question: Given how common :: is in programming languages, and how rare of a typo it is, is it really worth it for HN's title filtering logic to include a rule for automatically replacing it with a single :?
Soon I am wondering if rather than rely on finicky auto-vectorization we’ll just have LLMs help “hand-optimize” more routines. Just like how memcmp and memcpy are optimized by hand today maybe like 20% of the program could just be LLM-assisted assembly. @ffmpeg on X thinks maybe they are starting to get it [1] and I had some success having an LLM generate working WebAssembly [2]
https://x.com/ffmpeg/status/1898408922769223994?s=46
Instead of replacing one finicky and temperamental approach (auto-vectorizers) with another (LLM codegen) I'd much rather see more exploration of explicit SIMD abstractions like Intel's ISPC language. Shader languages for GPUs had this figured out forever ago, there is a sensible middle-ground to be had between brittle compiler magic and no compiler at all.
grandempire|11 months ago
But the overly specific constraint means this is not a general count_if algorithm.
For this to be useful I have to: - know there are only 255 true values - but have a large dataset so it’s worth optimizing - not want to stop early when some threshold is met
This is so specialized it’s not even worth having a generic predicate argument for.
aqrit|11 months ago
Many programming languages/frameworks expose this operation as `reduce()`.
meisel|11 months ago
nicula|11 months ago
I've looked at the generated assembly for such a solution and it doesn't look great. I'm expecting a significant speed penalty, but I haven't had the time to test it today. Will probably do so tomorrow.
tomn|11 months ago
https://gcc.godbolt.org/z/E5oTW5eKe
nicula|11 months ago
"It's also debatable whether or not Clang's 'optimization' results in better codegen in most cases that you care about. The same optimization pass can backfire pretty easily, because it can go the other way around too. For example, if you assigned the `std::count_if()` result to a local `uint8_t` value, but then returned that value as a `uint64_t` from the function, then Clang will assume that you wanted a `uint64_t` accumulator all along, and thus generates the poor vectorization, not the efficient one."
gblargg|11 months ago
wffurr|11 months ago
newgre|11 months ago
TinkersW|11 months ago
After the loop it will do a horizontal add across the vector register to produce the final scalar result.
unknown|11 months ago
[deleted]
zombot|11 months ago
Should be ++it. Post-increment is generally more expensive, especially when you don't know the exact type you're applying it to.
listeria|11 months ago
Then again, it doesn't hurt to be pedantic.
tdhz77|11 months ago
TheCoreh|11 months ago
dcsommer|11 months ago
nicula|11 months ago
unknown|11 months ago
[deleted]
eurekabot123|11 months ago
[deleted]
fsafdsaewr|11 months ago
[deleted]
jasonthorsness|11 months ago
https://www.jasonthorsness.com/24
jsheard|11 months ago