It has no need for that. count_if is a fold/reduce operation where the accumulator is simply incremented by `(int)some_condition(x)` for all x. In Rust:
  let arr = [1, 3, 4, 6, 7, 0, 9, -4];
  let n_evens = arr.iter().fold(0, |acc, i| acc + (i & 1 == 0) as usize);
  assert_eq!(n_evens, 4);
dzaima|11 months ago
The wrapping version uses vpandn + vpaddb (i.e. `acc += 1 &~ elt`). On Intel since Haswell (2013), with ymm inputs, that can manage 1.5 iterations per cycle if you unroll 2x to shorten the dependency chain.
vpsadbw, by contrast, would limit it to 1 iteration per cycle on Intel.
On AMD Zen≤2, vpsadbw is still worse, but on Zen≥3 the two approaches come out equal.
With AVX-512 the two approaches are equivalent everywhere, as far as the uops.info data goes.
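For anyone who wants to see the `acc += 1 &~ elt` idea outside of asm, here is a rough scalar sketch in Rust. The function name `count_even_wrapping` and the `LANES` constant are made up for illustration, and 32 lanes is just an assumption to mirror a ymm register; whether a compiler actually turns this into vpandn + vpaddb depends on the target and optimization settings.

```rust
/// Counts even bytes modulo 256, keeping one wrapping u8 counter per lane.
/// Hypothetical helper for illustration; not taken from any library.
fn count_even_wrapping(data: &[u8]) -> u8 {
    const LANES: usize = 32; // assumption: one lane per byte of a ymm register
    let mut acc = [0u8; LANES];

    let mut chunks = data.chunks_exact(LANES);
    for chunk in &mut chunks {
        for (a, &elt) in acc.iter_mut().zip(chunk) {
            // `1 & !elt` is 1 when the low bit of elt is clear, i.e. elt is even.
            *a = a.wrapping_add(1 & !elt);
        }
    }

    // Horizontal sum of the per-lane counters, still wrapping in u8.
    let mut total = acc.iter().fold(0u8, |t, &a| t.wrapping_add(a));

    // Leftover elements that did not fill a whole chunk.
    for &elt in chunks.remainder() {
        total = total.wrapping_add(1 & !elt);
    }
    total
}
```

Because every step wraps in u8, the result is the true count mod 256, so no widening (the job vpsadbw would do) is needed anywhere in the loop.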
grandempire|11 months ago
Sharlin|11 months ago