top | item 43310304

(no title)

aqrit | 11 months ago

A optimized version would use 64-bit accumulators (`psadbw` on SSE2, or some sort of horizontal adds on NEON). The `255` max constraint is pointless.

Many programming languages/frameworks expose this operation as `reduce()`.

discuss

order

dzaima|11 months ago

It's not that trivial:

The wrapping version uses vpandn + vpaddb (i.e. `acc += 1 &~ elt`). On Intel since Haswell (2013) on ymm inputs that can manage 1.5 iterations per cycle, if unroll 2x to reduce the dependency chain.

Whereas vpsadbw would limit it to 1 iteration per cycle on Intel.

On AMD Zen≤2, vpsadbw is still worse, but Zen≥3 manages to have the two approaches be equal.

On AVX-512 the two approaches are equivalent everywhere as far as uops.info data goes.

grandempire|11 months ago

Reduce does not accept a predicate.

Sharlin|11 months ago

It has no need for that. count_if is a fold/reduce operation where the accumulator is simply incremented by `(int)some_condition(x)` for all x. In Rust:

  let arr = [ 1, 3, 4, 6,7, 0, 9, -4];
  let n_evens = arr.iter().fold(0, |acc, i| acc + (i & 1 == 0) as usize);
  assert_eq!(n_evens, 4);
Or more generally,

  fn count_if<T>(it: impl Iterator<Item=T>, pred: impl Fn(&T) -> bool) -> usize {
      it.fold(0, |acc, t| acc + pred(&t) as usize)
  }