top | item 26958140

mattpharr | 4 years ago

> It’s good for linear algebra with long vectors and large matrices, but SIMD is useful for many other things besides that

The main goal in ispc's design was to support SPMD (single program multiple data) programming, which is more general than pure SIMD. Handling the relatively easy cases of (dense) linear algebra that are easily expressed in SIMD wasn't a focus as it's pretty easy to do in other ways.

Rather, ispc is focused on making it easy to write code with divergent control flow over the vector lanes. This is especially painful to do in intrinsics, especially in the presence of nested divergent control flow. If you don't have that, you might as well use explicit SIMD, though perhaps via something like Eigen in order to avoid all of the ugliness of manual use of intrinsics.

> I’m pretty sure manually written SSE2 or AVX2 code (inner loop doing _mm_cmpeq_epi8 and _mm_sub_epi8, outer one doing _mm_sad_epu8 and _mm_add_epi64)

ispc is focused on 32-bit datatypes, so I'm sure that is true. I suspect it would be a more pleasant experience than intrinsics for a reduction operation of that sort over 32-bit datatypes, however.

Const-me | 4 years ago

> This is especially painful to do in intrinsics

Depends on the use case, but yes, it can be complicated due to lack of support in hardware. I’ve heard AVX-512 fixed that to an extent, but I don’t have experience with that tech.

> perhaps via something like Eigen

I do use Eigen, but sometimes I can outperform it substantially: it’s optimized for large vectors, and in some cases intrinsics can be faster — in my line of work I encounter a lot of those cases. Very small matrices like 3x3 and 4x4 fit completely in registers. Larger square matrices of size 8 or 24, and tall matrices with a small fixed column count, don’t fit there, but a complete row does, which saves a lot of RAM latency when dealing with them.

> to avoid all of the ugliness of manual use of intrinsics

I don’t believe they are ugly; I think they just have a steep learning curve.

> I suspect it would be a more pleasant experience than intrinsics for a reduction operation of that sort over 32-bit datatypes

Here’s an example of how to compute an FP32 dot product with intrinsics: https://stackoverflow.com/a/59495197/126995 I doubt ISPC’s reduction is going to result in similar code. Even clang’s automatic vectorizer (which I have a high opinion of) doesn’t do that kind of thing with multiple independent accumulators.

atom3 | 4 years ago

> Here’s an example of how to compute an FP32 dot product with intrinsics: https://stackoverflow.com/a/59495197/126995 I doubt ISPC’s reduction is going to result in similar code. Even clang’s automatic vectorizer (which I have a high opinion of) doesn’t do that kind of thing with multiple independent accumulators.

ISPC lets you request that the gang size be larger than the vector size [1], which gets you 2 accumulators out of the box. If having more accumulators is crucial, you can have them at the cost of less idiomatic ispc, but I'd argue the resulting code is still more readable.

I'm no expert, so there might be flaws I don't see, but the generated code looks good to me. The main difference I see is that ISPC does more unrolling (which may be better?).

Here is the reference implementation: https://godbolt.org/z/MxT1Kedf1

Here is the ISPC implementation: https://godbolt.org/z/qcez47GT5

[1] https://ispc.github.io/perfguide.html#choosing-a-target-vect...

creato | 4 years ago

> Even clang’s automatic vectorizer (which I have a high opinion of) is not doing that kind of stuff with multiple independent accumulators.

I think it does? I see Clang unroll reductions into multiple accumulators quite often.