top | item 33281257

jra101 | 3 years ago

Have you tried reducing the register count in your FP32 FMA test by increasing the iteration count and reducing the number of values computed per loop?

Instead of computing 8 independent values, compute one with 8x more iterations:

    for (int i = 0; i < count * 8; i++) {
        v0 += acc * v0;
    }

That, plus making the iteration count a compile-time constant so the compiler can unroll the loop, might help get closer to SOL (speed of light, i.e. the theoretical peak throughput).
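To make the suggestion concrete, here is a minimal plain-C sketch of the single-accumulator loop with the trip count baked in as a constant. It is only an illustration, not the actual benchmark kernel: the names `ITERATIONS` and `fma_chain` are made up for this example, and a real test would live in a GPU kernel rather than host code.

```c
/* Hypothetical compile-time trip count; the real test's count is not given in the thread. */
#define ITERATIONS 2

/* One dependent multiply-add chain: v0 = v0 + acc * v0, repeated ITERATIONS * 8 times.
 * Only v0 and acc are live across iterations, so register pressure stays minimal,
 * and the constant bound lets the compiler unroll the loop if it chooses to. */
static float fma_chain(float v0, float acc) {
    for (int i = 0; i < ITERATIONS * 8; i++) {
        v0 += acc * v0;
    }
    return v0;
}
```

Note the trade-off: each iteration depends on the previous one, so a single thread gets no instruction-level parallelism from this loop and the hardware has to hide the FMA latency with other threads in flight.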

clamchowder|3 years ago

The problem is that loop overhead matters on AMD, because AMD's compiler doesn't unroll the loop. Nvidia's does, so the overhead doesn't matter there.

WithinReason|3 years ago

Could you unroll it with #pragma unroll?
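For reference, a rough sketch of what an explicit unroll hint looks like in plain C. Clang accepts `#pragma unroll` and GCC (8 and later) uses `#pragma GCC unroll N`. Whether AMD's OpenCL compiler honors such a hint is not established in this thread, and the names `TRIP_COUNT` and `fma_chain_unrolled` are made up for the example.

```c
/* Hypothetical constant trip count, chosen only for illustration. */
#define TRIP_COUNT 8

static float fma_chain_unrolled(float v0, float acc) {
    /* The pragma is a request, not a guarantee; the compiler may still ignore it. */
#if defined(__clang__)
#pragma unroll
#elif defined(__GNUC__)
#pragma GCC unroll 8
#endif
    for (int i = 0; i < TRIP_COUNT; i++) {
        v0 += acc * v0;
    }
    return v0;
}
```

Checking the generated assembly (or the GPU ISA dump) is the only reliable way to confirm the loop was actually unrolled.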