top | item 37385602

(no title)

dsharlet | 2 years ago

> The only case I can currently fore see where using LMUL=1 and manually unrolling instead will likely be always beneficial is vrgather operations that don't need to cross between registers in a register group (e.g. byte swapping).

What about algorithms where register pressure is an issue?

I think the problem with LMUL is it assumes that you always want to unroll the innermost dimension (where the vector loads are stride 1). That's usually, the last dimension I try to unroll, if there are any registers left over. If there is any sharing of data across any other dimension in the algorithm, it's better to tile/unroll those first.

Of course, for a simple algorithm, there will be registers left over. But I think more interesting algorithms will struggle on RVV if you must use LMUL > 1 for performance.

discuss

adgjlsfhk1|2 years ago

My favorite example of big LMUL is matmul. You can do an entire gemm microkernel in like 8 instructions with LMUL=4 by using an 7x4 kernel. You have 32 that turn into 8 registers with LMUL=4, 7 of which end up storing your C values, 1 stores your A values and you put the B values in scalar registers. Thus your entire kernel ends up being 1 4 wide vector load load and 7 4x wide FMA instructions.

camel-cdr|2 years ago

> What about algorithms where register pressure is an issue

Then you'll probably saturate the processor without using a larger LMUL, but I think many algorithms can work with LMUL=2, without running out of registers.

brucehoult|2 years ago

LMUL (and especially fractional LMUL) isn't for performance, it's for kernels with mixed-element sizes, to maximise the number of variables (and elements) you can keep in registers without spilling.

Being able to use LMUL as a way to get the effect of unrolling and hide the pointer bumps and loop control on simple loops on narrow processors, without expanding the code, is just a bonus.