You could keep a separate u8 accumulator for each shuffle, which should allow 28 iterations before any lane can overflow. You'd then merge those together at the end of the 28-iteration loop by widening to u16, at which point you could have an outer loop accumulating in u16 until that runs out too.
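To make the nesting concrete, here's a rough scalar model of it in C (no intrinsics, plain arrays standing in for vector lanes). Everything specific in it is my assumption rather than something from the thread: two shuffles (`NSHUF`), 16 byte lanes, each shuffle adding at most 9 to a lane per iteration (`MAX_PER_STEP`, picked because 28 × 9 = 252 is the largest multiple that still fits in a u8), and the name `sum_shuffle_outputs`. Treat it as a sketch of the loop structure, not the real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    LANES = 16,        /* byte lanes in one 128-bit vector (assumed) */
    NSHUF = 2,         /* number of shuffles per iteration (assumed) */
    MAX_PER_STEP = 9,  /* ASSUMPTION: most one shuffle adds to a lane */
    /* u8 lane stays <= 255 for this many iterations: 255 / 9 = 28 */
    INNER_ITERS = 255 / MAX_PER_STEP,
    /* each inner block adds at most NSHUF * 28 * 9 = 504 to a u16 lane */
    OUTER_ITERS = 65535 / (NSHUF * INNER_ITERS * MAX_PER_STEP)
};

/* vals holds nblocks * NSHUF * LANES bytes: the (hypothetical) shuffle
 * outputs for each iteration, one LANES-wide row per shuffle. */
uint64_t sum_shuffle_outputs(const uint8_t *vals, size_t nblocks)
{
    uint64_t total = 0;
    size_t block = 0;

    while (block < nblocks) {
        uint16_t acc16[LANES] = {0};

        /* outer loop: accumulate in u16 until that would run out */
        for (int outer = 0; outer < OUTER_ITERS && block < nblocks; outer++) {
            uint8_t acc8[NSHUF][LANES] = {{0}}; /* one u8 accumulator per shuffle */

            /* inner loop: at most 28 iterations per u8 accumulator */
            for (int inner = 0; inner < INNER_ITERS && block < nblocks;
                 inner++, block++) {
                const uint8_t *row = vals + block * NSHUF * LANES;
                for (int s = 0; s < NSHUF; s++) {
                    for (int l = 0; l < LANES; l++) {
                        assert(row[s * LANES + l] <= MAX_PER_STEP);
                        acc8[s][l] = (uint8_t)(acc8[s][l] + row[s * LANES + l]);
                    }
                }
            }

            /* merge the per-shuffle accumulators by widening to u16 */
            for (int s = 0; s < NSHUF; s++)
                for (int l = 0; l < LANES; l++)
                    acc16[l] = (uint16_t)(acc16[l] + acc8[s][l]);
        }

        /* u16 lanes are about to run out: flush into a wide scalar total */
        for (int l = 0; l < LANES; l++)
            total += acc16[l];
    }
    return total;
}
```

With these assumed numbers the u16 level lasts 130 inner blocks (130 × 504 = 65520), so the expensive final flush happens only once every 3640 iterations. In real SIMD code the widening step would be a pairwise-add or unpack instruction instead of the scalar loop here.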
dzaima|1 year ago