top | item 41319851


xoranth | 1 year ago

Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?

Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.


neonsunset|1 year ago

They are pretty good: https://dougallj.github.io/applecpu/measurements/firestorm/L...

Gathers on AVX2 used to be problematic, but I assume that shouldn't be the case today, especially if lane-crossing is minimal? (If you do know, please share!)

TinkersW|1 year ago

Gather is still terrible; the only core that handles it well is Intel's P core. AMD issues 40+ micro-ops for an AVX2 gather (80 for AVX-512), and the Intel E core is much worse.

When using SIMD you must use either SoA (structure of arrays) or AoSoA (array of structures of arrays) for optimal performance. You can sometimes use AoS (array of structures) if you have a special hand-coded swizzle loader for the format.
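
To make the three layouts concrete, here is a minimal portable C++ sketch. Everything in it is an assumption for illustration and does not come from the thread: the names `Vec3`, `Vec3SoA`, `Vec3Block`, `to_aosoa`, and `sum_len_sq` are hypothetical, and the block width `kLanes = 8` assumes 8 floats per 256-bit register. The "swizzle loader" is written as a plain scalar loop; a hand-tuned version would presumably use shuffle intrinsics instead.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: the fields of one element sit next to each other in memory.
// Loading 8 consecutive x values needs a gather or a swizzle.
struct Vec3 { float x, y, z; };

// SoA: each field is its own contiguous array, so 8 consecutive
// x values are one unit-stride vector load.
struct Vec3SoA {
    std::vector<float> x, y, z;
};

// AoSoA: fixed-size blocks that are SoA internally.
// kLanes = 8 is an assumption (8 floats per 256-bit register).
constexpr std::size_t kLanes = 8;
struct Vec3Block {
    std::array<float, kLanes> x, y, z;
};

// Hypothetical scalar "swizzle loader": deinterleaves AoS input into
// AoSoA blocks. Unused lanes in the last block stay zero.
std::vector<Vec3Block> to_aosoa(const std::vector<Vec3>& aos) {
    // Value-initialized blocks, so padding lanes are 0.0f.
    std::vector<Vec3Block> blocks((aos.size() + kLanes - 1) / kLanes);
    assert(blocks.size() * kLanes >= aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i) {
        Vec3Block& b = blocks[i / kLanes];
        const std::size_t lane = i % kLanes;
        b.x[lane] = aos[i].x;
        b.y[lane] = aos[i].y;
        b.z[lane] = aos[i].z;
    }
    return blocks;
}

// Sum of squared lengths over AoSoA data. The inner loop reads
// contiguous lanes only, which is the access pattern a compiler
// can turn into plain vector loads without any gather.
float sum_len_sq(const std::vector<Vec3Block>& blocks) {
    std::array<float, kLanes> acc{};  // one partial sum per lane
    for (const Vec3Block& b : blocks) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            acc[lane] += b.x[lane] * b.x[lane]
                       + b.y[lane] * b.y[lane]
                       + b.z[lane] * b.z[lane];
        }
    }
    float total = 0.0f;
    for (float a : acc) total += a;
    return total;
}
```

The conversion cost is paid once in `to_aosoa`; after that, every pass over the data is unit-stride. Whether that beats gathering directly from AoS depends on how many passes you make and on the core, so measure before committing to either.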