Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?
Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.
Gathers on AVX2 used to be problematic, but assume it shouldn't be the case today especially if the lane-crossing is minimal? (if you do know, please share!)
Gather is still terrible, the only core that handles it well is the Intel's P core. AMD issues 40+ micro ops in AVX2(80 in AVX512), and the Intel E core is much worse.
When using SIMD you must either use SoA or AoSoA for optimal performance. You can sometimes use AoS if you have a special hand coded swizzle loader for the format.
neonsunset|1 year ago
Gathers on AVX2 used to be problematic, but assume it shouldn't be the case today especially if the lane-crossing is minimal? (if you do know, please share!)
TinkersW|1 year ago
When using SIMD you must either use SoA or AoSoA for optimal performance. You can sometimes use AoS if you have a special hand coded swizzle loader for the format.