fossa1
8 months ago
This is a textbook case of micro-architectural reality beating theoretical elegance. It's fascinating that replacing 5 loads with 2 loads + 3 vextq_f32 intrinsics, which should reduce memory pressure, ends up slower because of execution port contention and dependency chains.
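For anyone who hasn't seen the pattern, here is a rough sketch of the two variants as I understand them. The original kernel isn't quoted here, so this assumes a sliding window of five overlapping 4-float vectors over eight contiguous floats; the function names (`windows_5_loads`, `windows_2_loads_3_ext`, `self_check`) are made up for illustration. The non-NEON branch is only a scalar stand-in so the sketch compiles on other targets.

```c
#include <assert.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Variant A (hypothetical name): five overlapping loads from p[0..7].
 * Every load is independent of the others. */
static void windows_5_loads(const float *p, float out[5][4]) {
#if HAVE_NEON
    vst1q_f32(out[0], vld1q_f32(p + 0));
    vst1q_f32(out[1], vld1q_f32(p + 1));
    vst1q_f32(out[2], vld1q_f32(p + 2));
    vst1q_f32(out[3], vld1q_f32(p + 3));
    vst1q_f32(out[4], vld1q_f32(p + 4));
#else
    /* Scalar stand-in: copy each 4-float window. */
    for (int w = 0; w < 5; ++w)
        memcpy(out[w], p + w, 4 * sizeof(float));
#endif
}

/* Variant B (hypothetical name): two loads, then three vextq_f32.
 * vextq_f32(lo, hi, n) yields {lo[n..3], hi[0..n-1]}, so each of the
 * three middle windows has to wait for BOTH loads to complete. */
static void windows_2_loads_3_ext(const float *p, float out[5][4]) {
#if HAVE_NEON
    float32x4_t lo = vld1q_f32(p);
    float32x4_t hi = vld1q_f32(p + 4);
    vst1q_f32(out[0], lo);
    vst1q_f32(out[1], vextq_f32(lo, hi, 1));
    vst1q_f32(out[2], vextq_f32(lo, hi, 2));
    vst1q_f32(out[3], vextq_f32(lo, hi, 3));
    vst1q_f32(out[4], hi);
#else
    /* Scalar stand-in for the same lane selection. */
    float lo[4], hi[4];
    memcpy(lo, p, sizeof lo);
    memcpy(hi, p + 4, sizeof hi);
    memcpy(out[0], lo, sizeof lo);
    for (int n = 1; n <= 3; ++n)
        for (int i = 0; i < 4; ++i)
            out[n][i] = (i + n < 4) ? lo[i + n] : hi[i + n - 4];
    memcpy(out[4], hi, sizeof hi);
#endif
}

/* Returns 1 if both variants produce identical windows. */
static int self_check(void) {
    const float p[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float a[5][4], b[5][4];
    windows_5_loads(p, a);
    windows_2_loads_3_ext(p, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both variants produce the same five windows, so the difference is purely where the work lands. The loads in A typically issue on the load pipes, while the vext instructions in B usually go to the vector pipes that the surrounding arithmetic also needs, and they sit behind both loads in the dependency chain. Which one wins depends on the core, so measure on the actual target.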