(no title)

kioku | 26 days ago

> Our key insight is to offload critical softmax primitives to idle tensor units, maximizing hardware utilization and throughput.

> … speedups of 1.05–1.17× across diverse attention configurations on Ampere and Hopper GPUs …

No comments yet.