Bimos|1 year ago
> We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we discover that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess, yielding the current warp and let other warps work).
> To leverage this, we develop a similar script to modify the FFMA instructions in the compiled binary. Besides simply modifying the yield bit, we also flip the reuse bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion FFMA instructions.
I would say it is really mind-blowing.
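For anyone wondering what "flip a bit in the compiled binary" might look like mechanically, here is a rough sketch. It is not the script from the quote. It assumes 128-bit (16-byte) instruction words, and `YIELD_BIT`, `REUSE_BIT`, and `patch_interleaved` are hypothetical names and placeholder bit positions picked for illustration only; the real positions would have to come from an assembler reference for the target architecture. The caller is also assumed to have already located the run of instructions to patch.

```python
INSTR_BYTES = 16  # assumption: each instruction is one 128-bit word

# Hypothetical bit positions, placeholders for illustration only.
YIELD_BIT = 109
REUSE_BIT = 122


def patch_interleaved(code: bytes, first: int, count: int) -> bytes:
    """Flip the (assumed) yield and reuse bits on every other instruction.

    `first` is the index of the first instruction in the run and `count`
    is the length of the run, both measured in instructions, not bytes.
    """
    mask = (1 << YIELD_BIT) | (1 << REUSE_BIT)
    out = bytearray(code)
    # Step by 2 to get the interleaving pattern: patch, skip, patch, skip...
    for index in range(first, first + count, 2):
        offset = index * INSTR_BYTES
        word = int.from_bytes(out[offset:offset + INSTR_BYTES], "little")
        # XOR toggles both bits and leaves every other bit untouched.
        out[offset:offset + INSTR_BYTES] = (word ^ mask).to_bytes(
            INSTR_BYTES, "little"
        )
    return bytes(out)
```

Since the patch is a plain XOR, running it twice over the same run restores the original bytes, which makes A/B benchmarking the patched and unpatched binaries straightforward.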
blackeyeblitzar|1 year ago
mitthrowaway2|1 year ago
Bimos|1 year ago
Zacharias030|1 year ago
fracon|1 year ago
[deleted]
shaklee3|1 year ago
ETH_start|1 year ago
tough|1 year ago
dang|1 year ago
I love it when words turn into their opposites!
Bimos|1 year ago
kneegerman|1 year ago