Bimos | 1 year ago

The PTX instructions they talked about in the tech report are presumably the ones used in this code?

"For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."

magicalhippo | 1 year ago

So non-coherent refers to bypassing cache coherency, i.e. not caring about what other units might have written to that address? And the L1/L2 modifiers are there to avoid L1 thrashing, keeping the value in L2 only?

Or did I get that wrong?
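
For reference, here is a minimal CUDA sketch of how the quoted instruction could be issued through inline PTX, with a fallback selected by the DISABLE_AGGRESSIVE_PTX_INSTRS flag. The macro name LD_NC_INSTR, the helper load_int, the kernel, and the choice of ld.volatile.global as the fallback are assumptions for illustration, not taken from the quoted text. As far as I understand the PTX ISA, .nc routes the load through the non-coherent (read-only) cache path, .L1::no_allocate is an eviction-priority hint asking not to allocate the line in L1, and .L2::256B is an L2 prefetch-size hint rather than a guarantee that the value stays in L2.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical macro: pick the aggressive load or a conservative fallback.
// Only the flag name DISABLE_AGGRESSIVE_PTX_INSTRS comes from the quoted text.
#ifndef DISABLE_AGGRESSIVE_PTX_INSTRS
#define LD_NC_INSTR "ld.global.nc.L1::no_allocate.L2::256B"
#else
#define LD_NC_INSTR "ld.volatile.global"
#endif

// Hypothetical helper: load one 32-bit int from global memory via inline PTX.
__device__ __forceinline__ int load_int(const int* ptr) {
    int value;
    asm volatile(LD_NC_INSTR ".s32 %0, [%1];" : "=r"(value) : "l"(ptr));
    return value;
}

// Copies src to dst. Nothing writes src while the kernel runs, so the .nc
// load is well defined here; the undefined behavior in the quote concerns
// memory that other agents may modify during the kernel.
__global__ void copy_kernel(const int* src, int* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = load_int(src + i);
}

int main() {
    const int n = 1024;
    int host_src[n], host_dst[n];
    for (int i = 0; i < n; ++i) host_src[i] = i;

    int *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(int));
    cudaMalloc(&dst, n * sizeof(int));
    cudaMemcpy(src, host_src, n * sizeof(int), cudaMemcpyHostToDevice);

    copy_kernel<<<(n + 255) / 256, 256>>>(src, dst, n);
    cudaMemcpy(host_dst, dst, n * sizeof(int), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (host_dst[i] != host_src[i]);
    printf("mismatches: %d\n", mismatches);

    cudaFree(src);
    cudaFree(dst);
    return mismatches != 0;
}
```

The cache hints should need a reasonably recent target, so something like nvcc -arch=sm_90 (Hopper, the platform named in the quote) is the assumed build; adding -DDISABLE_AGGRESSIVE_PTX_INSTRS switches to the fallback load.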

helloericsf | 1 year ago

This might help: https://x.com/main_horse/status/1894215779521794058/photo/1