The kind of CUDA you or I would write is not very hardware specific (a few constants here and there) but the kind of CUDA behind cuBLAS with a million magic flags, inline PTX ("GPU assembly") and exploitation of driver/firmware hacks is. It's like the difference between numerics code in C and and numerics code in C with tons of in-line assembly code for each one of a number of specific processors.You can see similar things if you buy datacenter-grade CPUs from AMD or Intel and compare their per-model optimized BLAS builds and compilers to using OpenBLAS or swapping them around. The difference is not world ending but you can see maybe 50% in some cases.
No comments yet.