Fun fact: Alex Krizhevsky's cuda-convnet was also an early adopter of the CHWN tensor layout. Putting the batch size N innermost (the fastest-varying dimension) does limit you to batch sizes that are multiples of the warp size (typically 32), but in exchange it makes fast kernels for all your neural-net and tensor ops, including convolutions, much easier to write: assigning one thread per image gives you coalesced memory access essentially for free, without getting nearly as deep into the weeds of microarchitectural optimization.
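A quick way to see why CHWN pairs naturally with warp-sized batches: with N innermost, the 32 values at a fixed (c, h, w) across the batch sit next to each other in memory, so a warp of 32 threads, one per image, touches one contiguous cache line. A minimal NumPy sketch (shapes are illustrative, not cuda-convnet's actual sizes):

```python
import numpy as np

# Illustrative shapes: C channels, HxW spatial, N batch.
C, H, W, N = 3, 4, 4, 32

# CHWN layout: batch index n is the innermost (fastest-varying) dimension.
chwn = np.arange(C * H * W * N, dtype=np.float32).reshape(C, H, W, N)

# In CHWN, consecutive batch elements at a fixed (c, h, w) are adjacent
# in memory (stride 1 element), so a warp of 32 threads -- one thread per
# image -- issues a single coalesced load.
assert chwn.strides[-1] == chwn.itemsize

# In NCHW, by contrast, consecutive batch elements at a fixed (c, h, w)
# are C*H*W elements apart, so the same one-thread-per-image pattern
# scatters across memory.
nchw = np.ascontiguousarray(chwn.transpose(3, 0, 1, 2))
assert nchw.strides[0] == C * H * W * nchw.itemsize
```

The same reasoning explains the batch-size constraint: if N is not a multiple of 32, the last warp is partially idle and the tidy thread-to-image mapping breaks down at the edges.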