I'm just speculating (and haven't read the paper yet), but it may be possible to achieve similar speedups on GPUs by pruning the smallest 20% of blocks of size ≥K×K to produce block-sparse weights[0], rather than pruning the smallest 20% of weights.[0] https://openai.com/blog/block-sparse-gpu-kernels/
No comments yet.