I'm not sure how to reconcile the purported gains with the fact that matrix multiplies are empirically the most heavily accelerated primitive [1] on current-gen hardware, and that the "digital ops" shown here aren't even a blip on the "fraction of total compute" in Figure 6. Sure, they're very small in terms of FLOPs, but they take up a disproportionate amount of wall-clock time because they're bandwidth-bound (rough sketch below). Intuitively, adding another hop off-chip plus A/D and D/A conversion doesn't sound great, and I wonder if that's why this work sticks to efficiency rather than end-to-end throughput. Given that GPUs today mostly trade efficiency for clock rate and speed (consider that a single GPU can run at >300W TDP), how much efficiency could we gain by simply inverting that tradeoff?

[1] https://twitter.com/cHHillee/status/1601371646756933632
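To make the bandwidth-bound point concrete, here's a minimal benchmark sketch (assuming JAX with a GPU backend; the tanh pass is a hypothetical stand-in for the paper's "digital ops", not anything from the paper itself). The matmul is compute-bound and reports FLOP/s; the elementwise op does almost no arithmetic, so its effective number is memory throughput:

```python
# Minimal sketch: compute-bound matmul vs bandwidth-bound elementwise op.
# Assumes JAX with a GPU backend; numbers will vary by hardware.
import time
import jax
import jax.numpy as jnp

n = 4096
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(key_a, (n, n), dtype=jnp.float32)
b = jax.random.normal(key_b, (n, n), dtype=jnp.float32)

matmul = jax.jit(lambda x, y: x @ y)
elemwise = jax.jit(lambda x: jnp.tanh(x) + 1.0)  # stand-in for a cheap "digital op"

def bench(fn, *args, iters=50):
    fn(*args).block_until_ready()   # warm up / trigger compilation
    t0 = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
    out.block_until_ready()         # wait for async dispatch to finish
    return (time.perf_counter() - t0) / iters

t_mm = bench(matmul, a, b)
t_ew = bench(elemwise, a)

flops_mm = 2 * n**3            # multiply-adds in an n x n matmul
bytes_ew = 2 * a.size * 4      # one fp32 read + one fp32 write per element
print(f"matmul:      {t_mm*1e3:.2f} ms, {flops_mm/t_mm/1e12:.1f} TFLOP/s")
print(f"elementwise: {t_ew*1e3:.2f} ms, {bytes_ew/t_ew/1e9:.1f} GB/s effective")
```

On a typical current-gen GPU the matmul lands within a factor of a few of peak TFLOP/s, while the elementwise op tops out near DRAM bandwidth, which is why those ops cost far more time than their FLOP count suggests.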