top | item 34906048

loser777 | 3 years ago

I'm not sure how to reconcile the purported gains with the fact that matrix multiplies are empirically the most heavily accelerated primitive [1] on current-gen hardware, and that the "digital ops" shown here aren't even a blip on the "fraction of total compute" in Figure 6. Sure, they're very small in terms of FLOPs, but they take up a disproportionate amount of time because they're bandwidth-bound. Intuitively, adding another hop off-chip plus A/D or D/A conversion doesn't sound great, and I wonder if that's why this work sticks to efficiency rather than end-to-end throughput. Given that GPUs today mostly trade efficiency for clock rate and speed (consider that a single GPU can exceed 300 W TDP), how much efficiency could we gain by simply inverting that tradeoff?

[1] https://twitter.com/cHHillee/status/1601371646756933632
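To make the bandwidth-bound point concrete, here's a back-of-envelope roofline sketch. All hardware numbers are illustrative assumptions (roughly A100-class), not from the paper: an op is bandwidth-bound when its arithmetic intensity (FLOPs per byte moved) falls below the hardware's FLOPs-to-bytes ratio.

```python
# Illustrative roofline sketch -- all constants are assumptions, not measured.
PEAK_FLOPS = 312e12   # assumed ~A100-class fp16 tensor-core peak, FLOP/s
PEAK_BW = 1.6e12      # assumed ~A100-class HBM bandwidth, bytes/s

def intensity_matmul(n, bytes_per_el=2):
    # n x n matmul: 2*n^3 FLOPs, three n x n matrices moved (ideal reuse)
    return (2 * n**3) / (3 * n**2 * bytes_per_el)

def intensity_elementwise(bytes_per_el=2):
    # e.g. an activation: ~1 FLOP per element, one read plus one write
    return 1 / (2 * bytes_per_el)

ridge = PEAK_FLOPS / PEAK_BW            # ~195 FLOPs/byte
print(intensity_matmul(4096) > ridge)   # True: big matmuls are compute-bound
print(intensity_elementwise() > ridge)  # False: elementwise ops are bandwidth-bound
```

So the non-matmul "digital ops" can be a rounding error in FLOP counts while still dominating wall-clock time.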

cycomanic | 3 years ago

I haven't read this paper yet, but I'm familiar with the general line of work. The aspect that everyone ignores is that, yes, linear transformations like matrix operations or Fourier transforms are incredibly fast in optics, but the nonlinearity is the sticking point. While optical propagation can be nonlinear, you need very high intensities. The elephant in the room is that the linear operations rely on parallelism, i.e. they split the optical power up into multiple paths, so each path has very low intensity and thus exhibits little nonlinearity. The solution has been that everyone simply uses optical-to-electrical conversion and does the nonlinearity digitally (or sometimes in analog electronics). That sort of works for one layer, but completely falls apart for multiple layers: it is neither cost- nor energy-efficient to have hundreds or possibly thousands of A/D converters.
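A quick count shows why the conversions multiply so fast. This is a sketch with hypothetical width/depth values (not from the paper): if the nonlinearity is done electronically, every activation at every layer boundary must pass through an A/D conversion.

```python
# Rough conversion count for a hypothetical optical network -- the width
# and depth below are illustrative, not taken from the paper.
def adc_conversions(width, layers):
    # one A/D conversion per activation at each layer boundary
    return width * layers

# e.g. a width-1024, 96-layer network needs ~100k conversions per forward pass
print(adc_conversions(width=1024, layers=96))
```

That per-pass conversion count is what makes a single optical layer workable but a deep stack of them painful.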

regularfry | 3 years ago

It's interesting because of the scaling law. No matter how much acceleration matrix multiplication gets on an electronic circuit, its energy usage is always going to scale as O(n^2.something). The implication here is that the energy usage of doing it optically is O(1). At least, that's how I read "We found that the optical energy per multiply-accumulate (MAC) scales as 1/d where d is the Transformer width". The best you can hope for is to stay on the right side of the constant factors (which, currently, the GPU world is).