taktoa | 1 year ago
Clock only needs to be distributed to sequential components like flip-flops or SRAMs. The number of clock-distribution wire-millimeters in a typical chip is dwarfed by the number of data wire-millimeters, and if a neural network is well trained and quantized, its activations should look random, so the number of transitions per clock should be ~0.5 (as opposed to 1 for clock wires), meaning that power can't be dominated by the clock. The flip-flops that prevent clock skew are a small % of area, so I don't think those can tip the scales either. On the other hand, in asynchronous digital logic you need a valid-bit calculation on every single piece of logic, which seems like a pretty huge overhead to me.
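To make the activity-factor argument concrete, here's a back-of-the-envelope sketch using the standard dynamic-power formula P = alpha * C * V^2 * f. The capacitance, voltage, and frequency numbers are purely hypothetical, chosen only to reflect the comment's premise that data wiring carries far more total capacitance than the clock tree:

```python
# Dynamic switching power: P = alpha * C * V^2 * f
# alpha = average transitions per clock cycle on the net.
# All numeric values below are illustrative assumptions, not measurements.

V = 0.8   # supply voltage in volts (assumed)
f = 1e9   # clock frequency, 1 GHz (assumed)

def dynamic_power(alpha, cap_farads):
    """Return switching power in watts for a net with given
    activity factor and total switched capacitance."""
    return alpha * cap_farads * V ** 2 * f

# Premise from the comment: data wire-millimeters dwarf clock
# wire-millimeters, so give data wiring ~10x the capacitance.
clock_cap = 100e-12    # 100 pF clock tree (hypothetical)
data_cap = 1000e-12    # 1 nF data wiring (hypothetical)

p_clock = dynamic_power(1.0, clock_cap)  # clock toggles every cycle
p_data = dynamic_power(0.5, data_cap)    # random data: ~0.5 transitions/cycle

print(f"clock power: {p_clock * 1e3:.0f} mW")
print(f"data power:  {p_data * 1e3:.0f} mW")
```

Even with the clock at double the activity factor, the data wiring dominates once its capacitance is large enough, which is the comment's point.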
HarHarVeryFunny | 1 year ago
There's more promise in analog chip designs, such as here:
https://spectrum.ieee.org/low-power-ai-spiking-neural-net
Or otherwise smarter architectures (software only or S/W+H/W) that design out the unnecessary calculations.
It's interesting to note how extraordinarily wasteful transformer-based LLMs are too. The transformer was designed partly inspired by linguistics and partly around the parallel hardware (GPUs etc.) available to run it on. Language mostly has only local sentence-structure dependencies, yet the transformer's self-attention mechanism has every word in a sentence paying attention to every other word (to some learned degree)! Turns out it's better to be dumb and fast than smart, although I expect future architectures will be much more efficient.
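The "every word attends to every other word" cost is easy to see in a minimal sketch of scaled dot-product attention. Shapes and values here are made up for illustration; the point is that the score matrix is n x n, so work grows quadratically with sequence length regardless of how local the actual linguistic dependencies are:

```python
# Minimal scaled dot-product attention sketch (illustrative shapes only).
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16  # 8 tokens, 16-dim embeddings (hypothetical sizes)
Q = rng.normal(size=(n, d))  # queries
K = rng.normal(size=(n, d))  # keys
V = rng.normal(size=(n, d))  # values

# Every token scores against every other token: an n x n matrix,
# so compute and memory scale as O(n^2) in sequence length.
scores = Q @ K.T / np.sqrt(d)

# Softmax over each row turns scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V  # each output token is a mix of ALL value vectors

print(scores.shape)  # (8, 8): n^2 pairwise attention scores
```

Doubling the sequence length quadruples the score matrix, which is why restricting attention to a local window (as various sparse-attention variants do) is one of the obvious efficiency levers.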