Another benefit of specialized hardware, besides saving dispatch costs, is saving the cost of moving data around. As your chip gets wider you want more physical (rather than architectural) registers with more ports, which means superlinear growth. Your bypass network also grows quadratically in transistor terms. And as your core gets physically bigger, you burn more power moving data over longer distances.
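A rough back-of-the-envelope sketch (my own illustration, not from the comment) of why the bypass network grows quadratically: every result produced in a cycle may need forwarding to either source operand of every instruction issued the next cycle, so doubling issue width quadruples the forwarding paths.

```python
def bypass_paths(width: int) -> int:
    # Each of `width` results per cycle may forward to either of the two
    # source operands of each of `width` consumers next cycle, ignoring
    # pipeline depth: ~2 * width^2 forwarding paths.
    return 2 * width * width

for w in (1, 2, 4, 8):
    print(w, bypass_paths(w))   # doubling width quadruples the paths
```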
But in dedicated hardware you can just gang your operations into dataflows where the output of one stage feeds into the physically adjacent next stage with no need to make a trip through the register file or bypass network.
A lot of the benefit of hardware vector operations over scalar operations is in the dispatch cost, but most of the benefit from hardware matrix operations over hardware vector operations is from reduced data movement.
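A quick way to see the data-movement point is arithmetic intensity (operations per element moved): a vector add does O(N) work on O(N) data, while an N×N matrix multiply does O(N³) work on O(N²) data, so each operand fetched is reused roughly N times. A minimal sketch (function names are my own, for illustration):

```python
def vector_add_intensity(n: int) -> float:
    flops = n          # one add per element
    moved = 3 * n      # two input vectors plus one output vector
    return flops / moved

def matmul_intensity(n: int) -> float:
    flops = 2 * n ** 3     # n^3 multiply-adds for an n x n matmul
    moved = 3 * n ** 2     # A and B in, C out
    return flops / moved

print(vector_add_intensity(64))   # ~0.33, independent of size
print(matmul_intensity(64))       # ~42.7, and grows with n
```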
EDIT: Of course, the post is from 2012 back when nobody was doing hardware matrix multiplication so it's understandable.
A lot of people including "me" (TFA author, of course "me" = larger team) were doing hardware matrix multiplication for many years before 2012 [let's say before deep learning.] I count "moving data around" as "dispatching costs" [dispatching = figuring out what to do, on what data and where to put results as opposed to actually doing it.]
I recall a demo app from my teen years (late 1980s to early '90s) that demonstrated, after installing your coprocessor, improved 3D performance by rotating a 3D sprite on the screen at "high" speed. This was an Intel 80386 with an 80387 coprocessor - I thought it was doing matrix math. Nope, the 80387 was an external floating-point unit that arrived two years after the processor with which it was paired. Cringey.
Reminds me of when I was doing firmware development and the ASIC team would ask if they could toss in an extra Cortex-M3 core to solve specific control problems. Those cores would be used as programmable state machines. For the ASIC team, tossing in an extra core was free compared to custom logic design. For the firmware team, however, it was another piece of firmware to write and test. We had designs with upwards of 10 Cortex-M3 cores. A friend at another employer had designs with something like 32 such cores, and they were a pain to debug.
In addition to TFA's reason: bluntly, because we have the transistors. Dennard scaling has ended, which means we can't continue to increase clock frequencies. However, transistor counts have continued to increase. This has basically forced CPU manufacturers to focus on multicore, because we have the transistors.
Also, big/little, gating off unused silicon and other approaches can save energy even as they use more transistors.
There is also a heat aspect to this. You can run many cores at a lower frequency to save on heat. This makes performance more consistent, since you avoid the thermal throttling you'd get from a single bursty core.
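The many-slow-cores trade-off can be made concrete with the standard dynamic-power model, P ∝ C·V²·f. Assuming supply voltage scales roughly linearly with frequency (an idealization), per-core power goes as f³, so two cores at half the clock match one fast core's throughput at a quarter of the power:

```python
def relative_power(cores: int, freq: float) -> float:
    # Idealized model: per-core dynamic power ~ f^3 (since V tracks f),
    # summed over cores. Units are relative to one core at freq = 1.0.
    return cores * freq ** 3

one_fast = relative_power(1, 1.0)   # 1.0
two_slow = relative_power(2, 0.5)   # 2 * 0.125 = 0.25
print(two_slow / one_fast)          # 0.25: same throughput, 4x less power
```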
The "extract bits 3 to 7 and multiply by 13" example is about data locality, to some extent. It's cheaper to keep data in a local circuit than to ship it around between general-purpose registers.
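In software that operation is several instructions shuttling a value through registers; in hardware the shift and mask are just wires feeding a constant multiplier. A sketch of the software side (function name is my own):

```python
def extract_and_scale(x: int) -> int:
    # Extract bits 3..7 (a 5-bit field), then multiply by 13.
    # In a CPU this is a shift, a mask, and a multiply, each making a
    # round trip through the register file or bypass network.
    field = (x >> 3) & 0x1F
    return field * 13

print(extract_and_scale(0xFF))   # field = 31, so 31 * 13 = 403
```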
"Done in hardware" means "done directly in hardware"; the directly part is understood, because everyone knows that everything is ultimately done in hardware.
Something not done directly in hardware is done in software. That means it's done using more hardware resources compared to directly in hardware.
QED; directly in hardware is cheaper.
Cheaper to operate, anyway, not necessarily cheaper to produce. You have to move a decent volume before it becomes economic to optimize a solution into hardware. Also, a mistake discovered in the field in hardware is more costly than a mistake in upgradable software.
I don't agree with the tone of the article. Doing complex functions in hardware is a lot cheaper, often by 100X, compared to doing them in software.
As an extreme case, to do a simple 32-bit add, you light up tens of millions of transistors if the addition goes through a CPU pipeline. The adder itself of course only requires a few transistors...
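To make the contrast vivid, here is a 32-bit add modeled at the gate level: roughly five logic operations per bit, a few hundred gates in total, versus the tens of millions of transistors the surrounding pipeline lights up to fetch, decode, and retire the same instruction. (My own illustration; the function name is made up.)

```python
def ripple_add(a: int, b: int, bits: int = 32) -> int:
    # Ripple-carry adder: per bit, the sum is two XOR gates and the
    # carry is two ANDs and an OR -- the "few transistors" in question.
    carry, result = 0, 0
    for i in range(bits):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        result |= s << i
    return result  # wraps modulo 2**bits, like the hardware would

print(ripple_add(0xFFFFFFFF, 1))   # wraps around to 0
```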
Saying that "specialization saves dispatching costs" is minimizing the savings by orders of magnitude. Of course, the article is correct in pointing out that hardware doesn't make things free.

[source: my day job]
TFA author. How does the word "dispatching" hint at the order of magnitude of anything? What the orders of magnitude do depend on is your competition. Outdoing a CPU at operation X is easier than outdoing a GPU at X [assuming the GPU does X reasonably well], which is easier than outdoing a DSP at X [assuming the DSP does X reasonably well]. If your competition is reasonably optimized programmable accelerators, your opportunities to beat it start shrinking.

Source: my day job
It's done in hardware because it's faster, and tech support can just switch it on and off in case of panic; flicking the switch is a moment of anxiety relief for those with a tendency to flick a switch in times of absolute panic.
It's a good article, but the author missed the third reason hardware can be vastly cheaper, which is lack of abstraction.
You can use Google search to add 0x9 + 0x2 and get hexadecimal 0xb... however, that involves dozens of layers of abstraction and endless formatting and parsing that are fundamentally useless in the long run for something like a GPU display.
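Even without a search engine in the loop, the parse/format layers dwarf the arithmetic: the actual work is one machine add, and everything else is string handling. A minimal sketch of those layers:

```python
# Parse the hex digits (layer 1), do the one real operation (the add),
# then format the result back to hex for display (layer 2).
a, b = int("9", 16), int("2", 16)
answer = format(a + b, "x")
print(answer)   # "b"
```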
The 4th reason hardware is vastly cheaper is that it needs less testing.
In the example above you can either trust your FPGA/ASIC tools to implement a byte-wide full adder properly, because that's kind of a basic task for that technology, or you can whack the byte-wide adder with all possible test cases in a couple of ns on real hardware; all possible binary inputs and outputs are well known and trivial. When you ask Google, or worse, Alexa, to add two hexadecimal digits, there is an uncountable number of theoretical buffer overflows, MITM attacks, possible spyware/virus infections, and similar nonsense at multiple layers you probably aren't even aware of.
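Exhaustive verification of a byte-wide adder really is trivial: there are only 2^16 input pairs, checkable in well under a second even in software. A sketch (the harness name and the passed-in adder are my own, for illustration):

```python
def check_byte_adder(add8) -> bool:
    # Exhaustively compare an 8-bit adder against the reference:
    # all 256 * 256 input pairs, with the 9th bit as carry-out.
    for a in range(256):
        for b in range(256):
            if add8(a, b) != (a + b) & 0x1FF:
                return False
    return True

print(check_byte_adder(lambda a, b: a + b))            # True
print(check_byte_adder(lambda a, b: (a + b) & 0xFF))   # False: drops carry-out
```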
The 5th reason hardware is vastly cheaper is environmentalism and energy costs. I have trouble estimating the energy cost of a byte-wide adder in an ASIC or a CPU; surely it can't be more than charging and discharging a couple of sub-pF capacitors. It takes billions of transistors switching like crazy to dump the 100 watts a server motherboard can dump, and a full adder doesn't take many transistors. On the other hand, the infrastructure and environmental damage required to ask Alexa to add two hex digits is very high. You can piggyback on it by passing the buck: well, we need that environmental damage and economic cost to enable Netflix, at which point asking Alexa questions is a drop in the bucket. But people have polluted for centuries on the same argument (well, it's just a little extra lost plutonium, and compared to above-ground nuclear testing it's a drop in the bucket, etc.)
The Alexa bit really shows that processing speed or "cheapness" doesn't always matter. The VM in AWS that eventually does that hex addition could have done 10^N more of those same additions in the time it takes Alexa to hear the question and respond.
But, humans are big and laggy, and I don't know if I could type in the question to Google or even a terminal faster than getting the answer from Alexa.
If you look at specialized processors for, say, convolutions, the majority of the benefits come from exploiting data locality.

(And no, I've never heard the term "dispatch" used for data movement.)
This [hardware needing less testing and being vastly cheaper] goes against all the experience I've had with hardware, and everything I have ever heard from every single embedded/electronics engineer.