yaantc | 1 year ago
In "Computer Architecture, A Quantitative Approach" there are numbers for the now-old TSMC 45nm process: a 32-bit FP multiplication takes 3.7 pJ, and a 32-bit read from an 8 kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its tag comparison and LRU logic (which is more expensive).
Then I have some 2015 numbers for the Intel 22nm process, also old. A 64-bit FP multiplication takes 6.4 pJ, a 64-bit read/write from a small 8 kB SRAM 4.2 pJ, and from a larger 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more expensive cache.
The cost of a multiplication grows roughly quadratically with operand width, while access cost should grow more linearly, so the computation cost in the second example is much heavier (compare the mantissa sizes, since that's what is actually multiplied: 24 significand bits for FP32 vs 53 for FP64).
The trend gets even worse with more advanced processes. Data movement is usually what matters most now, except for workloads with very high arithmetic intensity where computation will dominate (in practice: large enough matrix multiplications).
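A back-of-envelope sketch of the comparison above, using only the energy figures quoted in this comment (the mantissa widths are the standard IEEE-754 significand sizes, including the hidden bit):

```python
# Energy figures quoted above (picojoules).
fp32_mul_pj_45nm = 3.7      # 32-bit FP multiply, TSMC 45 nm
sram8k_read_pj_45nm = 5.0   # 32-bit read, 8 kB SRAM, TSMC 45 nm

fp64_mul_pj_22nm = 6.4      # 64-bit FP multiply, Intel 22 nm
sram8k_rw_pj_22nm = 4.2     # 64-bit access, 8 kB SRAM, Intel 22 nm
sram256k_rw_pj_22nm = 16.7  # 64-bit access, 256 kB SRAM, Intel 22 nm

# A naive array multiplier grows roughly as the square of the
# significand width: FP32 has 24 significand bits, FP64 has 53.
mantissa_growth = 53**2 / 24**2  # ~4.9x more partial products

print(f"FP64/FP32 partial-product growth: {mantissa_growth:.1f}x")
print(f"45nm: multiply / 8kB read   = {fp32_mul_pj_45nm / sram8k_read_pj_45nm:.2f}")
print(f"22nm: multiply / 8kB access = {fp64_mul_pj_22nm / sram8k_rw_pj_22nm:.2f}")
print(f"22nm: 256kB / 8kB access    = {sram256k_rw_pj_22nm / sram8k_rw_pj_22nm:.2f}")
```

At 45 nm the 32-bit multiply is cheaper than the SRAM read (ratio below 1); at 22 nm the much larger 64-bit multiply already costs about 1.5x a small-SRAM access, and a 256 kB SRAM access costs ~4x the small one.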
eigenform|1 year ago
- Capturing 2x512b from the L1D cache
- Sending 2x512b to the vector register file
- Capturing 4x512b values from the vector register file
- Actually multiplying 4x512b values
- Sending 2x512b results to the vector register file
.. and probably more?? That's already like 14*512 wires [switching constantly at 5 GHz!!], and there are probably even more intermediate stages?
jiggawatts|1 year ago
I like to ask IT people a trick question: how many numbers can a modern CPU multiply in the time it takes light to cross a room?
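One rough way to answer the trick question, under loudly stated assumptions not in the comment (a 5 m room, a 4 GHz core, and two 512-bit FMA units each doing 16 FP32 multiplies per cycle):

```python
# All parameters below are illustrative assumptions, not from the thread.
c = 299_792_458          # speed of light in vacuum, m/s
room_m = 5.0             # assumed room width
clock_hz = 4e9           # assumed clock frequency
muls_per_cycle = 32      # assumed: 2 FMA pipes * 16 FP32 lanes, one core

crossing_s = room_m / c              # ~16.7 ns for light to cross
cycles = crossing_s * clock_hz       # ~67 clock cycles in that time
muls = cycles * muls_per_cycle       # ~2100 multiplies, single core
print(f"{crossing_s * 1e9:.1f} ns ~= {cycles:.0f} cycles ~= {muls:.0f} multiplies")
```

The surprise the question trades on: in the time light travels a few meters, one core has already done a couple of thousand multiplications, and a many-core chip tens of thousands.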