Vogtinator|1 year ago
The latency is the number of cycles before an instruction's result can be consumed by another instruction, while the throughput shows how many such instructions can be pipelined per cycle, i.e. executed in parallel.
wtallis|1 year ago
BeeOnRope|1 year ago
For example, for many years Intel chips had a multiplier unit on a single port, with a latency of 3 cycles, but an inverse throughput of 1 cycle, so effectively pipelined across 3 stages.
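A minimal sketch of this (an assumed cycle-count model, not measured data) shows why a dependent chain is bound by latency while independent operations are bound by inverse throughput, using the 3-cycle/1-per-cycle multiplier numbers above:

```python
# Simple model of one fully pipelined execution unit, e.g. a multiplier
# with latency 3 and inverse throughput 1 (one new op can enter per cycle).
def cycles(n_ops, latency, inv_throughput, dependent):
    """Cycles to produce the last result of n_ops on one execution unit.

    dependent=True  -> each op consumes the previous result (latency-bound)
    dependent=False -> all ops are independent (throughput-bound)
    """
    if n_ops == 0:
        return 0
    if dependent:
        # Each op must wait the full latency of its predecessor.
        return n_ops * latency
    # A new op enters the pipeline every inv_throughput cycles;
    # the last one still needs the full latency to finish.
    return (n_ops - 1) * inv_throughput + latency

# Dependent chain of 100 multiplies: bound by the 3-cycle latency.
print(cycles(100, latency=3, inv_throughput=1, dependent=True))   # 300
# 100 independent multiplies: one retires per cycle once the pipe fills.
print(cycles(100, latency=3, inv_throughput=1, dependent=False))  # 102
```

With the pipe full, up to latency/inverse-throughput = 3 multiplies are in flight at once, which is the "pipelined across 3 stages" effect described above.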
In any case, I think uops.info [1] has replaced Agner for up-to-date and detailed information on instruction execution.
---
[1] https://uops.info/table.html
ajross|1 year ago
The Fog tables try hard to show the former, not the latter. You measure dispatch parallelism with benchmarks, not microscopes.
Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers roughly equal to their execution time.
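Extending the same kind of cycle-count sketch (with hypothetical numbers: the 18-cycle divide latency is illustrative, not a measured figure) shows why a non-pipelined unit's inverse throughput roughly equals its latency:

```python
# Sketch of a non-pipelined unit (hypothetical numbers): the unit holds
# only one operation at a time, so the next division cannot start until
# the previous one finishes, even when the two are independent.
def cycles_unpipelined(n_ops, latency):
    # Inverse throughput equals latency: no overlap between operations.
    return n_ops * latency

def cycles_pipelined(n_ops, latency, inv_throughput):
    # A new op enters every inv_throughput cycles; last op needs full latency.
    return (n_ops - 1) * inv_throughput + latency if n_ops else 0

# 10 independent divisions at an illustrative 18-cycle latency:
print(cycles_unpipelined(10, 18))       # 180: fully serialized
# Compare a hypothetical fully pipelined divider with the same latency:
print(cycles_pipelined(10, 18, 1))      # 27: latency hidden by overlap
```

This is why a benchmark of independent divisions still shows per-operation cost close to the full execution time, unlike the pipelined multiplier case.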
BeeOnRope|1 year ago
Pipelining in stages like fetch and decode is mostly hidden in these small benchmarks, but becomes visible when there are branch mispredictions, other types of flushes, I$ misses, and so on.
gpderetta|1 year ago
Being able to execute multiple instructions is more properly superscalar execution, right? In-order designs are also capable of doing it, and the separate execution units do not even need to run in lockstep (consider the original P5 U and V pipes).
eigenform|1 year ago
I don't think that's accurate. That latency exists because the execution unit is pipelined. If it were not pipelined, there would be no latency. The latency corresponds to the fact that "doing division" is distributed across multiple clock cycles.