top | item 42550193


Vogtinator | 1 year ago

For x86 cores this is visible in Agner Fog's instruction performance tables: https://agner.org/optimize/#manuals

The latency shows after how many cycles the result of an instruction can be consumed by another, while the throughput shows how many independent instances of that instruction can be started per cycle, i.e. executed in parallel.
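To make the distinction concrete, here's a toy back-of-the-envelope model (my numbers, not from the tables) of how many cycles N copies of one instruction take on a single fully pipelined unit:

```python
# Toy model, not real hardware: latency vs. throughput bounds for
# executing n copies of one instruction on a single pipelined unit.

def cycles_dependent_chain(n, latency):
    # Each instruction consumes the previous result, so nothing
    # overlaps: the chain is latency-bound.
    return n * latency

def cycles_independent(n, latency, inverse_throughput):
    # Independent instructions overlap in the pipeline: one can start
    # every `inverse_throughput` cycles, and the last one finishes
    # `latency` cycles after it starts.
    return (n - 1) * inverse_throughput + latency

# With latency 3 and inverse throughput 1 (typical for a pipelined
# multiply), a dependent chain is ~3x slower than the same work done
# on independent inputs:
assert cycles_dependent_chain(100, latency=3) == 300
assert cycles_independent(100, latency=3, inverse_throughput=1) == 102
```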



wtallis | 1 year ago

I believe the throughput shown in those tables is the total throughput for the whole CPU core, so it isn't immediately obvious which instructions have high throughput due to pipelining within an execution unit and which have high throughput due just to the core having several execution units capable of handling that instruction.
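For instance (a made-up sketch, not data from the tables), a single fully pipelined port and two half-rate ports produce the same whole-core number:

```python
# Toy illustration: the same aggregate throughput can come from one
# pipelined unit or from several slower replicated units, so the
# table entry alone doesn't tell you which.

def aggregate_throughput(num_ports, inverse_throughput_per_port):
    # Instructions per cycle the whole core can sustain for this op.
    return num_ports / inverse_throughput_per_port

one_pipelined_port = aggregate_throughput(1, 1)   # a new op every cycle
two_half_rate_ports = aggregate_throughput(2, 2)  # each port accepts an op every other cycle

assert one_pipelined_port == two_half_rate_ports == 1.0
```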

BeeOnRope | 1 year ago

That's true, but another part of the tables shows how many "ports" the operation can be executed on, which is enough information to conclude that an operation is pipelined.

For example, for many years Intel chips had a multiplier unit on a single port, with a latency of 3 cycles, but an inverse throughput of 1 cycle, so effectively pipelined across 3 stages.
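A quick sketch of why that works out to 3 in flight (toy cycle counting, assuming exactly one op issues per cycle on the one port):

```python
# Toy cycle counting for a single-port multiplier with latency 3 and
# inverse throughput 1: how many independent multiplies overlap.

LATENCY = 3
INV_THROUGHPUT = 1

issue = [i * INV_THROUGHPUT for i in range(10)]  # one issues per cycle
complete = [c + LATENCY for c in issue]          # each finishes 3 cycles later

def in_flight(cycle):
    # An op is in flight if it has issued but not yet completed.
    return sum(1 for s, e in zip(issue, complete) if s <= cycle < e)

# Once the pipeline is warm, latency / inverse throughput = 3 stages
# are occupied at once:
assert in_flight(5) == 3
```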

In any case, I think uops.info [1] has replaced Agner for up-to-date and detailed information on instruction execution.

---

[1] https://uops.info/table.html

ajross | 1 year ago

FWIW, there are two ideas of parallelism being conflated here. One is the parallel execution of the different sequential steps of an instruction (e.g. fetch, decode, operate, retire). That's "pipelining", and it's a different idea than decoding multiple instructions in a cycle and sending them to one of many execution units (which is usually just called "dispatch", though "out of order execution" tends to connote the same idea in practice).

The Fog tables try hard to show the former, not the latter. You measure dispatch parallelism with benchmarks, not microscopes.

Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers roughly equal to their execution time.
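In toy-model terms (illustrative numbers, not measurements), a non-pipelined unit is one whose inverse throughput roughly equals its latency, so independent ops gain nothing from overlap:

```python
# Toy model: on a non-pipelined unit, a new op can't start until the
# previous one finishes, so inverse throughput ~= latency and
# independent ops serialize. Numbers below are illustrative only.

def cycles_independent(n, latency, inverse_throughput):
    # Time for n independent ops on one execution unit.
    return (n - 1) * inverse_throughput + latency

# Pipelined multiply (latency 3, new op every cycle): near-full overlap.
assert cycles_independent(10, latency=3, inverse_throughput=1) == 12
# Non-pipelined divide (say latency 20, inverse throughput 20): serialized.
assert cycles_independent(10, latency=20, inverse_throughput=20) == 200
```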

BeeOnRope | 1 year ago

I don't think anyone is talking about "fetch, decode, operate, retire" pipelining (though that is certainly called pipelining): only pipelining within the execution of an instruction that takes multiple cycles just to execute (i.e., latency from input-ready to output-ready).

Pipelining in stages like fetch and decode is mostly hidden in these small benchmarks, but it becomes visible when there are branch mispredictions, other types of flushes, I$ misses, and so on.

gpderetta | 1 year ago

> and it's a different idea than decoding multiple instructions in a cycle and sending them to one of many execution units (which is usually just called "dispatch", though "out of order execution

Being able to execute multiple instructions per cycle is more properly superscalar execution, right? In-order designs are also capable of doing it, and the separate execution units do not even need to run in lockstep (consider the original P5 U and V pipes).

eigenform | 1 year ago

> Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers ~= to their execution time

I don't think that's accurate. That latency exists because the execution unit is pipelined. If it were not pipelined, there would be no latency. The latency corresponds to the fact that "doing division" is distributed across multiple clock cycles.