top | item 39741076

stncls | 1 year ago

In many cases, yes. Some single-threaded workloads are very sensitive to, e.g., memory latency: they end up spending most of their time with the CPU stalled, waiting for a cache-missing memory load to arrive.

Typically, those would be sequential algorithms with large memory needs and very random (think: hash table) memory accesses.

Examples: SAT solvers, anything relying on sparse linear algebra.

twoodfin | 1 year ago

Obscure personal conspiracy theory: the CPU vendors, notably Intel, deliberately avoid adding the telemetry that would make it trivial for the OS to report the percentage of time spent waiting on memory.

Users might realize how many of their cores and cycles are effectively wasted by the limits of the memory/cache hierarchy, and stop thinking of their workloads as “CPU bound”.

awjlogan | 1 year ago

Armv8.4 onwards has exactly this (https://docs.kernel.org/arch/arm64/amu.html): it counts the (active) cycles in which no instruction can be dispatched because the core is waiting for data. The percentage of such stalled cycles can be very high, so there are big improvements to be found with faster memory (both latency and throughput).

fulafel | 1 year ago

The performance counters for that have been in the chips for a long time. You can argue that perf(1) has an unfriendly UX, of course.

NekkoDroid | 1 year ago

I think AMD has a tool in AMD uProf that tracks something related (cache misses).