jfindley | 2 years ago
Your cache question doesn't really have a simple answer either. E.g. an AMD CPU is split into different CCXs. To simplify somewhat, each CPU is broken up into several smaller compute units, with their own caches and memory controller. Intel has a completely different ring-based approach that's harder to summarise in one sentence.
Overall though, for the sort of work you're describing the limiting factor is often memory bandwidth, not raw compute. Different platforms have very different membw/core figures, and I suspect if you started measuring that then you'd find it easier to predict your code's performance.
adrian_b|2 years ago
This can be easily noticed when comparing the base clock frequencies, which are more or less proportional to the actual clock frequencies that will be reached in multi-threaded applications. For instance, a 7950X has 4.5 GHz versus the 3.2 GHz of the 14900K. Similar differences exist between Epyc and Xeon and between Threadripper and Xeon W.
In desktop CPUs Intel can hide their very poor multi-threaded performance by allowing a much higher power consumption. However this method does not work for server and workstation CPUs, because these already have the highest TDP that is possible with the current cooling solutions, so in servers and workstations the bad Intel MT performance is much more visible. Intel hopes that this will change in 2024, when they will launch server and workstation CPUs made with the new Intel 3 CMOS process.
In the absence of actual benchmarks, a good proxy for the multi-threaded performance of a CPU is the product of the base clock frequency and the number of cores. For Intel hybrid CPUs, an E-core should be counted as 0.6 cores. For example, a Threadripper 7960X should be expected to be (24 cores x 4.2 GHz) / (16 cores x 4.5 GHz) = 1.4 times faster than a 7950X in multi-threaded applications that are limited by the CPU cores (but twice as fast in applications that are limited by the memory throughput).
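The arithmetic above can be written as a minimal Python sketch. The function name `mt_proxy` and its parameters are made up for illustration. The 8 P-core + 16 E-core split used for the 14900K is not stated in this thread and is filled in here as an assumption. This is only a rough estimate for core-bound workloads, and it ignores memory bandwidth, cache size, and similar factors.

```python
def mt_proxy(p_cores, base_ghz, e_cores=0, e_core_weight=0.6):
    """Rough multi-threaded proxy: effective core count x base clock (GHz).

    e_core_weight=0.6 follows the rule of thumb above that one Intel
    E-core counts as 0.6 of a full core. Hypothetical helper, not a benchmark.
    """
    effective_cores = p_cores + e_core_weight * e_cores
    return effective_cores * base_ghz


# Figures quoted in the comment above.
tr_7960x = mt_proxy(p_cores=24, base_ghz=4.2)
r9_7950x = mt_proxy(p_cores=16, base_ghz=4.5)
ratio = tr_7960x / r9_7950x
print(f"7960X vs 7950X: {ratio:.2f}x")  # prints "7960X vs 7950X: 1.40x"

# Hybrid example. Assumption: 14900K = 8 P-cores + 16 E-cores,
# using the 3.2 GHz base clock quoted above for all cores.
i9_14900k = mt_proxy(p_cores=8, base_ghz=3.2, e_cores=16)
```

Note that the proxy only ranks CPUs under the stated assumption that the workload is limited by the cores themselves.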
sirn|2 years ago
I disagree on this point. I would say this problem is much more critical on Intel's desktop platform than their workstation platform. Xeon Sapphire Rapids is actually very easy to cool, even on air, thanks to the CPU having a much larger surface to dissipate heat than their desktop equivalent.
I have a Xeon w9-3495X, and while power consumption is one of its weakest points, it stays under 60°C with water cooling while I pump 500W into it (25°C ambient), at which point I see a 30% to 50% gain in multithreaded performance over the default power limit. (Golden Cove needs around 10W per core, so the default 350W/56c = 6.25W is way below its performance curve.) Noctua has also shown that they're able to achieve ~700W on the U12S DX-4677[1] on this platform.
[1]: https://www.youtube.com/watch?v=dCACHpLzapc
jfindley|2 years ago
Your metric about clock speed is, I'm afraid to say, so horribly oversimplified as to be flat-out wrong. You can't just multiply core count by clock speed like that, as you're failing to take into account all sorts of other scaling factors such as memory bandwidth, cache size, AVX support and so on, which matter as much as or more than simple IPS.