top | item 37944932


jfindley | 2 years ago

Clock speed isn't a particularly meaningful measurement anymore, and hasn't been for years. For example, an AMD Genoa chip, depending on SKU, may have fairly comparable base/boost clock speeds compared to an Intel Sapphire Rapids - but in practice the single-core performance of the Intel is going to be substantially better for most code.

Your cache question doesn't really have a simple answer either. E.g. an AMD CPU is split into different CCXs. To simplify somewhat, the chip is broken up into several smaller compute complexes, each with its own caches, attached to a separate die with the memory controller. Intel has a completely different ring-based approach that's harder to summarise in one sentence.

Overall though, for the sort of work you're describing the limiting factor is often memory bandwidth, not raw compute. Different platforms have very different memory-bandwidth-per-core figures, and I suspect if you started measuring that you'd find it easier to predict your code's performance.
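The membw-per-core point can be made concrete with a rough roofline-style calculation. This is a minimal sketch; the bandwidth and FLOP figures are illustrative assumptions, not measured values for any specific SKU.

```python
# Hedged sketch: deciding whether a kernel is likely memory-bound.
# All hardware numbers below are illustrative assumptions, not vendor specs.

def bytes_per_flop_available(mem_bw_gbs, cores, gflops_per_core):
    """Bytes/FLOP the machine can supply per core with all cores busy."""
    return (mem_bw_gbs / cores) / gflops_per_core

# Hypothetical 96-core part: 460 GB/s total memory bandwidth,
# ~50 GFLOP/s sustained per core.
machine_balance = bytes_per_flop_available(460, 96, 50)

# A streaming kernel like STREAM triad needs on the order of 12 bytes/FLOP.
kernel_intensity = 12.0
memory_bound = kernel_intensity > machine_balance

print(f"machine supplies {machine_balance:.3f} B/FLOP -> memory-bound: {memory_bound}")
```

With these (assumed) numbers the machine supplies well under 1 byte per FLOP while the kernel wants ~12, so adding cores or clock speed would change nothing; only more memory bandwidth per core would.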


adrian_b | 2 years ago

While at equal clock frequency the Intel CPUs are a little faster in single-thread applications, their main weakness is that at equal power consumption their clock frequencies are much lower in multi-threaded applications, which leads to much lower multi-threaded performance.

This can be easily noticed when comparing the base clock frequencies, which are more or less proportional to the actual clock frequencies that will be reached in multi-threaded applications. For instance a 7950X has a 4.5 GHz base clock versus the 3.2 GHz of a 14900K. Similar differences exist between Epyc and Xeon, and between Threadripper and Xeon W.

In desktop CPUs Intel can hide their very poor multi-threaded performance by allowing a much higher power consumption. However this method does not work for server and workstation CPUs, because these already have the highest TDP that is possible with the current cooling solutions, so in servers and workstations the bad Intel MT performance is much more visible. Intel hopes that this will change in 2024, when they will launch server and workstation CPUs made with the new Intel 3 CMOS process.

In the absence of actual benchmarks, a good proxy for the multi-threaded performance of a CPU is the product of the base clock frequency and the number of cores. For Intel hybrid CPUs, an E-core should be counted as 0.6 cores. For example a Threadripper 7960X should be expected to be (24 cores x 4.2 GHz) / (16 cores x 4.5 GHz) = 1.4 times faster than a 7950X in multi-threaded applications that are limited by the CPU cores (but twice as fast in applications that are limited by memory throughput).
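The proxy described above is easy to compute. A minimal sketch, using only the figures from the comment (the 0.6 E-core weight is the commenter's heuristic, not a measured constant):

```python
# Hedged sketch of the base-clock x core-count proxy for MT performance.
# E-cores on Intel hybrid parts are weighted at 0.6, per the comment above.

def mt_proxy(p_cores, base_ghz, e_cores=0, e_core_weight=0.6):
    """Effective core count times base clock, in core-GHz."""
    return (p_cores + e_core_weight * e_cores) * base_ghz

tr_7960x = mt_proxy(24, 4.2)   # Threadripper 7960X: 24 cores @ 4.2 GHz base
r_7950x = mt_proxy(16, 4.5)    # Ryzen 7950X: 16 cores @ 4.5 GHz base

print(f"expected speedup: {tr_7960x / r_7950x:.2f}x")  # -> 1.40x
```

The ratio reproduces the 1.4x figure worked out in the comment; as the commenter notes, it only holds for workloads actually limited by the cores.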

sirn | 2 years ago

> However this method does not work for server and workstation CPUs, because these have already the highest TDP that is possible with the current cooling solutions, so in servers and workstations the bad Intel MT performance is much more visible

I disagree on this point. I would say this problem is much more critical on Intel's desktop platform than on their workstation platform. Xeon Sapphire Rapids is actually very easy to cool, even on air, thanks to the CPU having a much larger surface area to dissipate heat than its desktop equivalent.

I have a Xeon w9-3495X, and while power consumption is one of its weakest points, it stays under 60°C with water cooling while I pump 500W into it (25°C ambient), with which I see between +30% and +50% gains in multithreaded performance over the default power limit. (Golden Cove needs around 10W per core, so the default 350W/56c = 6.25W/core is way below its performance curve.) Noctua has also shown that they're able to dissipate ~700W with a U12S DX-4677[1] on this platform.

[1]: https://www.youtube.com/watch?v=dCACHpLzapc
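The per-core power arithmetic in the comment above can be checked directly. A minimal sketch; the ~10 W/core figure for Golden Cove is the commenter's estimate, not an official spec:

```python
# Hedged sketch of the W/core arithmetic for a 56-core Sapphire Rapids part.
cores = 56
default_limit_w = 350   # default package power limit from the comment
raised_limit_w = 500    # the raised limit the commenter runs at

print(f"default: {default_limit_w / cores:.2f} W/core")  # -> 6.25 W/core
print(f"raised:  {raised_limit_w / cores:.2f} W/core")   # -> 8.93 W/core
```

Both figures sit below the ~10 W/core the commenter estimates Golden Cove wants, which is consistent with raising the limit still yielding sizeable multithreaded gains.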

jfindley | 2 years ago

Not only is this complete and utter rubbish, it should have been obvious from context that we're not talking about desktop CPUs. 96-core desktop CPUs are not a thing, and neither of the product families I mentioned are desktop CPUs either. I neither know nor care what the differences between those desktop CPUs are, and I doubt GP does either.

Your metric about clock speed is, I'm afraid to say, so horribly oversimplified as to be flat out wrong. You can't just multiply core count by clock speed like that, as you're failing to take into account all sorts of other scaling factors such as memory bandwidth, cache size, AVX support and so on, which matter as much as or more than simple instructions per second.