top | item 40954690

jclay | 1 year ago

I was seeing constant instability when running basically any large C++ build that saturated all of the cores: odd clang segfaults claiming an AST was invalid, which would succeed on a re-run.

This was getting very frustrating; at various points I tried every other option suggested online, including restoring the BIOS to Intel Baseline settings.

I came across Keean's investigations into the matter on the Intel forums:

> I think there is an easy solution for Intel, and that is to limit p-cores with both hyper-threads busy to 5.8 GHz and allow cores with only one hyper-thread active to boost up to 5.9/6.2. They would then have a chip that matched the advertised multi-core and single-thread performance, and would be stable without any specific power limits.

> I still think the real reason for this problem is that hyper-threading creates a hot-spot somewhere in the address arithmetic part of the core, and this was missed in the design of the chip. Had a thermal sensor been placed there, the chip could throttle back the core ratio to remain stable automatically, or perhaps the transistors needed to be bigger for higher current - not sure that would solve the heat problem. Ultimately an extra pipeline stage might be needed, and this would be a problem, because it would slow down when only one hyper-thread is in use too. I wonder if this has something to do with why Intel are getting rid of hyper-threading in 15th gen?

From: https://community.intel.com/t5/Processors/14900ks-unstable/m...

Based on this, I set a P-core limit of 5.8 GHz in my BIOS, and after several months of daily use building Chromium I can say this machine is now completely stable.

If you're seeing instability on an i9-14900K or 13900K, see the above forum post for more details and try setting the all-core limit. I've now seen this fix instability on 3+ of our build machines so far.
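For experimenting before (or instead of) changing BIOS settings, a roughly similar cap can be approximated from userspace on Linux through the cpufreq sysfs interface. This is only a hedged sketch, not something from the thread: it assumes the intel_pstate driver, kHz units in `scaling_max_freq`, and placeholder core IDs. It prints the writes rather than performing them; the printed lines must be run as root to take effect.

```shell
#!/bin/sh
# Hedged sketch (assumption, not from the thread): cap each listed core at
# 5.8 GHz by emitting the sysfs writes for the cpufreq interface.
# Core IDs below are placeholders; scaling_max_freq takes kHz.
LIMIT_KHZ=5800000
CORES="0 1 2 3"
cmds=""
for cpu in $CORES; do
  cmds="${cmds}echo $LIMIT_KHZ > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_max_freq
"
done
printf '%s' "$cmds"
```

Note this is coarser than the BIOS fix: it caps the core's frequency unconditionally, whereas the multiplier limit described above still allows higher boost when only one hyper-thread is active. It also resets on reboot.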

jclay | 1 year ago

I'll also add that I was never able to get the instability to show up when running the classic stress testing tools: MemBench, Prime95, and Intel's own stability tests could all run for hours and pass.

There's something unique about the workload of ninja launching a bunch of clang processes that draws this out.

On my machine, a clean build of the llvm-project would consistently fail to complete, so that may be a reasonable workload to A/B test with if you're looking into this.
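A crude harness for that kind of A/B test might look like the sketch below; the trial count and the stand-in build command are my assumptions, not something from the thread. The idea is to point BUILD_CMD at a real clean build of llvm-project and compare failure counts before and after changing the BIOS limit.

```shell
#!/bin/sh
# Hypothetical A/B harness: run a clean-build command repeatedly and count
# failures. Swap BUILD_CMD for a real clean LLVM build, e.g.:
#   rm -rf build && cmake -S llvm -B build -G Ninja && ninja -C build
run_trials() {
  # $1 = number of trials, $2 = command to run each trial
  n=$1; cmd=$2; fails=0; i=1
  while [ "$i" -le "$n" ]; do
    sh -c "$cmd" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "$fails"
}

BUILD_CMD="true"  # stand-in so the sketch runs anywhere
failures=$(run_trials 3 "$BUILD_CMD")
echo "failed $failures of 3 trials"
```

With the cap in place you would expect the failure count to drop to zero over the same number of trials; any nonzero count on a stock-settings run reproduces the problem.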

The user quoted above was running Gentoo builds pinned to specific p-cores to test various solutions, ultimately finding that the p-core limit was the only fix that yielded stability.

xuejie | 1 year ago

Just to add a not-related-at-all but IMHO still interesting case: I used to have a Kioxia CM6 U.2 SSD. It would pass every benchmark the reseller was willing to run, but whenever I tried to clean-compile Rust on it, the drive would almost certainly fail somewhere in the build process. While there are configurations that let you compile Rust against a pre-built LLVM, in my tests I was compiling LLVM along the way. So I agree with the comment here that there may be some unique property of multi-core compilation workloads, though my tests pointed to a faulty drive, while the comment above is about an Intel CPU.

instagib | 1 year ago

I set a limit for ninja/cmake to only use 4 or so cores when I was getting hang-ups during large compiles.

Rename ninja to oninja, then make an executable shell script named ninja in the same directory:

  #!/bin/sh
  exec oninja -j4 "$@"

(exec avoids leaving an extra shell process around, and quoting "$@" preserves arguments that contain spaces.)