item 14477867

Users report parallel compiling is causing segfault on Ryzen Linux

116 points | rnhmjoj | 8 years ago | phoronix.com | reply

31 comments

[+] examancer|8 years ago|reply
Ryzen Linux user here. I haven't experienced these issues yet, but I have experienced a few growing pains with early BIOS revisions not being 100% stable for me, and RAM speed and timing challenges. Mostly resolved, though RAM speed is still slightly shy of XMP settings.

System is overclocked ([email protected]) and has been up and 100% solid for weeks now. 3.85 actually worked too, and I tried to stress it by compiling a bunch of stuff. Didn't have any segfaults or other issues. Worked great.

Only after using an artificial stress tool (stress-ng) did I finally decide 3.85 was not 100% stable at stock volts. Backed off to 3.8 to avoid a voltage increase for now. Haven't rebooted since.

The issues being reported do seem legitimate, however. Not sure if it's the memory controller having trouble with certain DDR4, the motherboards, or errata within the Ryzen CPU itself. All seem plausible. Hopefully AMD finds a resolution. In the meantime I'm glad I'm not affected.

[+] octoploid|8 years ago|reply
It appears to be an issue with Ryzen's new micro-op cache and "CMP/TEST conditional jump" instruction fusion.

See comment from inuwashidesu in this thread: https://www.reddit.com/r/programming/comments/6f08mb/compili...

[+] i336_|8 years ago|reply
Computer science-y question.

Initially this question was going to be "can we log executed instructions" but I rapidly realized that not even DDR5 could keep up with such a logging system - it would slow things down too much and likely mask the bug (not to mention the TBs of space that would be needed).

Rethinking a bit, my 2nd take is to see if it's possible to somehow repeatedly synthesize workloads from (presumably smaller, more manageable amounts of) seed data.

One of the users in the AMD forum thread (I don't seem to be able to get a permalink) mentions that they're experiencing gcc crashes on Ubuntu inside VMware on Win10! This means the bug survives two kernels' preemption/task scheduling and a hypervisor! Interesting.

What stumps me is that some users are experiencing gcc segfaults, while others are getting faults in `sh`.

...yeah this has me stumped. CPUs are so fast, and we have no idea where the problem is.

EDIT: This comment is interesting: https://www.reddit.com/r/programming/comments/6f08mb/compili...

[+] posterboy|8 years ago|reply
Since programs are deterministic, knowing the initial parameters should, in theory, be enough, but I am not sure whether the internal translation to microcode is still deterministic. Considering that the initial conditions for the kernel and every other part of the system would have to be reset in hardware, with a reboot for every run until the bug is triggered, I'm not sure how feasible this approach would be.

Edit: My comment doesn't even get to your question. What I mean is that reproducing the bug is quite hard even before any recording comes into play.

[+] dom0|8 years ago|reply
> Initially this question was going to be "can we log executed instructions" but I rapidly realized that not even DDR5 could keep up with such a logging system - it would slow things down too much and likely mask the bug (not to mention the TBs of space that would be needed).

Yes, instruction fetches make up quite a big share of all memory/cache reads :)

[+] todd8|8 years ago|reply
From my experience, tracing the execution paths is possible, but it doesn't require logging every instruction.

To isolate the fault, one can use a kind of binary search on the program containing it. By inserting a single very lightweight tracing instruction that records that it was executed, one can check whether the fault happens before or after the tracing instruction is reached. The tracing instruction is then moved into the half of the code where the fault occurred. By repeatedly dividing the code into smaller regions, one can eventually narrow the location down to a small sequence of instructions that might contain the problem.

Of course, doing this for a problem like this one means overcoming several large obstacles. First, the fault the OP is talking about appears to be somewhat unpredictable. That means we have to keep records of the tracing instruction's executions, and we need multiple tracing instructions in the code to see where the processor really was when the fault happened. A good understanding of the code's organization into basic blocks (roughly, sequences of machine instructions without branches), plus a programmatic way of analyzing the execution counts of all the different tracing instructions, helps narrow down the region where the fault occurs. Compilers like GCC can be used to instrument the code systematically.

How can tracing instructions be of such low impact that they don't interfere with the fault being searched for? There is no guarantee that the attempt to measure or detect faults won't hide them, but lightweight tracing can be done with pretty simple tracing hardware (simple, that is, for companies that build computers, like IBM or HP). Basically, a device/card is plugged into one of the addressable buses (the memory bus, or an I/O bus like PCI). The tracing hardware simply watches for bus addresses in a range reserved for tracing, say 8k of addresses, allowing up to 8k tracing instructions to be scattered around the code. The hardware can then record, in its own separate memory, the last few million of these addresses that appear on the bus in that unused range. The tracing instructions inserted into the program under test (in this case, say, bash) will depend on the hardware, in this case amd64. I'm not familiar enough with all of the new instructions available on the latest processors, but an instruction like store-zero-to-memory would work. The instruction itself doesn't matter to the tracing hardware: it ignores the instruction and just looks for an address in the special range on the bus.

Even this fancy tracing hardware is too slow to use in the middle of loops running out of registers or the cache, but by tracing the entry to and exit from such sequences, the hardware/software causing the problems can be isolated.

The same techniques are used for debugging and performance tuning of operating systems: special hardware traces the operation of, say, the disk scheduler, and a careful study of the relationship between the code responsible for scheduling disk operations and what shows up in the trace is used to reduce inefficiencies or problems in the low-level drivers.

[+] jacquesm|8 years ago|reply
This could be a CPU problem, but it could just as easily be a memory subsystem or cooling issue. I really hope someone gets to the bottom of this soon, and that it won't turn out to be a CPU issue, since that could get expensive for AMD in a hurry.

Edit: and reading the comments in that thread, it would be great if people would mention whether they're running stock clocks and whether they have upgraded their BIOS.

[+] c2h5oh|8 years ago|reply
It seems to happen on heavily overclocked CPUs. A Phoronix user managed to replicate the issue he wasn't otherwise experiencing simply by pushing his overclock a bit further.
[+] aidenn0|8 years ago|reply
I run a 100% stock-clocked Ryzen 1700 with the most recent BIOS. It happened for me very reliably after ~45 minutes of compiling. It was nearly always a segfault in bash (and most of the time bash was running libtool).

CPU temperatures were in the upper 50s; downright cold compared to the rather hot (and old) Xeon CPU it replaced.

Interestingly enough, and I'm not the only one to report this, rebuilding the entire system with GCC 6.3 made the problems go away (I'm running Gentoo, so this was quite feasible). This is really odd because I was not using any AMD-specific cflags, just the default x86-64 -march.

I'm guessing the problem didn't actually go away, but rather the instruction scheduling of GCC 6.3 is less likely to cause whatever the underlying problem was.

[+] examancer|8 years ago|reply
There are multiple examples of people who are not overclocking at all and have gone to great lengths to ensure everything in their BIOS was properly configured. There does seem to be a real issue here. My money is on memory controller and issues with certain DDR4 modules. Hopefully something AMD can sort out with BIOS updates.
[+] dom0|8 years ago|reply
Tickets from users running overclocked components are generally closed as INVALID.
[+] bnolsen|8 years ago|reply
Always had problems with this in the old days with Intel stuff when I used to overclock. There's a reason I don't overclock anymore...
[+] dryatu|8 years ago|reply
Not exclusively on overclocked CPUs.

Segfaults happen even with factory defaults.

[+] hatsunearu|8 years ago|reply
Yeah, overclocking tends to fuck up certain corner cases. My friends once told me that when you're debugging with a debugger you should turn off overclocking, because it can really mess up how the debugger works.
[+] tscs37|8 years ago|reply
Hmm, I've compiled kernels several times on Arch Linux since I got my Ryzen build together and I haven't experienced this issue at all so far.

Might be affecting only a subset of users based on silicon?

[+] wyldfire|8 years ago|reply
> The issue is happening on multiple versions of GCC but I haven't seen any reports when using LLVM/Clang or alternative compilers.

So a bug in GCC itself still hasn't been ruled out?

[+] qb45|8 years ago|reply
Considering that it happens only on Ryzen, with multiple GCC versions, and that the first post in this Gentoo forum thread shows segfaults in bash (not gcc), it looks more like a CPU bug manifesting itself under load.

Nobody talked about even trying clang; knowing Phoronix, they just couldn't resist mentioning it for drama or SEO.

[+] raverbashing|8 years ago|reply
Whoever hits the issue should enable the creation of core files and analyse the resulting backtraces.

That would help identify the offending instructions.

[+] Qantourisc|8 years ago|reply
I'm not sure I've had this. I had to turn down -j on some jobs, but not all, so it could still be a software problem.