[+] [-] examancer|8 years ago|reply
Ryzen Linux user here. I haven't experienced these issues yet, but I have experienced a few growing pains with early BIOS revisions not being 100% stable for me, and with RAM speed and timing challenges. Mostly resolved, though RAM speed is still slightly shy of XMP settings.
System is overclocked ([email protected]) and has been up and 100% solid for weeks now. 3.85 actually worked, and I tried to stress it by compiling a bunch of stuff. Didn't have any segfaults or other issues. Worked great.
Only after using an artificial stress tool (stress-ng) did I finally decide 3.85 was not 100% stable at stock volts. Backed off to 3.8 to avoid a voltage increase for now. Haven't rebooted since.
The issues being reported do seem legitimate, however. Not sure if it's the memory controller having trouble with certain DDR4, the motherboards, or errata within the Ryzen CPU itself. All seem plausible. Hopefully AMD finds a resolution. In the meantime I'm glad I'm not affected.
[+] [-] abbeyj|8 years ago|reply
[+] [-] octoploid|8 years ago|reply
See comment from inuwashidesu in this thread: https://www.reddit.com/r/programming/comments/6f08mb/compili...
[+] [-] i336_|8 years ago|reply
Couldn't find the comment with that link for some reason - thanks for mentioning the username, had to get it from their account page :)
[+] [-] i336_|8 years ago|reply
Initially this question was going to be "can we log executed instructions" but I rapidly realized that not even DDR5 could keep up with such a logging system - it would slow things down too much and likely mask the bug (not to mention the TBs of space that would be needed).
Rethinking a bit, my 2nd take is to see if it's possible to somehow repeatedly synthesize workloads from (presumably smaller, more manageable amounts of) seed data.
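Something like this toy sketch, maybe (everything here is made up for illustration): a tiny deterministic PRNG expands a small seed into an arbitrarily large workload, so a failing run can be regenerated from the seed alone instead of from TBs of logs.

    /* Sketch: expand a small seed into a large, fully deterministic
       workload. Only the seed needs to be stored to reproduce a run. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* xorshift64 PRNG: same (nonzero) seed in, same stream out. */
    static uint64_t xorshift64(uint64_t *s) {
        uint64_t x = *s;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return *s = x;
    }

    int main(int argc, char **argv) {
        uint64_t s = (argc > 1) ? strtoull(argv[1], NULL, 0) : 1;
        /* Emit a deterministic stream to feed the real workload,
           e.g. as input data for whatever is being compiled/hashed. */
        for (long i = 0; i < 1000000; i++)
            printf("%016llx\n", (unsigned long long)xorshift64(&s));
        return 0;
    }

If a given seed triggers the fault, rerunning with that seed regenerates the exact same byte stream, which is as close to "replayable" as you get without hardware support.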
One of the users in the AMD forum thread (I don't seem to be able to get a permalink) mentions that they're experiencing gcc crashes on Ubuntu inside VMware on Win10! This means the bug persists through two kernels' preemption/task scheduling and a hypervisor! Interesting.
What stumps me is that some users are experiencing gcc segfaults, while others are getting faults in `sh`.
...yeah this has me stumped. CPUs are so fast, and we have no idea where the problem is.
EDIT: This comment is interesting: https://www.reddit.com/r/programming/comments/6f08mb/compili...
[+] [-] posterboy|8 years ago|reply
Since programs are deterministic, knowing the initial parameters should theoretically be enough, but I am not sure whether the internal translation to microcode is still deterministic. Considering that the initial conditions for the kernel and every other part of the system would have to be reset in hardware and rebooted for every run until the bug is triggered, I'm not sure how feasible this approach would be.
Edit: My comment doesn't even get to your question. I mean, bug reproduction is quite hard even before recording comes into it.
[+] [-] dom0|8 years ago|reply
> Initially this question was going to be "can we log executed instructions" but I rapidly realized that not even DDR5 could keep up with such a logging system - it would slow things down too much and likely mask the bug (not to mention the TBs of space that would be needed).
Yes, the instruction fetching is quite a big part of all memory/cache reads :)
[+] [-] todd8|8 years ago|reply
From my experience, tracing the execution paths is possible, but it isn't really logging every instruction.
To isolate the fault, one can use a kind of binary search on the program containing the fault. By putting in one very lightweight tracing instruction that records that it was executed, one can check whether the fault happens before or after the tracing instruction runs. The tracing instruction can then be moved into the half of the code where the fault happened. By repeatedly dividing the code into smaller regions, one can eventually narrow the location down to a small sequence of instructions that might contain the problem.
Of course, doing this for a problem like this involves overcoming a number of large obstacles. First, the fault the OP is talking about appears to be somewhat unpredictable. This means that we have to keep records of the executions of the tracing instructions, and we need multiple tracing instructions in the code to see where the processor really was when the fault happened. A good understanding of the code's organization into basic blocks (roughly, sequences of machine instructions without branches), plus some programmatic way of deducing where the fault must have occurred from the counts of how many times each tracing instruction was executed, will help narrow down the region where the fault happens. Compilers, like GCC, can be used to systematically instrument the code.
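As one concrete (if simplistic) sketch of compiler-driven instrumentation: GCC's -finstrument-functions flag calls a pair of hooks on every function entry/exit, which can be used to keep per-site execution counts. The counter scheme below is made up for illustration:

    /* trace.c -- link into the program under test, built with:
       gcc -finstrument-functions -o prog prog.c trace.c
       GCC calls these hooks on every (non-inlined) function entry/exit. */
    #include <stdint.h>

    #define MAX_SITES 8192
    volatile uint32_t hit_count[MAX_SITES];  /* counts per code region */

    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *this_fn, void *call_site) {
        (void)call_site;
        /* Hash the function address into a slot; after a crash the
           counts show which regions were actually being executed.
           (Not thread-safe -- fine for a sketch.) */
        hit_count[((uintptr_t)this_fn >> 4) % MAX_SITES]++;
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *this_fn, void *call_site) {
        (void)this_fn; (void)call_site;
    }

After a faulting run, dumping hit_count (from a core file, say) shows where execution was concentrated when things went wrong.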
How can tracing instructions be of such low impact that they don't interfere with the fault being searched for? There is no guarantee that the attempt to measure or detect faults won't hide them, but lightweight tracing can be done with pretty simple tracing hardware (simple, that is, for companies that make computers, like IBM or HP). Basically, a device/card is plugged into one of the addressable buses (the memory bus, or an I/O bus like PCI). The tracing hardware simply looks for bus addresses in a range reserved for tracing, say 8k of addresses, allowing up to 8k tracing instructions to be scattered around the code. The tracing hardware can then record, in its own separate memory, the last few million of these addresses that appear on the bus in this unused range. The tracing instructions inserted into the program under test (in this case, say, bash) will depend on the hardware, in this case amd64. I'm not familiar enough with all of the new instructions available on the latest processors, but an instruction like set-memory-to-zero would work. The particular instruction doesn't really matter to the tracing hardware; it ignores the instruction itself and just looks for an address in the special range on the bus.
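In C, one such tracing instruction can be a single volatile store into the reserved window, assuming the window is mapped uncached so the store actually appears on the bus. The base address below is of course hypothetical; it would come from wherever the tracing card's window is mapped:

    /* Sketch: one-store trace points. The tracing card snoops the bus
       for writes inside a reserved 8k window; the value written is
       irrelevant -- only the address (the trace point's ID) matters. */
    #include <stdint.h>

    #define TRACE_BASE ((volatile uint8_t *)0xF0000000u) /* hypothetical */

    static inline void trace_point(unsigned id) {
        TRACE_BASE[id & 0x1FFF] = 0;  /* id selects one of 8k addresses */
    }

Placing trace_point() calls only at the entry and exit of hot loops keeps them off the critical inner paths, which is the point below about not tracing inside register/cache-resident loops.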
Even this fancy tracing hardware is too slow to use in the middle of loops running in registers or the cache, but by tracing the entry to and exit from such sequences, the hardware/software causing the problems can be isolated.
The same techniques are used for debugging and performance tuning of operating systems: special hardware traces the operation of, say, the disk scheduler, and a careful study of the relationship between the code responsible for scheduling operations on the disk drive and what shows up in the trace is used to reduce the inefficiencies or problems in the low-level drivers.
[+] [-] jacquesm|8 years ago|reply
This could be a CPU problem, but it could also easily be a memory subsystem or cooling issue. I really hope someone will get to the bottom of this soon and that it won't be a CPU issue; that could get expensive for AMD in a hurry.
Edit: and reading the comments in that thread, it would be great if people would note whether they're running stock clocks and whether they have upgraded their BIOS.
[+] [-] c2h5oh|8 years ago|reply
It seems to happen on heavily overclocked CPUs. A Phoronix user managed to replicate the issue he wasn't otherwise experiencing simply by pushing his overclock a bit further.
[+] [-] aidenn0|8 years ago|reply
I run a 100% stock clock Ryzen 1700 with the most recent BIOS. It happened for me very reliably after ~45 minutes of compiling. It was nearly always a segfault in bash (and most of the time bash was running libtool).
CPU temperatures were in the upper 50s; downright cold compared to the rather hot (and old) Xeon CPU it replaced.
Interestingly enough, and I'm not the only one to report this, rebuilding the entire system with GCC 6.3 made the problems go away (I'm running Gentoo, so this was quite feasible). This is really odd because I was not using any AMD-specific cflags, just the default x86-64 -march.
I'm guessing the problem didn't actually go away, but rather the instruction scheduling of GCC 6.3 is less likely to cause whatever the underlying problem was.
[+] [-] examancer|8 years ago|reply
There are multiple examples of people who are not overclocking at all and have gone to great lengths to ensure everything in their BIOS was properly configured. There does seem to be a real issue here. My money is on the memory controller and issues with certain DDR4 modules. Hopefully something AMD can sort out with BIOS updates.
[+] [-] dom0|8 years ago|reply
[+] [-] bnolsen|8 years ago|reply
Considering that it happens only on Ryzen, with multiple GCC versions, and that the first post in this Gentoo forum thread shows segfaults in bash (not gcc), it looks rather like a CPU bug manifesting itself under load.
[+] [-] dryatu|8 years ago|reply
Segfaults happen even with factory defaults.
[+] [-] unknown|8 years ago|reply
[deleted]
[+] [-] hatsunearu|8 years ago|reply
Yeah, overclocking tends to fuck up certain corner cases -- my friends once told me that when you're debugging with a debugger you should turn off overclocking, because that could really fuck up how the debugger works.
[+] [-] tscs37|8 years ago|reply
Might be affecting only a subset of users based on silicon?
[+] [-] wyldfire|8 years ago|reply
So, still to be ruled out is a bug in GCC itself?
[+] [-] qb45|8 years ago|reply
Nobody talked about even trying clang -- knowing Phoronix, they just couldn't have resisted mentioning it for drama or SEO.
[+] [-] gonzalezj|8 years ago|reply
[+] [-] raverbashing|8 years ago|reply
That would help identify the instructions concerned.
[+] [-] Qantourisc|8 years ago|reply