(no title)
Zeetah | 1 year ago
You filed a bug report, then dug into it and used SBox to figure out what must have been going wrong.
The chip supplier came back with a workaround and within five minutes you simulated it on SBox and said it wouldn't work, why, and then said how it should be fixed.
The supplier didn't believe you yet, so you worked out a workaround to get us unblocked. Two weeks later they agreed with your fix...
maximilianburke|1 year ago
dinartem|1 year ago
So on PPC interlocked-increment is implemented as:
    loop: lwarx  r4,0,r3   # Load and reserve r4 <- (r3)
          addi   r4,r4,1   # Increment the value
          stwcx. r4,0,r3   # Store the incremented value if still reserved
          bne-   loop      # Loop and try again if lost reservation
The idea is that lwarx places a reservation on an address that it wants to update at some later time. It doesn't prevent any other thread or processor from reading or writing to that address, or cause any sort of stall, but if a reserved address is written to, conditionally or otherwise, the reservation is lost. The stwcx. instruction performs the store to memory if the reservation still exists and clears the NE flag; otherwise it doesn't do the write and sets the NE flag, and software should just try again until it succeeds.
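For readers who don't write PPC assembly, the same retry loop can be sketched in portable C11. This is only an illustration, not the code the Xbox 360 compiler emitted: `interlocked_increment` is a hypothetical name, and a compare-exchange compares values rather than tracking a reservation, so it is an approximation of the lwarx/stwcx. behavior described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an Xbox 360 API): atomically adds 1 to *addend
   and returns the new value, retrying until the update succeeds. */
uint32_t interlocked_increment(_Atomic uint32_t *addend)
{
    assert(addend != NULL);

    /* Roughly the "load" half: read the current value. */
    uint32_t observed = atomic_load_explicit(addend, memory_order_relaxed);

    /* Roughly the "store if still valid" half. The weak form is allowed to
       fail spuriously, which is why it is always used inside a loop. On
       failure it writes the value currently in *addend into 'observed'. */
    while (!atomic_compare_exchange_weak_explicit(
               addend, &observed, observed + 1u,
               memory_order_seq_cst, memory_order_relaxed)) {
        /* Another writer got in first; retry with the refreshed value. */
    }

    return observed + 1u;
}
```

Compilers targeting load-reserve/store-conditional architectures typically lower a loop like this to a reserve, modify, conditional-store sequence, which is why the weak variant is permitted to fail even when the values match.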
On the Xbox 360 we provided the compiler, which would emit sequences like these for all atomic intrinsics, but developers could also write assembler code directly if they wanted to. We'll get back to this point in a moment.
As the V1 version of the Xbox 360 CPU was being tested by IBM, they discovered an error in the hardware implementation of these two instructions and issued an erratum for software to work around it, which we implemented. Unfortunately, after further testing IBM discovered that the erratum was insufficient, so they issued a second erratum, which we also implemented, and we assumed all was well.
Then the V2 version of the CPU comes out and months go by. But early one morning I get a phone call from IBM letting me know that the latest erratum is still insufficient and that the bug is in the final hardware. Further, Microsoft has already started final production of CPU parts, even before testing was complete (a risk buy), so that they could have sufficient supply for the upcoming November release. I was told that they were stopping manufacturing of additional CPUs, and that I had 48 hours to figure out whether there was anything software could do to work around the hardware issue. They also casually mentioned that millions of dollars of parts would need to be discarded and a hardware fix implemented, which would take weeks, before production could resume from scratch.
Bottom line is that, yes, there was a set of software changes that would work around the bug, but it required very specific sequences of instructions, the disabling of interrupts around these sequences, a change to the hypervisor, and updating the compiler to emit the new sequences. To make sure that developers didn't introduce code sequences that use lwarx/stwcx. in a way that would expose the bug (via inline assembly, for example), the loader would scan the code and refuse to load code that didn't obey the new rules.
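To give a feel for what such a scan has to start with, here is a rough sketch of spotting lwarx and stwcx. in a block of code. It is not the actual loader: the function names are hypothetical, the real rules about which surrounding sequences were acceptable aren't described here, and the sketch assumes the instruction words have already been read into host-order 32-bit integers. The only facts it relies on are the standard PowerPC encodings (primary opcode 31, extended opcode 20 for lwarx, extended opcode 150 with the record bit set for stwcx.).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PowerPC X-form fields: primary opcode in the top 6 bits,
   10-bit extended opcode just above the lowest (record) bit. */
enum { PPC_PRIMARY_31 = 31, PPC_XO_LWARX = 20, PPC_XO_STWCX = 150 };

int is_lwarx(uint32_t insn)
{
    return (insn >> 26) == PPC_PRIMARY_31 &&
           ((insn >> 1) & 0x3FFu) == PPC_XO_LWARX;
}

int is_stwcx(uint32_t insn)
{
    /* stwcx. always has the record bit (lowest bit) set. */
    return (insn >> 26) == PPC_PRIMARY_31 &&
           ((insn >> 1) & 0x3FFu) == PPC_XO_STWCX &&
           (insn & 1u) == 1u;
}

/* Hypothetical first pass of a scanner: counts the reservation
   instructions in a code block. A real checker would then verify the
   instructions around each hit against its list of allowed sequences. */
size_t count_reservation_ops(const uint32_t *code, size_t count)
{
    assert(code != NULL || count == 0);

    size_t hits = 0;
    for (size_t i = 0; i < count; ++i) {
        if (is_lwarx(code[i]) || is_stwcx(code[i])) {
            ++hits;
        }
    }
    return hits;
}
```

For example, the first three instructions of the increment loop encode as 0x7C801828 (lwarx r4,0,r3), 0x38840001 (addi r4,r4,1) and 0x7C80192D (stwcx. r4,0,r3), so a scan over those words would flag two of the three.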
Interesting fact: the hardware bug existed in every version of the Xbox 360 ever shipped. Because software needed to run on any console ever shipped, there was no advantage to ever fixing the bug, since software always needed to work around it anyway.