top | item 42467878

(no title)

dinartem | 1 year ago

So the PPC instruction set uses lwarx (load word and reserve indexed), and stwcx (store word conditional indexed), along with variations for word size, to implement atomic operations such as interlocked-increment and test-and-set.

So on PPC interlocked-increment is implemented as:

loop: lwarx r4,0,r3 # Load and reserve r4 <- (r3) addi r4,r4,1 # Increment the value stwcx. r4,0,r3 # Store the incremented value if still reserved bne- loop # Loop and try again if lost reservation

The idea is that the lwarx places a reservation on an address that it wants to update at some later time. It doesn't prevent any other thread or processor from reading or writing to that address, or cause any sort of stall, but if an address being reserved is written to, conditional or otherwise, then the reservation is lost. The stwcx instruction will perform the store to memory if the reservation still exists clears the NE flag, otherwise it doesn't do the write and sets the NE flag and software should just try again until it succeeds.

On the Xbox 360 we provided the compiler which would emit sequences like these for all atomic intrinsics, but developers could also write assembler code directly if they wanted to. We'll get back to this point in a moment.

As the V1 version of the Xbox 360 CPU was being tested by IBM, they discovered that an error with the hardware implementation of these two instructions and issued an errata for software to work around it, which we implemented. Unfortunately, after further testing IBM discovered that the errata was insufficient, so issued a second errata, which we also implemented and assumed all was well.

Then the V2 version of the CPU comes out and months go by. But early one morning I get a phone call from IBM letting me know that the latest errata was still insufficient and that the bug is in the final hardware. Further, Microsoft has already started final production of CPU parts, even before full testing was fully complete (risk buy), so that they could have sufficient supply for the upcoming November release. I was told that they are stopping manufacturing of additional CPUs, and that I had 48 hours to figure out if there is anything software can do that could work around the hardware issue. They also casually mentioned that millions of dollars of parts would need to be discarded, a hardware fixed implemented which would take weeks, then the production could resume from scratch.

Bottom line is that, yes, there was a set of software changes that would work around the bug, but it required very specific sequences of instructions, the disabling of interrupts around these sequences, a change to the hypervisor, and updating the compiler to emit the new sequences. To make sure that developers didn't introduce code sequences that uses lwarx/stwcx in a way that would expose the bug (via inline assembly, for example), the loader would scan the code and refuse to load code that didn't obey the new rules.

Interesting fact: the hardware bug existed in every version of the Xbox 360 ever shipped, because software needed to run on any console ever shipped, there was no advantage to ever fixing the bug since software always needed to work around it anyway.

discuss

order

markus_zhang|1 year ago

Thank you so much. This is so awesome to know and learn.

I'm just curious, what are the instructions that replace the lwarx/stwcx "atomic" pair? From my understanding, basic you need to generate a pair of load reserved/save instructions, and you have to replace the pair with a series of instructions. But I don't understand why do you have to disable interrupts -- is it because actually multiple instructions were used to facilitate the load, and an interrupt may disturb a value stored in a register?

Sorry I know little about assembly and arch.

dinartem|1 year ago

We still used the lwarx/stwcx pair to implement atomic operations, but to avoid the hardware bug a strict rule needed to be followed.

Rule: On a given hardware thread (there are two hardware threads per processor on the Xbox 360), every lwarx reservation of an address must be paired with a stwcx conditional store to that same address before a reservation is made to a different address. So a sequence like lwarx A / lwarx B / stwcx B / stwcx A is forbidden. But lwarx A / stwcx A / lwarx B / stwcx B is fine.

So I changed the compiler to emit atomic intrinsics that obeyed this rule.

But there was still the issue of logical thread scheduling. Imagine there are two logical threads running, one has a sequence of lwarx A / stwcx A and the other has lwarx B / stwcx B. The first thread is running on a hardware thread and just after executing lwarx A, the timer interrupt fires and the kernel decides to switch to the second logical thread, which executes lwarx B, and thus violates the rule.

To make sure that never happens, the compiler also emits disable-interrupts / lwarx A / stwcx A / enable-interrupts. That prevents the scheduler from switching threads in the middle of the atomic sequence.

But there was still one more problem. It is possible for a page-fault to occur in the middle of the sequence should it span the end of one page and the beginning of another, and the second page is not in the TLB. So the thread is running along and executes disable-interrupts / lwarx A, then when trying to fetch the next instruction it faults to the hypervisor because it isn't yet mapped by the TLB. The hypervisor executes a bunch of code to add the mapping of the new page to the TLB and then returns to the faulting thread to complete the stwcx A / enable-interrupts sequence.

The problem is that the TLB is a shared resource between the two hardware threads of a processor, so the two hardware threads need a way to atomically update the TLB, and the obvious way to do that is to use a spin-lock that is naturally implemented by a lwarx B / stwcx B pair of instructions. But the hypervisor TLB handler can't use those instructions because the code causing the TLB fault might be in the middle of using them and thus would cause the hardware bug to manifest.

The solution was to use non-reservation load/store instructions to implement a simple spin-lock. If the two hardware threads were both trying to update the TLB then hardware thread 2 would simply wait for hardware thread 1 to clear its lock before proceeding.