top | item 45516000

We found a bug in Go's ARM64 compiler

835 points| jgrahamc | 5 months ago |blog.cloudflare.com

140 comments

order
[+] Neywiny|5 months ago|reply
That's an incredible find and once I saw the assembly I was right along with them on the debug path. Interestingly it doesn't need to be assembly for this to work, it's just that that's where the split was. The IR could've done it, it just doesn't for very good reasons. So another win for being able to read arm assembly.

Unsure if this would be another way to do it but to save an instruction at the cost of a memory access you could push then pop the stack size maybe? Since presumably you're doing that pair of moves on function entry and exit. I'm not really sure what the garbage collector is looking for so maybe that doesn't work, but I'd be interested to hear some takes on it

[+] Veserv|5 months ago|reply
You would normally use the “LDR Rd, =expr” pseudo-instruction form [1]. For immediates not directly constructible, it puts a copy of the immediate value in a PC-relative memory location, then does a PC-relative load into register.

So that would turn the whole sequence of “add constant to SP” into 2 executable instructions, 1 for constructing immediate and 1 for adding for a total of 8 bytes, and a 4 byte data area for the 17-bit immediate for a total of 12 bytes of binary which is 3 executable instructions worth.

[1] https://developer.arm.com/documentation/dui0801/l/A64-Data-T...

[+] pklausler|5 months ago|reply
I'm a little surprised that this bug wasn't fixed in the assembler as a special case for immediate adds to RSP. If the patch was to the compiler only, other instances of the bug could be lurking out there in aarch64 assembly code.
[+] bloak|5 months ago|reply
> So another win for being able to read arm assembly.

Yes, though that weird stuff with dollars in it is not normal AArch64 assembly!

The article could have mentioned the "stack moves once" rule.

[+] pjmlp|5 months ago|reply
Usually in runtimes like Java and .NET there are safepoints exactly to avoid changing context in the middle of a set of instructions.
[+] titzer|5 months ago|reply
I think the right fix is that the compiler should, e.g. load the constant into a register using two moves and then emit a single add. It's one more instruction, but then the adjustment is atomic (i.e. a single instruction). Another option is to do the arithmetic in a temp register and then move it back.
[+] pengaru|5 months ago|reply
[+] chavi2|5 months ago|reply
One thing I worry about, probably unnecessarily, is anything with a sense of urgency.

HEY GUYS WE JUST FOUND A GOLANG COMPILER BUG AND FATAL PANICS!

Everyone is like “Hmm. I need to fix this now.”

So, 99% probability it’s what it is. 1% it’s some secret defensive thing because there was a bad stupid zero day someone would get fired over or that could leave the world in shambles if uncovered, or maybe something else needed to be swept under the rug, or maybe someone wants to distract while they introduce a new vulnerability.

I don’t think this with CVEs, but when someone’s like “install this patch everybody!” the dim red light flickers on.

[+] Vipsy|5 months ago|reply
One thing that often gets missed is how hard it is to even suspect the compiler as the root cause. Most engineers waste hours chasing bugs in their own code because we’re trained to trust our tools. This mindset alone can make these rare compiler bugs much trickier to find.
[+] pjmlp|5 months ago|reply
In the early PC days we suspected them a lot given how manually writting Assembly was still much better, in many cases.

I found out a bug on Turbo Pascal 6, where if you declare a variable with the same name as the function name, then the result was random garbage.

For those that don't know Pascal, the function name has to be assigned for the result value, so if a local variable with the same name is possible, then you cannot set the return value.

Something like this https://godbolt.org/z/s6srhTW66

    (* In Turbo Pascal 6 this would compile *)

    function Square(num: Integer): Integer;
    var
        Square: Integer;

    begin
        Square := num * num; (* Here the local variable gets used instead *)
    end;
[+] Tor3|5 months ago|reply
In the past it was more common to suspect the compiler, as others mention here. On a minicomputer I worked with in the late eighties, early nineties, I occasionally found errors in the compiler output. This was a Pascal compiler and because of that it didn't take too long to figure out that the code was actually correct and something else must be going on. Then firing up the debugger/tracer and scrutinizing and analyzing what happens in the disassembly.. when the problem was found, send a fax (yes!) to the head designer of the compiler, get a fixed test compiler back on a set of floppies.. went through this several times. I still have a printout somewhere with my pen marks pointing out a bug in the generated code.
[+] kmarc|5 months ago|reply
There are certain professions where the compilation process is (ab)used to optimize to a point where these bugs seemingly surface more often.

In the HFT sphere i haven't talked to a company that hasn't reported (bragged about finding) a super weird gcc/clang bug.

Well, also, at my last job we used a snapshot version of the compiler, bc... Any nanoseconds matters.

[+] renewiltord|5 months ago|reply
Great technical blog. Good pathway for narrative, tight examples, description so clear it makes me feel smarter than I am because so easy to follow though the last time I even read assembly seriously was x86 years ago.

Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.

I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.

[+] quotemstr|5 months ago|reply
This problem strikes me more as a debuginfo generation bug than a "compiler" bug.

> After this change, stacks larger than 1<<12 will build the offset in a temporary register and then add that to rsp in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.

Seems silly to pessimize the runtime, even slightly, to account for the partial register construction. DWARF bytecode ought to be powerful enough to express the calculations needed for restoring the true stack pointer if we're between immediate adjustments.

[+] sauercrowd|5 months ago|reply
> This problem strikes me more as a debuginfo generation bug than a "compiler" bug.

But isn't that the same thing here? The bug occurred in their production workflows, not in some specific debug builds, so with that seems pretty reasonable to call it a compiler bug?

[+] riobard|5 months ago|reply
What ARM64 machines are you using and what are they used for? Last year you were announcing Gen 12 servers on AMD EPYC (https://blog.cloudflare.com/gen-12-servers/), but IIRC there weren’t any mentions of ARM64. But now it seems you’re running ARM64 in full production.
[+] zamadatix|5 months ago|reply
I'm not Cloudflare, I just read their blog too much. As they hint in the article when mentioning secure boot, they've been deploying Ampere in parallel to AMD for several years now. Purpose wise it seems to be Edge related for efficiency reasons, but maybe they use them for other things too. You can read some more here https://blog.cloudflare.com/designing-edge-servers-with-arm-... and here https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-... along with the original evaluation of Qualcomm here https://blog.cloudflare.com/arm-takes-wing/
[+] EE84M3i|5 months ago|reply
I seem to recall Cloudflare hosts their some of their non-edge compute on public clouds? Like control plane stuff. Could be that.
[+] MarkSweep|5 months ago|reply
I wonder if Go had a mode where you make it single step every instruction and trigger a GC interrupt on every opcode. That would make it easier to find these kinds of bugs.
[+] wy1981|5 months ago|reply
Great find and writeup.

As an aside, this is the type of a problem that I think model checkers can't help with. You can write perfect and complicated TLA+/Lean/FizzBee models and even if somehow these models can generate code for you from your correct models you can still run into bugs like these due to platform/compiler/language issues. But, thankfully, such bugs are rare.

[+] alberth|5 months ago|reply
I thought Cloudflare was 100% Rust, and x86 (EPYC) these days.

Interesting to hear Go & ARM in use.

[+] surajrmal|5 months ago|reply
I doubt any company is mono language at that scale. Using ARM usually makes sense for s lot of horizontal scaling workloads so it's also not that surprising.
[+] steveklabnik|5 months ago|reply
Cloudflare has long kept Arm builds of everything even when they deployed to x86 only, to make it easy to switch when it made sense.

And yeah, a lot of Rust but also a lot of Go.

[+] dreamcompiler|5 months ago|reply
Always adjust your stack pointer atomically, kids.
[+] whizzter|5 months ago|reply
I guess those that wrote the preemption were on X86 where this doesn't happen thanks to variable length instructions being able to hold the constant and thus relied on the code-gen to do it atomically, then the ARM port had an automatic "split" from a higher level to make things "easy" thus giving us this bug.

Nobodys fault really, but bad results ensued.

[+] drob518|5 months ago|reply
Exactly what ran through my mind.
[+] brcmthrowaway|5 months ago|reply
I don't get it, how were the machine threads being stopped in thr middle of two instructions? This is baremetal, right?
[+] adgjlsfhk1|5 months ago|reply
go uses interrupts for GC notifications
[+] Agingcoder|5 months ago|reply
Excellent article as always from the cloudflare blog - engineering without magic infrastructure and ml. One day I will apply !

Compiler bugs are actually quite common ( I used to find several a year in gcc ), but as the author says, some of them only appear when you work at a very large scale, and most people never dive that far.

[+] pfdietz|5 months ago|reply
I see something like this and I wonder "what testing methodology would have found this?" It has to be general, not something that would involve knowing what the bug was ahead of time.
[+] syncsynchalt|5 months ago|reply
When your scale is large enough, you move to "what monitoring methodology will find this?"

When you're doing enough transactions you start to see a noise floor of e.g. bit flips from cosmic rays, and looking for issues involves correlating/categorizing possible software failures and distinguishing them from the misbehavior of hardware.

[+] javierhonduco|5 months ago|reply
Really enjoyed reading this. Thanks for writing it!
[+] maguro_01|5 months ago|reply
An x86-64 Windows 11 machine trying to access a previously available Website now always produces a Cloudflare "obsolete protocol" error on the ordinary attempt. Al browsers get the same error. Did your fix break something?
[+] Bengalilol|5 months ago|reply
I always appreciate articles like this, where you can clearly see the engineer’s way of thinking.

I was just puzzled by the middle part of the article, where they start investigating their code but seem to overlook the fact that it only happens on ARM64.

Still, I understand that it’s professional to proceed step by step logically.

Great article, it was a pleasure reading it!

[+] bradley13|5 months ago|reply
I find it interesting, how rare it has become to find s compiler bug. For me, at least, it used to be a regular event.

Even Java, as widespread as it is, I have made half-a-dozen reports. None in the last several years, though.

Better testing? The sheer scale of software being produced?

[+] lou1306|5 months ago|reply
Linus's law [1]? When it comes to compilers for mainstream languages, the userbases are so large that they will explore a surprisingly large portion of the compiler's state space.

But definitely, better engineering and QA practices must also help here.

[1] https://en.wikipedia.org/wiki/Linus%27s_law

[+] wat10000|5 months ago|reply
I would have thought that unwinding would use the frame pointer and this wouldn't be a problem.
[+] mperham|5 months ago|reply
The frame pointer was updated non-atomically in two asm ops. An async interruption between the two ops would lead to a corrupt frame pointer.
[+] mperham|5 months ago|reply
Did they ever explain why netlink was involved? Or was that a red herring?
[+] Sesse__|5 months ago|reply
The stack in that specific function was big enough to trigger the bug.
[+] drob518|5 months ago|reply
Seemed like a red herring. They were able to reproduce it without any libraries. Might have just been net link forcing the stacks to a certain size and that made the bug visible.
[+] syncsynchalt|5 months ago|reply
The netlink function uses a larger stack than most.

Their repro case required a stack adjustment larger than 1<<12 (4kiB).

[+] yalok|5 months ago|reply
Classic problem of non-atomic stack pointer modification.

Used to have a lot of fun with those 3 decades ago.