It Is Never a Compiler Bug Until It Is

[+] jtolmar|5 years ago|reply

My first compiler bug was in my first year at Google. I'd just introduced a new system for the animation that updates your position while driving in Google Maps. It was perfectly buttery smooth as planned, except on my manager's commute the next day, where it constantly lurched back and forth. The others on the team were convinced that it had to be something with my code, but I didn't think it could be, because my code had no conditional statements and should either always be right or always be wrong.

It kind of looked like it was being fed nonsense speed values, so I got the GPS log from my manager and checked - but no weird speed values, actually a remarkably clean GPS log. Replayed his GPS on my phone - worked perfectly fine, buttery smooth. Eventually it came out that it only happened on my manager's phone. Borrowing said phone and narrowing things down with printf, I showed that my core animation function was being called with the correct values (a, b, c, d) but was being run with the wrong ones (a, a, c d). This is when my manager thought to mention that he was running the latest internal alpha preview of Android.

Searching Android's bug tracker for JIT bugs, I found that they had a known register aliasing bug. Honestly I have no idea how it ran well enough to get to my code in the first place. But I tagged my weird animation bug as related to that (they didn't really believe me) and ignored it until they fixed their thing, at which point it went away.

[+] sebazzz|5 years ago|reply

JIT bugs must be horrible to debug because they happen in full running programs and might not always result in crashes. RyuJIT of dotnet had a bug a few years ago which caused certain calculations on certain specific conditions to go wrong.

[+] ocfnash|5 years ago|reply

Compiler bugs are indeed pretty frightening. A few years ago I bumped into one in some code that had potential to have a big impact. Unfortunately I am not at liberty to give details about the business setting except to say that we had processes in place that prevented any danger.

In the end I whittled it down to the following tiny C# program:

  namespace UhOh
  {
    internal class Program
    {
      private static void Main()
      {
        System.Console.WriteLine(Test(0, 0));
      }
      private static bool Test(uint a, uint b)
      {
        var b_gte_a = b >= a;
        var b_gt_a = b > a;
        System.Console.WriteLine(b_gte_a);
        return b_gte_a && b_gt_a;
      }
    }
  }

Compiling and running this with Microsoft's .NET stack with versions 4.7.0 and below, the output was incorrectly: "True, True" instead of "True, False". (IIRC, it also had to be a 64-bit Release build.)

The intermediate language was correct; it was a bug in RyuJIT.

[+] noitpmeder|5 years ago|reply

I imagine some really interesting bugs must be encountered in the HFT/MM world. Risk controls are there for a reason :)

[+] 8fingerlouie|5 years ago|reply

A couple of decades ago i worked with GSM handsets, and because you certify the GSM stack along with the compiler, you're pretty much tied to a specific compiler version.

At the time we were working on a new "feature phone" with a 160x120 pixel display in 4 shades of grey, which was a huge upgrade compared to our previous models. Another feature was full screen images for the various applications, and we'd been implementing is into the software and testing it for weeks without problem. After the development cycle came to an end, our release team created a new software release and sent it to our test department, which almost instantly reported graphical errors back to us. We tested the software image on our own handsets, and half the screen was "garbage".

We spent weeks looking over the merged code, as well as debugging pointers code and found nothing. It wasn't until we were stepping through the paint code with a Lauterbach debugger that we noticed something was "off" with a pointer.

The platform was a 16 bit platform and memory was addressed using a page pointer pointing to a 64kb memory page. When we traversed the bitmaps fate would have it that this particular bitmap, in this particular build, was split between two pages, and when incrementing a pointer beyond the page limit, instead of incrementing the page pointer, it simply just overflowed and started from the beginning of the current page.

Another interesting bug we chased in the compiler was it's inability to add more than 2 variables in initial assignment.

i.e.

  int a = 1+2;        // a = 3
  int b = 1+2+3;      // b = 3
  int c = 10 + a + b; // c = 13

That took a while to figure out.

[+] hyperman1|5 years ago|reply

Long ago, I bought the MIX C compiler for DOS. It had a tendency to miscompile basic things like the ++ operator. Lost hours of my life on it.

One day I found out about DJGPP, and even if the download cost a fortune in phone time, life was so much better.

Now the C reference book included with MIX C turned out to be much better than the compiler itself, and served me well for a decade.

[+] widforss|5 years ago|reply

The first example sounds like a bug that could sneak in anywhere, but the second one makes me question how they laid out the compiler architecture.

Did they write a special expression parser just for declarations? I mean, this shouldn't be possible if there was just one expression parser?

[+] ptsneves|5 years ago|reply

Of note to me: certified compiler was actually non performing :) Lauterbach, the savior of embedded wizardry. Still in use today with chips one would not consider "embedded".

[+] jcadam|5 years ago|reply

I've encountered one genuine compiler bug in my (now 14+ year) career.

I was working on a defense contract, on a government system, where I was constrained by local IA policy to specific versions of various tools, including a relatively ancient version of gcc.

I can't recall exactly what the problem was, but I do remember figuring out after doing some research that the bug that was biting me had been identified and fixed in a later version of gcc. Which I was not allowed to install. So I had to implement a hack-tastic workaround anyway.

One of the best parts about that job - I was integrating the tool I was writing (in Python with Tk - it was the only installed and approved GUI library I could use) with a really old signal analysis library that had originally been written for VMS back in the day - then ported to SPARC/Solaris - then ported again to x86 (yes, VMS heritage was evident in places). Through many years of flowing through many different maintenance contractors, the library had become a giant amalgamation of Ye Olde FORTRAN, C, C++, and Python. To build it I needed a specific version of the Intel FORTRAN compiler, which my employer would not purchase, and the client IA policy would not allow on their system anyway. With much hackery, I managed to coax the damn thing into building using the "approved" gfortran that was already on the network.

Egad, what a horrible job that was.

[+] pjc50|5 years ago|reply

I slightly regret leaving the writeup of this bug to my old employer, but it was a spectacularly hard to identify one.

Windows (including CE) has a crash detection system called "structured exception handling". You can use this to route segfaults to either "try/catch" (don't) or "__try/__except" special handlers. We had one of these to log the error and show a crash screen. It worked fine on the desktop. It worked fine on the MIPS systems. On ARM systems, it sometimes didn't work. At these times, the debugger didn't work properly either.

I eventually found the kernel stack unwind code (with WinCE you get some, but not all, of the kernel source). For some reason this was looking at the jump instructions on the way down, and the particular problem was virtual method calls were implemented as "LDA pc, [r3]" after computing the target vtable location in r3.

r3 is in the "clobberable" set of registers in the ARM calling convention, so if it got overwritten lower down the call stack the unwind handler would read the wrong value and fail to follow the stack.

Fortunately it turns out there were two different versions of the ARM compiler, shipped in different toolkits by Microsoft (why? who knows) and using the other one didn't trigger the bug.

We checked the known-good compiler into source control and kept it there as critical build infrastructure.

[+] jfb|5 years ago|reply

I worked with a guy who had in fact once found a javac compiler bug, and he was maddening to try and get to fix bugs, because he'd just always point at the compiler.

[+] Aperocky|5 years ago|reply

What's worse is that those policy usually means the product will have more defect and easier to attack.

I've heard of story where government standard explicitly demanded a lower key size of encryption, way below current industry standard, for 'secure' applications.

[+] DannyB2|5 years ago|reply

I have encountered two compiler bugs in my (almost) 40 year career.

In about 1988, a bug in Apple's MPW Pascal compiler. I refused to believe it was a compiler bug, until I finally inspected the generated code. IMO the only way to be taken seriously about a compiler bug is to distill the defective compiler behavior down to a short (like one page) example. Also helpful is to show the generated code and how it is wrong. Bug was acknowledged and fixed.

In the mid 1990s, dabbling with C++ on (classic) Mac, I upgraded to a (name brand withheld) version 8.0 C++ compiler. The generated code behavior was obviously wrong. To make matters worse, it was possible to crash the compiler. The number of problems with that compiler were so bad that I simply ditched that compiler, and it didn't last much longer commercially. Sad, because its predecessor compilers (mostly C and Pascal) had been very good.

[+] samatman|5 years ago|reply

Vernor Vinge's software archaeologists are already among us.

I wonder what percentage of the job market that niche will be in fifty years.

[+] jcelerier|5 years ago|reply

> I've encountered one genuine compiler bug in my (now 14+ year) career.

you're so lucky. between just 2014 and 2018 I reported something like 30 bugs to msvc, gcc, clang (in decreasing order)

[+] vardump|5 years ago|reply

When something doesn't work as expected, I'll often check disassembly. That can massively cut troubleshooting time when something "smells" like a compiler bug.

This is why preferably everyone should learn to read assembler output. This is not limited to C/C++/Rust/etc. native code, the same output is typically also available for example for JVM and Javascript JIT.

Haven't found any miscompilations so far (unless you count braindead codegen), but quite a few hardware bugs. Including one CPU bug.

[+] DoofusOfDeath|5 years ago|reply

I agree. Also, depending on the compiler you're using, you may have some additional opportunities to look beneath the covers.

For example:

- With Clang, you can dump the C/C++ AST and/or the LLVM IR.

- With GCC, you can dump the Gnu Assembly (with source-level annotations).

These views can be helpful, especially for someone unfamiliar with the target machine's instruction architecture and ABI.

[+] searchableguy|5 years ago|reply

Do you have any recommendations on resources for picking up assembly and learning more about JITs?

[+] unknown|5 years ago|reply

[deleted]

[+] sokoloff|5 years ago|reply

Had a very junior engineer who had just started report a compiler bug to our all-technical-staff mail list. He was testing the not-yet-released next version (of tcl), so it was possible but we had appropriate skepticism and someone on list asked for the smallest reproduction case.

Few hours later, he verified and produced on-list a reproduction case where a variable could not be incremented by 1 but could by 2 or any other number.

Turns out he’d been taught in typing class that l (lowercase L) could be used for 1 and carried that into computing.

WONT_FIX

[+] msla|5 years ago|reply

> Turns out he’d been taught in typing class that l (lowercase L) could be used for 1 and carried that into computing.

A good way to keep them busy would have been to demand they type an exclamation point.

(Hint: Backspace-and-overstrike a period with a single quote.)

[+] Sniffnoy|5 years ago|reply

When did this happen? It's hard to imagine this happening anytime in, say, the past 15 years...

[+] frankjr|5 years ago|reply

Oh compilers are fun. Just recently I was reading through Rust's bug tracker, as one does, and learned that comparing function pointers is not deterministic. Compiling this code [0] in the Debug mode yields different results than in the Release mode. You can read the whole discussion about whether it's a LLVM bug, a Rustc bug, an undefined behavior, an intended behavior, a pretty serious bug, or nothing to worry about over at [1].

[0] https://play.rust-lang.org/?version=stable&mode=release&edit...

[1] https://github.com/rust-lang/rust/issues/54685

[+] account42|5 years ago|reply

> comparing function pointers is not deterministic

The comparison is deterministic - the perhaps unexpected part is that two distinct but identical functions in the source code are folded into one in the binary.

[+] chc4|5 years ago|reply

God, that issue is a mess. Lots of people missing the forest (Rust and LLVM are breaking constant folding guarantees) for the trees (function pointer equality is weird).

[+] dathinab|5 years ago|reply

It should be noted that function pointer comparison is defined to not be deterministic between different compiler option sets, it's at least for now possible but mostly useless.

Through there are bugs involved, too ;=)

(Due to const folding comparisons vs. doing them at runtime there was/is non-determinism in the same build between different call sites ... )

[+] Asraelite|5 years ago|reply

Just a heads up, you linked to the main Rust Playground page, not an actual gist.

[+] rkeene2|5 years ago|reply

Since we are sharing stories about bugs we ran into in compilers...

I once ran into a bug where "bash" would run commands out of order. It wasn't hard to trigger the bug, but it wasn't deterministic either.

When I first noticed the bug on my production systems it drove me insane, since the logs being generated were impossible. It took me a weeks to figure out that bash was running commands out of order.

Then, when I tried to report this bug, I ran into a lot of resistance. First over IRC, nobody believed this could possibly be happening -- and I was eventually directed to the mailing list [0], where the maintainers were initially not able to replicate it, but eventually more required elements were identified and the bug was fixed.

[0] https://lists.gnu.org/archive/html/bug-bash/2015-06/msg00010...

[+] nuriaion|5 years ago|reply

Some years ago i worked on a part of a specialized steering system for a car. This was done with certified everything (Certified Compiler, Processor, a lot of paper work etc.)

This was a 16-Bit processor and the C-compiler had a "funny" bug. If you had a struct with 3 8Bit Values in a row and a 16Bit Value afterwards it would overlap the 8Bit Value with the 16Bit value:

  struct {
    int8 a;
    int8 b;
    int8 c;
    int16 d;
  }

In this case the variable c and d would have the same address. This was on a cpu where we didn't had a debuger (not enough memory left for it), we only had a serial port for debuging.

[+] nullc|5 years ago|reply

I guess someone told the compiler authors that the automotive industry was a unionized industry.

[+] ncmncm|5 years ago|reply

A serial port is all you need, if you have room for a wee GDB stub. Then you get the full power of GDB.

I do this routinely where the target has 256GB of RAM, and (not incidentally) specialized network hardware, but no dev infrastructure except gdb-server (which provides the stub) and sshd. I build in a docker image that matches the target, but with dev tools, with the output bin directory sshfs-mapped to a directory on the target. I run the binary on the target under gdb-server, opening a socket listener. Then I run gdb natively on my dev machine, and `target remote server:61231` to attach to that socket. If I didn't have easy access to listening ports on it, I could ssh-tunnel one in.

So, a serial port and small RAM doesn't have to mean you have no debugger.

[+] dehrmann|5 years ago|reply

My first was at my first job out of school. It was a bit of an adventure telling my manager. It was in C, but with an old GCC version on an architecture like MIPS. My code would seemingly never run through a switch statement correctly, but it worked fine with if statements. Luckily and unluckily, the company was large, ran a custom GCC build from a third party and had a support contract. When I filed the bug, they said "there's a known issue with large jump tables on that GCC version, disable some optimization with this flag."

I think that made me just a little paranoid. I generally trust things, but depending on their popularity and likely it is my code path is run by lots of users, I realize library (and compiler!) bugs happen.

[+] ncmncm|5 years ago|reply

I have found bugs in Gcc, and reported them. I check on them once every few years to see if anything at all has happened on any of them. It seems worth distinguishing code-generation bugs from other compiler bugs. Most of my Gcc bugs are not code-generation bugs.

Back in the '80s, the C++ compiler was `cfront`. We spent half of every day bisecting source files to identify the line that would crash the compiler, and doctor it to step around the bug.

People who used to use the Lucid compilers said they were happy when Lucid flopped, because from then on their compiler only had known bugs, instead of a new crop every few months.

Things are better, nowadays, with compilers.

[+] lordnacho|5 years ago|reply

The problem with compiler and standard lib bugs is it's the last thing you suspect. You're always going to look at your own code first, because 99/100 it's you and not them. You're never going to immediately think "compiler bug", your first port of call is gonna be "I must be using the API wrong".

I discovered a bug in the Swift standard lib once, and it took ages before I got to the point where I decided to strip out my own code, just to make sure it was me. And it wasn't, there was genuinely something wrong in the lib that other people on SO were also able to reproduce.

Good on him for finding a bug in secp256 too. When it comes to cryptography code, it can be very hard to know what the right answer is. I always find some examples on the internet and put them in a unit test to make sure I'm not misusing the API, because if you do your answer looks the same: bunch of numbers in a byte array. To know that your numbers are wrong, you need to be sure you are testing them correctly. Which you can't be if you don't know if you're using the API correctly.

[+] sloucher|5 years ago|reply

I've worked with people for whom "compiler bug" is their first port of call...

[+] saagarjha|5 years ago|reply

> The problem with compiler and standard lib bugs is it's the last thing you suspect.

Well, until you start suspecting hardware bugs.

[+] HelloNurse|5 years ago|reply

I got a fun compiler ignorance bug.

Me: I have memory corruption when I call your API. IBM: trust us, our API DLL is perfectly compatible with your old Windows 32 bit client program! We changed nothing! Me: I have stack overruns. 4 bytes of return value from you overwrite 4 bytes of variables, whatever I declare last in my function. IBM: look at the source of our API façade! It's unchanged! (it was, except for harmless additions). Me: your compiled code is fairly similar, but the return value is bigger. (At this point, I was already on very friendly terms with Ghidra and with the Visual Studio remote debugger.) IBM: we just recompiled our code!

But they recompiled it with a newer compiler: time_t had changed from 32 to 64 bits, changing the size of the returned unions in their DLL but not in my client.

[+] read_if_gay_|5 years ago|reply

> As I rushed to recompile my computer system using GCC 8, I contemplated the vast consequences of such a bug could be, and pondered how it was possible that computers could function at all.

This hits home.

[+] bregma|5 years ago|reply

I maintain embedded development C and C++ toolchains for a living. I have seen my share of compiler bugs. For example, some optimization pass in a popular open-source compiler that would lose track of dereferences of pointer variables if they were more than 12 bytes deep in the stack, meaning that a reference capture in a C++ lambda would get converted to a value capture if it was the third or later capture in order and changes to the referenced capture would be lost....

Anyway, my experience is that compiler bugs do exist, but maybe 99% or so of "compiler bugs" reported by my users turn out to be undefined behaviour in their code.

[+] coldtea|5 years ago|reply

Note also that "it's never a compiler bug" applies more to things like GCC and so on.

If you're working with a new language or quickly changing, e.g. Nim, Crystal, etc, or even something as old as Rust, then it can much more easily just be a compiler bug...

[+] mark-r|5 years ago|reply

Yes, the older and more popular a compiler is, the less likely it is to have bugs.

The buggiest compiler I ever used was a C compiler that ran on a PC and generated code for the 68000 processor. We seemed to trip over something about once a month.

[+] jibal|5 years ago|reply

Yes, the Nim compiler is a disaster area, especially if you do lots of templates and macros.

[+] inglor_cz|5 years ago|reply

Maintaining an application that still has Symbian users (some people are really conservative and like their Nokia E52s, plus there isn't as much malware for this dead OS), bugs in old GCCE are rather annoying.

Sometimes, for no reason, GCCE just crashes compiling totally innocent code. Usually, a minor rewrite of the logic helps, or even weird edits such as adding a new (useless) parameter to a method.

The last GCCE toolchain for Symbian was released by CodeSourcery in March 2012. It contains GCC version 4.6.3. It is theoretically possible to adapt and compile a newer version, but the sources need so many edits that I gave up after a few days.

[+] speeder|5 years ago|reply

Symbian in Brazil was rapidily climbing in popularity when MS killed it.

It had 60% of phone market share and it was RISING despite the launch of iPhone and Android.

The reason is that:

  1. It worked great.
  2. Brazil had a vibrant dev community (people would even port PC games to Symbian O.o)
  3. It was much cheaper than iPhone and clones.
  4. Nokia phones were just solid and awesome.

After MS made that memo that killed Symbian, it died almost instantly, people got so disappointed with MS that they started to switch to android, even if it was some chinese-made "shit-phone" instead of "feature-phone", the amount of really, really crappy androids that flooded the market was mind boggling, many didn't even work right, for example wouldn't complete calls properly or wouldn't connect to some Wi-Fi channels.

[+] matharmin|5 years ago|reply

A couple of years back I ran into a JDK JIT bug during a project. The code ran fine until I ran it through benchmarks, which triggered JIT on a method causing it to return incorrect results.

Took a long time to find, because there were no errors, just wrong results (a specific if statement taking the wrong branch).

Trying to get assistance from others were mostly met with responses along the lines of "It's probably a race condition" (in single-threaded code) / "very unlikely to be a bug in the JIT". I did end up finding a way to disable JIT for the specific method, which solved the issue, and never got around to finding the root cause. I do believe it has been fixed in the meantime at least.

I haven't run into major compiler bugs since then, but often have to dive deep into libraries to find obscure bugs (database drivers and web servers most often).

[+] JoeAltmaier|5 years ago|reply

Ah kids these days. New compilers used to be written every year or so, and had the most horrific bugs. For instance, the one that reordered complex 'if' conditions to evaluate in chunks, ignoring precedence. Or the compiler that stored parameter context while compiling with a different lifetime than the actual one - resulting in references to deleted memory during compilation. And on and on.

Used to be, a compiler bug was right up there with a memory issue in your list of 'what might be wrong'.

[+] jjice|5 years ago|reply

My immediate tangential thought is about Ken Thompson's paper, "Reflections on Trusting Trust".

The gist is that our security issues could come layers away from where we may expect them to, all the way up to the compiler. It's a great paper, but who would have expected anything less from Ken Thompson.

https://dl.acm.org/doi/10.1145/358198.358210

132 comments