
Rust GCC backend: Why and how

199 points | ahlCVA | 2 months ago | blog.guillaume-gomez.fr

130 comments


mastax|2 months ago

> On that note: GCC doesn't provide a nice library to give access to its internals (unlike LLVM). So we have to use libgccjit which, unlike what the "jit" ("just in time", meaning compiling sub-parts of the code on the fly, only when needed for performance reasons, and often used in scripting languages like JavaScript) part of its name implies, can be used as "aot" ("ahead of time", meaning you compile everything at once, allowing you to spend more time on optimization).

Is libgccjit not “a nice library to give access to its internals”?

compiler-guy|2 months ago

To use an illustrative (but inevitably flawed) metaphor: Using libgccjit for this is a bit like networking two computers via the MIDI protocol.

The MIDI protocol is pretty good for what it is designed for, and you can make it work for actual real networking, but the connections will be clunky and unergonomic, and will be missing useful features that you really want in a networking protocol.

saghm|2 months ago

I could be wrong, but my surface-level understanding is that it's more of a library version of the external API of GCC than one that gives access to the internals.

LukeShu|2 months ago

libgccjit is much higher level than what's documented in the "GCC Internals" manual.

keyle|2 months ago

If the author reads this...

I'd be very interested if the author could provide a post with a more in depth view of the passes, as suggested!

petcat|2 months ago

> Little side-note: If enough people are interested by this topic, I can write a (much) longer explanation of these passes.

Yes, please!

grokx|2 months ago

When I studied compiler theory, a large part of compilation involved a lexical analyser (e.g. `flex`) and a syntax analyser (e.g. `bison`), which would produce an internal representation of the input code (the AST) that was then used to generate the compiled output.

It seems that the terminology has evolved, as we now speak more broadly of frontends and backends.

So I'm wondering: are Bison and Flex (or equivalent tools) still in use by modern compilers? Or are the parsers built directly into GCC, LLVM, and so on?

eslaught|2 months ago

The other answers are great, but let me just add that C++ cannot be parsed with conventional LL/LALR/LR parsers, because the syntax is ambiguous and requires disambiguation via type checking (i.e., there may be multiple parse trees but at most one will type check). The classic example is `a * b;`, which parses either as a multiplication expression or as a declaration of `b` as a pointer to type `a`, depending on what `a` names.

There was some research on parsing C++ with GLR but I don't think it ever made it into production compilers.

Other, more sane languages with unambiguous grammars may still choose to hand-write their parsers for all the reasons mentioned in the sibling comments. However, I would note that, even when using a parsing library, almost every compiler in existence will use its own AST, and not reuse the parse tree generated by the parser library. That's something you would only ever do in a compiler class.

Also, I wouldn't say that frontend/backend is an evolution of the previous terminology; it's just that parsing is not considered an "interesting" problem by most of the community, so the focus has moved elsewhere (to everything from AST design through optimization and code generation).

umanwizard|2 months ago

"Frontend" as used by mainstream compilers is slightly broader than just lexing/parsing.

In typical modern compilers, the "frontend" is basically everything involved in analyzing the source language and producing a compiler-internal IR: lexing, parsing, semantic analysis, type checking, and so on. And the "backend" is everything involved in producing machine code from the IR: optimization and instruction selection.

In the context of Rust, rustc is the frontend (and it is already a very big and complicated Rust program, much more complicated than just a Rust lexer/parser would be), and then LLVM (typically bundled with rustc though some distros package them separately) is the backend (and is another very big and complicated C++ program).
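
To make that boundary concrete (a small illustration, assuming a file named main.rs and a reasonably recent rustc), the --emit flag lets you stop after the frontend and inspect the IR that gets handed to the backend:

    rustc --emit=llvm-ir main.rs   # writes main.ll, the LLVM IR produced by the frontend
    rustc --emit=asm main.rs       # also runs the LLVM backend, writing main.s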

pklausler|2 months ago

Table-driven parsers with custom per-statement tokenizers are still common in surviving Fortran compilers, with the exception of flang-new in LLVM. I used a custom parser combinator library there, inspired by a prototype in Haskell's Parsec, to implement a recursive descent algorithm with backtracking on failure. I'm still happy with the results, especially with the fact that it's all very strongly typed and coupled with the parse tree definition.
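
For readers who haven't seen the style, here is a minimal parser-combinator sketch (in Rust rather than flang's C++, and purely illustrative): a parser is a function that consumes a prefix of the input and either returns a value plus the remaining input, or fails so the caller can backtrack and try an alternative.

    // A parser consumes a prefix of the input and either returns a value plus
    // the remaining input, or None so the caller can backtrack.
    type Parser<'a, T> = Box<dyn Fn(&'a str) -> Option<(T, &'a str)> + 'a>;

    // Match one exact character.
    fn char_p<'a>(wanted: char) -> Parser<'a, char> {
        Box::new(move |input: &'a str| {
            let mut chars = input.chars();
            if chars.next() == Some(wanted) {
                Some((wanted, chars.as_str()))
            } else {
                None
            }
        })
    }

    // Try `first`; if it fails, backtrack and try `second` on the same input.
    fn or<'a, T: 'a>(first: Parser<'a, T>, second: Parser<'a, T>) -> Parser<'a, T> {
        Box::new(move |input: &'a str| first(input).or_else(|| second(input)))
    }

    fn main() {
        let a_or_b = or(char_p('a'), char_p('b'));
        assert_eq!(a_or_b("banana"), Some(('b', "anana")));
        assert_eq!(a_or_b("cherry"), None);
    }

Real combinator libraries add sequencing, repetition, and error reporting on top of this core, but the backtrack-on-failure shape is the one described above.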

brooke2k|2 months ago

Not sure about GCC, but in general there has been a big move away from using parser generators like flex/bison/ANTLR/etc, and towards using handwritten recursive descent parsers. Clang (which is the C/C++ frontend for LLVM) does this, and so does rustc.

jojomodding|2 months ago

This was in the olden days when your language's type system would maybe look like C's if you were serious and be even less of a thing when you were not.

The hard part about compiling Rust is not really parsing; it's the type system, including borrow checking, generics, trait solving (which is itself Turing-complete), name resolution, and drop checking, and of course all of these features interact in fun and often surprising ways. Also macros. Also all the "magic" types in the standard library that require special compiler support.

This is why e.g. `rustc` has several different intermediate representations. You no longer have "the" AST; you have token trees, HIR, THIR, and MIR, and only then is the result lowered to LLVM, Cranelift, or libgccjit. Important parts of the type system are handled at each of these stages.
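
To make the "trait solving is Turing-complete" point concrete, here is a small illustrative sketch (not from the article): the trait solver can be made to compute at compile time, and the program below only type-checks if it derives that 1 + 2 = 3.

    #![allow(dead_code)]
    use std::marker::PhantomData;

    // Peano numbers encoded as types.
    struct Zero;
    struct Succ<N>(PhantomData<N>);

    // Type-level addition, computed entirely by trait selection.
    trait Plus<Rhs> {
        type Output;
    }

    // 0 + N = N
    impl<N> Plus<N> for Zero {
        type Output = N;
    }

    // (M + 1) + N = (M + N) + 1
    impl<M: Plus<N>, N> Plus<N> for Succ<M> {
        type Output = Succ<M::Output>;
    }

    type One = Succ<Zero>;
    type Two = Succ<One>;
    type Three = Succ<Two>;

    // Only compiles if the trait solver proves that One + Two = Three.
    fn assert_sum<A: Plus<B, Output = Three>, B>() {}

    fn main() {
        assert_sum::<One, Two>();
    }

Chains of impls like this are resolved recursively by the trait solver, which is why rustc has a recursion limit for trait resolution and why the solver is, in principle, Turing-complete.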

astrange|2 months ago

Compiler theory a) doesn't seem to have much to do with production compilers and b) is unnecessarily heavyweight and scary about everything.

In particular, it makes parsing everything look like a huge difficult problem. This is my main problem with the Dragon Book.

In practice everyone uses hacky informal recursive-descent parsers because they're the only way to get good error messages.

quamserena|2 months ago

Not really. Here’s a comparison of different languages: https://notes.eatonphil.com/parser-generators-vs-handwritten...

Most roll their own for three reasons: performance, context, and error handling. Bison/Menhir et al. make it easy to write a grammar and get started, but in exchange you get less flexibility overall. It becomes difficult to handle context-sensitive parts, do error recovery, and give the user meaningful errors that describe exactly what's wrong. Usually, if there's a small syntax error, we want to tell the user how to fix it instead of just producing "Syntax error", and that requires being able to fix up the input and keep parsing.

Menhir has a new mode where the parser is driven externally; this allows your code to drive the entire thing, which requires a lot more machinery than fire-and-forget but also affords you more flexibility.
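
As a toy illustration of the error-reporting point (not taken from any real compiler), a hand-written parser is ordinary code, so it can report exactly what it expected and where, and then decide how to recover. A minimal Rust sketch for "(1, 2, 3)"-style lists:

    // Hand-written parser for a parenthesized, comma-separated list of integers.
    // Because the control flow is ordinary code, the error paths can report
    // exactly what was expected and where, rather than a bare "syntax error".
    fn parse_list(input: &str) -> Result<Vec<i64>, String> {
        let chars: Vec<char> = input.chars().collect();
        let mut pos = 0usize;
        let mut items = Vec::new();

        if chars.get(pos).copied() != Some('(') {
            return Err(format!("expected '(' at position {}", pos));
        }
        pos += 1;

        loop {
            // Parse one integer.
            let start = pos;
            while chars.get(pos).map_or(false, |c| c.is_ascii_digit()) {
                pos += 1;
            }
            if pos == start {
                return Err(format!("expected a number at position {}, found {:?}", pos, chars.get(pos)));
            }
            let text: String = chars[start..pos].iter().collect();
            items.push(text.parse::<i64>().map_err(|e| e.to_string())?);

            // Next comes either ',' and another number, or ')' and we're done.
            match chars.get(pos).copied() {
                Some(',') => {
                    pos += 1;
                    while chars.get(pos).copied() == Some(' ') {
                        pos += 1;
                    }
                }
                Some(')') => return Ok(items),
                other => {
                    return Err(format!("expected ',' or ')' at position {}, found {:?}", pos, other));
                }
            }
        }
    }

    fn main() {
        println!("{:?}", parse_list("(1, 2, 3)")); // Ok([1, 2, 3])
        println!("{:?}", parse_list("(1, 2 3)"));  // Err with a precise position and expectation
    }

A generated parser can often be made to do the same, but it takes extra machinery (error tokens, recovery productions), which is part of why many production compilers accept the cost of writing the parser by hand.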

peterfirefly|2 months ago

Mostly because that's the part that had the best-developed theory, so that's what tended to be taught.

The rest of the f*cking owl is the interesting part.

MangoToupe|2 months ago

I find it shocking that 20 years after LLVM was created, gcc still hasn't moved towards modularization of codegen.

pjmlp|2 months ago

LLVM wasn't the first modularization of codegen; see the Amsterdam Compiler Kit, among others, for prior art.

GCC's approach is deliberate. Besides, even if they wanted to change it, who would take on the effort of making the existing C, C++, Objective-C, Objective-C++, Fortran, Modula-2, Algol 68, Ada, D, and Go frontends adopt the new architecture?

Even clang, with all of LLVM's modularization, is going to take a couple of years to move from plain LLVM IR to an MLIR dialect for C-based languages: https://github.com/llvm/clangir

ayende|2 months ago

Isn't that very much intentional on the part of GCC?

kunley|2 months ago

A perhaps naive question: does it have a chance of being faster than the LLVM backend?

MerrimanInd|2 months ago

Another reason to have a second compiler is for safety-critical applications. In the assessment of safety-critical tools, if something like a compiler has a second, redundant implementation, then each of them can be certified to a lower criticality level, since they cross-check each other. When a tool is single-sourced, the level of qualification required goes up quite significantly.

steveklabnik|2 months ago

rustc (via Ferrocene) is already being qualified, and from what I hear it’s been fairly easy to do so, for various reasons.

1718627440|2 months ago

I don't necessarily like the focus on Rust, but if it happens, then we need to have support in the free compiler!

lionkor|2 months ago

Why not? What about the technology or ecosystem do you disagree with?

ladyanita22|2 months ago

LLVM is also free

pessimizer|2 months ago

Almost the only thing I don't like about Rust is that a bunch of people actively looking to subvert software freedom have set up shop around it. If everything was licensed correctly and designed to resist control by special interests, I'd be a lot happier with having committed to it.

The language itself I find wonderful, and I suspect that it will get significantly better. Being GPL-hostile, centralized without proper namespacing, and having a Microsoft dependency through Github registration is aggravating. When it all goes bad, all the people silencing everyone complaining about it will play dumb.

If there's anything I would want rewritten in something like Rust, it would be an OS kernel.

umanwizard|2 months ago

Rustc (+ LLVM) already is a free compiler.

notepad0x90|2 months ago

I would just like to encourage all Rust devs to distribute binaries. No matter what compiler or Rust version you choose, users shouldn't have to build from source. I mostly see this with small projects, to be fair.