top | item 46215312

(no title)

statictype | 2 months ago

This article sums it up perfectly. I was interested in building a compiler long before going to college and this was the most accessible body of work.

Building a recursive descent parser from scratch was an eye opener to 17yo me on how a seemingly very complex problem that I had no idea how to approach can be made simple by breaking it down into the right primitives.

discuss

order

fuzztester|2 months ago

>a seemingly very complex problem that I had no idea how to approach can be made simple by breaking it down into the right primitives.

https://en.wikipedia.org/wiki/Niklaus_Wirth

From the Publications section of that Wikipedia page:

>The April 1971 Communications of the ACM article "Program Development by Stepwise Refinement",[22][23] concerning the teaching of programming, is considered to be a classic text in software engineering.[24] The paper is considered to be the earliest work to formally outline the top-down method for designing programs.[25][26] The article was discussed by Fred Brooks in his influential book The Mythical Man-Month and was described as "seminal" in the ACM's brief biography of Wirth published in connection to his Turing Award.[27][28]

microtherion|2 months ago

Wirth also wrote an extremely accessible book on Compiler Construction, using exactly the hand written recursive descent parsing approach discussed by OP.

The initial edition was published in 1976, in German, but the latest version is available online:

https://people.inf.ethz.ch/wirth/CompilerConstruction/Compil...

There are also parser generators like ANTLR (https://en.wikipedia.org/wiki/ANTLR) which take an input not unlike yacc, but generate a LL parser using explicit code, rather than the table driven LALR parsing of yacc.

xyproto|2 months ago

"breaking things down into the right primitives" is the real key to programming. There are many books and web pages about algorithms, but I wish there were more searchable and browsable resources for how to approach problems through primitives.

zwnow|2 months ago

The process of breaking a complex problem down into the right primitives requires great understanding of the original problem in the first place.

Whats blocking me during programming usually are edge cases I had no idea about. Its still hard to find good material on compilers if you are not into reading dry ass books. Thats a me problem though, I simply cant force myself to read boring factual only content (one of the reasons as to why I love beejs guides).

scotty79|2 months ago

When I need to parse something nowadays I always end up with parser combinators. They just make so much sense.

mutkach|2 months ago

What language do you use parser combinators in, and what kind of grammar do you parse usually? Nom was terribly verbose and unergonomic even by Rust's standards. Haskell's Megaparsec/Parsec is good but yeah, it's Haskell, you need to handle multiple monads (Parser itself is monadic, then your AST state, and maybe some error handling) at once and that's where I got confused. But I appreciated the elegance.

I experimented with PCs in Haskell and Rust (nom), then moved on to parser generators in Rust (pest.rs), Ocaml (Menhir), Haskell (Happy) and finally ended up with python's Lark - the speed of experimenting with different syntax/grammars is just insane.

userbinator|2 months ago

That's just a generalisation of recursive descent.

HarHarVeryFunny|2 months ago

That's a good point - recursive descent as a general lesson in program design, in addition to being a good way to write a parser.

Table driven parsers (using yacc/etc) used to be emphasized in old compiler writing books such as Aho & Ullman's famous "dragon (front cover) book". I'm not sure why - maybe part efficiency for the slower computers of the day, and part because in the infancy of computing a more theoretical/algorithmic approach seemed more sophisticated and preferable (the cannonical table driven parser building algorithm was one of Knuth's algorithms).

Nowadays it seems that recursive descent is the preferred approach for compilers because it's ultimately more practical and flexible. Table driven can still be a good option for small DSLs and simple parsing tasks, but recursive descent is so easy that it's hard to justify anything else, and LLM code generation now makes that truer than ever!

There is a huge difference in complexity between building a full-blown commercial quality optimizing compiler and a toy one built as a learning exercise. Using something like LLVM as a starting point for a learning exercise doesn't seem very useful (unless your goal is to build real compilers) since it's doing all the heavy lifting for you.

I guess you can argue about how much can be cut out of a toy compiler for it still to be a useful learning exercise in both compilers and tackling complex problems, but I don't see any harm in going straight from parsing to code generation, cutting out AST building and of course any IR and optimization. The problems this direct approach causes for code generation, and optimization, can be a learning lesson for why a non-toy compiler uses those!

A fun approach I used at work once, wanting to support a pretty major C subset as the language supported by a programmable regression test tool, was even simpler ... Rather than having the recursive descent parser generate code, I just had it generate executable data structures - subclasses of Statement and Expression base classes, with virtual Execute() and Value() methods respectively, so that the parsed program could be run by calling program->Execute() on the top level object. The recursive descent functions just returned these statement or expression values directly. To give a flavor of it, the ForLoopStatement subclass held the initialization, test and increment expression class pointers, and then the ForLoopStatement::Execute() method could just call testExpression->Value() etc.

userbinator|2 months ago

Rather than having the recursive descent parser generate code, I just had it generate executable data structures - subclasses of Statement and Expression base classes, with virtual Execute() and Value() methods respectively, so that the parsed program could be run by calling program->Execute() on the top level object.

That is basically an AST interpreter.