top | item 19633696

Ask HN: How did you learn parsing?

2 points| vector_spaces | 7 years ago | reply

Most programming language implementation/compiler/interpreter books I know of (EOPL, SICP, PAIP, Dragon Book) seem to sort of gloss over parsing, if they cover it at all, or they use some hack like the homoiconicity of Lisp which isn't super generalizable. PAIP goes so far as to call parsing a "distraction"!!

Or in the case of the Dragon Book, they present it in a way that is very detailed but isn't super helpful for someone who's self-teaching (but might be great for someone in a compilers class with a professor to provide context and focus).

I can hack together some parsing code for the most part if I need it, but I'd like to learn how to do it "properly".

I'm a self-taught working programmer (i.e. no degree in CS), and would appreciate responses from other self-taught people who've learned to write LALR/CLR/Pratt/etc parsers for fun or profit.

What resources or projects have helped you here? Any suggestions for someone in my position?

7 comments

[+] smitchell2|7 years ago|reply

I would suggest Recursive Descent for general parsing, and Precedence Climbing for expressions.

Once you get the hang of Recursive Descent, parsing becomes mechanical. And with Precedence Climbing for expressions, you just plug in the appropriate operator priorities.

Some of the literature below refers to "parsing" as "syntax analysis".

See this introductory set of articles:

https://www.semware.com/html/compiler.html

Then see this (small, 160 page) book from Wirth:

http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

See examples here - especially the C and Python examples: (lexical analysis examples included, since it must feed the parser)

http://rosettacode.org/wiki/Compiler/lexical_analyzer

http://rosettacode.org/wiki/Compiler/syntax_analyzer

[+] kazinator|7 years ago|reply

The "Dragon Book" glosses over parsing? Yagoddabekidding.

It teaches how to write LALR(1) parser generators, with detailed expositions of the algorithms including step by step execution of shift-reduce parsing. There is a chapter on implementing regular expressions. It has examples of the use of the Lex and Yacc tools.

[+] vector_spaces|7 years ago|reply

My bad, I wasn't being very clear. What I meant with my second sentence was that it's thorough to the point of being a bit overwhelming for someone who's self-teaching. That said, I should probably revisit it and with a more recent edition.

Thanks for the response

[+] tannhaeuser|7 years ago|reply

The dragon book doesn't "gloss over" parsing at all, and is how I learned to create practical parsers by hand (recursive-descent and simple bottom-up parsers) or using lex/yacc. I'd also recommend learning Prolog for easy DCG parsers and extensions (DCTG/attribute grammars).

[+] PaulHoule|7 years ago|reply

The hard way, over a long period of time. (Decades!)

I started out writing parsing code by hand, eventually worked out that parsing email headers was easy with a finite state machine, got into parser generators when I needed to, eventually absorbed a lot of the theory, always motivated by projects that needed the work done.

If you find the subject hard to learn don't just blame yourself, blame the subject. The theory and practice of parsing in 2019 is a lot less developed than it really should be.

(1) Small changes in grammar should be small changes in code. For instance, if you want to add an "unless" keyword to a language which is the opposite of "if" that really should be a few lines of code that are physically separate from the rest of the grammar.

(2) Grammars should be composable: you should be able to make a language that is like Java, except you can write something like

sql(select count(*) from customers)

and be able to do it by referencing existing grammars and just a few lines of code.

(3) Parsers should generate idiomatic data structures. If you are working in Java you should get ordinary objects, maybe in Javascript or Python get JSON-like structures. Small changes in the grammar (say operator precedence) should involve small changes in the parse tree.

(4) When you make a parser you should get an "unparser" for free that will convert the parse tree back to text

(5) Compiler books will talk about how making error messages that make sense is a hard problem and stop there. I've seen more than one system at this point hit the wall (e.g. Drools) because users could not figure out how to decode the error messages.

I think Parsing Expression Grammars are a step in the right direction, but I've yet to see one that ticks all the boxes above.

I find the old parsers for C written by the Bell Labs people interesting because of the hacks involved. Namely, a simple mechanism for operator precedence (try doing that for PEGs!) and a hackish level of dependence between the early stage parser and the symbol table. Finally the "callback hell" interface of yacc was a great idea at the time but is insane in modern application languages.

[+] nonsapreiche|7 years ago|reply

For me started learning lexx and yacc (and the red dragon)

[+] sethgecko|7 years ago|reply

Check out http://norvig.com/lispy2.html