Big choices are hand-rolled recursive descent vs. LALR, probably backed by a generator like Bison or Lemon, with re2c for the lexer.
Passing the lalr(1) check, i.e. having bison actually accept the grammar without complaining about ambiguities, is either very annoying or requires thinking clearly about your language, depending on your perspective.
I claim that a lot of the misfires in language implementations are from not doing that work, and using a hand rolled approximation to the parser you had in mind instead, because that's nicer/easier than the formal grammar.
The parser generators emit useless error messages, yes. So if you want nice user feedback, that'll be handrolled in some fashion. Sure.
Sometimes people write a grammar and use a hand rolled parser, hoping they match. Maybe with tests.
The right answer, used by no one as far as I can tell, is to parse with the LALR-generated parser; then, if that rejects your string because the program was ill formed, call the hand-rolled one for guesswork/diagnostics. Never feed the parse tree from the hand-rolled parser into the rest of the compiler; that way lies all the bugs.
As an alternative phrasing: your linter and your parser don't need to be the same tool, even if it's convenient in some senses to mash them together.
> parse with the lalr generated parser, then if that rejects your string because the program was ill formed, call the hand rolled one for guesswork/diagnostics
This feels like a recipe for disaster. If the hand-rolled parser won't match a formal grammar, why would it match the generated parser?
The poor programmer will be debugging the wrong thing.
It reminds me of my short stint writing C++ where I'd read undefined memory in release mode, but when I ran it under debug mode it just worked.
> If I was routinely working in a language that had a well respected de facto standard parser generator and lexer, and regularly building parsers for little languages for my programs, it would probably be worth mastering these tools.
In OCaml, a language highly suited for developing languages in, that de facto standard is the Menhir LR parser generator. It's a modern Yacc with many convenient features, including combinator-like library functions. I honestly enjoy the work of mastering Menhir, poring over the manual, which is all one page: https://gallium.inria.fr/~fpottier/menhir/manual.html
I gave up on Menhir after I understood how allocation-heavy it is during the hot path, at least in the incremental API which is needed for proper errors; and how much of a giant hack you need to force extra lookahead, which shouldn't be such a big deal for parser generators.
These days I just handroll recursive descent parsers with a mutable stream record, `raise_notrace` and maybe some combinators inspired by FParsec for choices, repetition and error messages. I know it's not as rigorous, but at least it's regular code without unexpected limitations.
I wonder who it is that likes other kinds of parser. Over the last ~10 years or so I've read several articles on HN arguing that recursive descent parsers are in fact great. They seem to be both the easiest to get started with and what almost all production-grade systems use. I've seen very little in the way of anything arguing for any other approach.
Recursive descent is fine if you trust that you won't write buggy code. If you implement a generator for it (easy enough), this may be a justifiable thing to trust (though this is not a given). I am assuming you're willing to put up with the insanity of grammar rewriting, one way or another.
LR however is more powerful, though this mostly matters if you don't have access to automatic grammar rewriting for your LL. More significantly, however, there's probably more good tooling for LR (or perhaps: you can assume that if tooling exists, it is good at what it is designed for); one problem with LL being so "simple" is that there's a lot of bad tooling out there.
The important things are 1. that you meaningfully eliminate ambiguities (which is easy to enforce for LR and doable for LL if your tooling is good), and 2. that you keep linear time complexity. Any parser other than LL/LR should be rejected because it fails at least one of these, and often both.
Within the LL and LR families there are actually quite a few members. SLR(1) is strong enough to be interesting but too weak for anything I would call a "language". LALR(1) is probably fine; I have never encountered a useful language that must resort to LR(1) (though note that modern tooling can do an optimistic fallback, avoiding the massive blowups of ancient LR tools). SLL(1) I'm not personally familiar with. X(k), where X is one of {SLL, LL, SLR, LALR, LR} and k > 1, is not very useful; k=1 suffices. LL(*), however, should be avoided due to backtracking; but in some cases consider whether you can parse token trees first (this is currently poorly represented in the literature; you want to be doing some form of this for error recovery anyway - automated error recovery is a useless lie) and/or defer the partial ambiguity until the AST is built (often better for error messages anyway, independent of using token trees).
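A minimal sketch of the "token tree" pre-pass mentioned above, assuming tokens arrive as a Python list and groups open/close with ordinary bracket pairs: the flat stream is grouped at matching delimiters before the grammar proper runs, which is also where unbalanced-delimiter errors get caught.

```python
# Bracket pairs assumed for this sketch; a real lexer would supply token kinds.
OPEN = {"(": ")", "[": "]", "{": "}"}

def token_trees(tokens):
    # Each stack entry is (expected closer, children collected so far);
    # the bottom entry is the top-level sequence with no closer.
    stack = [(None, [])]
    for tok in tokens:
        if tok in OPEN:
            stack.append((OPEN[tok], []))           # open a new group
        elif tok in OPEN.values():
            closer, children = stack.pop()          # close the current group
            if tok != closer or not stack:
                raise SyntaxError(f"unmatched {tok!r}")
            stack[-1][1].append(children)
        else:
            stack[-1][1].append(tok)                # ordinary token
    if len(stack) != 1:
        raise SyntaxError(f"unclosed group, expected {stack[-1][0]!r}")
    return stack[0][1]
```

The grammar-proper pass then walks nested lists instead of a flat stream, so a missing `)` can never smear an error across the rest of the file.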
But remember that the articles arguing for recursive descent parsers are arguing against the long-dominant paradigm of using LR parsers. Plenty of us still like LR parser generators (see my other comment.)
In between "easiest to get started with" and "what production-grade systems use", there is "easy to actually finish a medium-sized project with." I think LR parsers still defend that middle ground pretty well.
Pratt parsers are really fun, if slightly mind-bending. Their ability to handle odd associativities, precedences, and arities is basically unmatched, making them really useful to embed inside recursive descent for when you reach expressions. If you need infix and mixfix operators anyway.
In terms of language-agnosticism, you can use Bison to calculate the tables (the hard part) and dump an xml file, then implement the machine yourself trivially.
I get really annoyed when people still complain about YACC while ignoring the four decades of practical improvement that Bison has given us if you bother to configure it.
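The machine in question really is small. Below is a sketch of the LR driver loop in Python; the toy grammar `S -> S '+' n | n` and its ACTION/GOTO tables are written out by hand for illustration, standing in for tables extracted from Bison's XML dump.

```python
# Hand-written SLR tables for  S -> S '+' n | n  ("$" marks end of input).
# In the workflow described above these would come from Bison's --xml output.
ACTION = {
    (0, "n"): ("shift", 1),
    (1, "+"): ("reduce", "S", 1), (1, "$"): ("reduce", "S", 1),
    (2, "+"): ("shift", 3),       (2, "$"): ("accept",),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", "S", 3), (4, "$"): ("reduce", "S", 3),
}
GOTO = {(0, "S"): 2}

def parse(tokens):
    tokens = tokens + ["$"]
    states, trees, i = [0], [], 0
    while True:
        act = ACTION.get((states[-1], tokens[i]))
        if act is None:
            raise SyntaxError(f"unexpected {tokens[i]!r}")
        if act[0] == "accept":
            return trees[0]
        if act[0] == "shift":
            trees.append(tokens[i])
            states.append(act[1])
            i += 1
        else:  # reduce: pop |rhs| symbols, push the nonterminal, follow GOTO
            _, lhs, n = act
            children = trees[-n:]
            del trees[-n:], states[-n:]
            trees.append((lhs, children))
            states.append(GOTO[(states[-1], lhs)])
```

The loop never changes between grammars; only the two tables do, which is why generating the tables is the hard part.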
A middle ground that I think is sometimes useful is to use an LR parser generator to check if the grammar is ambiguous, but use recursive descent for the actual implementation. Since we won't actually use any code from the LR parser generator, you can pick whatever one you prefer regardless of the programming language.
It's trivial to get a recursive descent parser without any ambiguities hidden in it if you don't go the PEG route (which is only unambiguous because you always pick the first choice, which might not be what you want). Just always branch on the current token. No way to have an ambiguity like that.
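The "always branch on the current token" discipline looks like this in practice. A small Python sketch with an invented statement grammar; keywords are assumed distinguishable from plain identifiers, as a real lexer would guarantee:

```python
# Predictive recursive descent: every alternative is selected by the
# current token alone, so no alternative can silently shadow another
# (the PEG first-match pitfall). Returns (node, next position).
def parse_stmt(toks, i=0):
    if toks[i] == "print":                  # stmt -> "print" NAME
        return ("print", toks[i + 1]), i + 2
    if toks[i] == "skip":                   # stmt -> "skip"
        return ("skip",), i + 1
    if toks[i].isidentifier():              # stmt -> NAME "=" NAME
        if toks[i + 1] != "=":
            raise SyntaxError("expected '='")
        return ("assign", toks[i], toks[i + 2]), i + 3
    raise SyntaxError(f"no rule starts with {toks[i]!r}")
```

If two rules ever started with the same token, the second `if` would be unreachable and the overlap would be obvious in the code itself.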
I've been having thoughts along these lines. Earley parsers match recursive descent really nicely. In my head there'd be an Earley parser "oracle": you'd tell the oracle about the operations you've performed (in terms of terminal consumption), and then you can ask the oracle which recursive descent subfunctions are safe to call (based on the prediction phase).
Agree completely, and I've used a bunch of them and also functional combinator libraries. I'd go further and say the recursive descent and Pratt approach is the way if you want to offer useful error messages and feedback to the user. They're also trivial to debug and test, unlike any generation-based approach.
To add to your survey: I have been reading the lark documentation (https://github.com/lark-parser/lark) and like the cut of its jib. I have not used it yet, as I don't really have any projects that need a full parser.
I recently wrote a small C compiler that uses a recursive descent parser, although this should not be possible if you just look at the syntax grammar. Why? Because it looks at some semantic information about the class of identifiers, e.g. whether they are variables or typedefs. On the other hand this is not very surprising, because in the days when C was developed, easy parsing was a practical necessity: it was not an academic research thing, but something that just had to work.
Recursive descent parsers can simply be implemented with recursive functions. Implementing semantic checks becomes easy with additional parameters.
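For instance, the typedef case above can be handled by threading the semantic information through as a parameter. A hypothetical Python sketch over an invented mini-grammar, distinguishing C's `T * x;` (declaration) from `a * b;` (expression statement):

```python
# The set of typedef names is passed into the parse function, so the
# same token sequence parses differently depending on semantic context.
def parse_statement(toks, typedefs):
    if toks[0] in typedefs:
        # declaration: TYPE '*'* NAME ';'
        stars, i = 0, 1
        while toks[i] == "*":
            stars += 1
            i += 1
        return ("declare", toks[0], stars, toks[i])
    return ("expr", toks)  # otherwise an expression statement
```

The caller grows the `typedefs` set as it sees `typedef` declarations, which is exactly the feedback loop the syntax grammar alone cannot express.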
It sounds like you're describing the Lexer Hack (https://en.wikipedia.org/wiki/Lexer_hack). That trick works just the same in an LR parser, so I wouldn't count it as an advantage of recursive descent.
There is a straightforward technique for writing unambiguous recursive descent parsers. The high level algorithm is this: parsing always consumes one character, never backtracks and is locally unambiguous.
You then construct the parser by combining unambiguous parsers from the bottom up. The result ends up unambiguous by construction.
This high level algorithm is much easier to implement without a global lexer. Global lexing can be a source of inadvertent ambiguity. Strings make this obvious. If instead, you lex in a context specific way, it is usually easy to efficiently eliminate ambiguities.
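A small Python sketch of context-specific lexing under these rules, with an invented value grammar: each sub-parser commits on one character, never backtracks, and lexes its own tokens, so a quote inside a string can never be mistaken for one that opens a string.

```python
# Scannerless sketch: dispatch on the current character, then lex in the
# context of the chosen rule. Returns (node, next position in the input).
def parse_value(s, i=0):
    if s[i] == '"':                        # string: lex inside string context
        j = i + 1
        while s[j] != '"':
            j += 1
        return ("str", s[i + 1 : j]), j + 1
    if s[i].isdigit():                     # number: lex inside number context
        j = i
        while j < len(s) and s[j].isdigit():
            j += 1
        return ("num", int(s[i:j])), j
    raise SyntaxError(f"unexpected {s[i]!r}")
```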
Funny, I wrote a recursive descent parser in 1982, in Fortran, to parse the syntax of the Jovial programming language. That was my first ever professional programming project, with no university degree in CS, or job experience. Note, Fortran (78) is a terrible language to write a parser in.
I wish I could have saved the source. It would be fun to see it.
Recursive descent is a cleaner way to go when the language cannot be lexed without feedback from the parser and semantics, like Fortran. And parser combinators make RD straightforward to code.
Which mainline compilers or runtimes use a generated parser? I know that CRuby does, though they've recently standardized on Prism as their public AST, and it's possible that they'll switch to Prism for parsing eventually. I know that Go used to, as well as ancient versions of GCC.
It seems that, from the outside looking in, ~all significant PL projects end up using a hand-written recursive descent parser, eventually.
Recursive descent parsers are usually what I do for my little domain-specific scripting languages. They are just easy and straightforward. I do like things like ANTLR, but most of the time it seems unnecessary.
I have heard that RDP is prominent in production parsers; I wonder, is that true? And is it pure handwritten RDP, or combined with other automated techniques?
One reason hand-written recursive-descent parsers are common is because a lot of languages are poorly designed, and it's easier to hack around the mistakes in a hand-written parser.
For new languages this should be avoided - just design a sane grammar in the first place.
Agreed. I recently had Gemini write a recursive descent parser for a specified subset of C in C and it did quite well. I've tried similar with Claude 4 and Qwen3 Coder and again, both did quite well.
Have people heard of the following top-down parsing algorithm for mathematical expressions:
1. Replace any expression that's within parentheses by its parse tree by using recursion
2. Find the lowest precedence operator, breaking ties however you'd like. Call this lowest precedence operator OP.
3. View the whole unparsed expression as `x OP y`
4. Generate a parse tree for x and for y. Call them P(x) and P(y).
5. Return ["OP", P(x), P(y)].
It's easy to speed up step 2 by keeping a table of all the operators in an expression, sorted by their precedence levels. For this table to work properly, the positions of all the tokens must never change.
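A Python sketch of steps 2-5, assuming parenthesized spans were already replaced per step 1 and every operand is a single token. Ties are broken here by splitting at the rightmost lowest-precedence operator, which makes equal-precedence operators left-associative:

```python
# Invented precedence table for the sketch; higher binds tighter.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expr(tokens):
    if len(tokens) == 1:          # a bare operand
        return tokens[0]
    # step 2: lowest-precedence operator, rightmost on ties
    # (operands get precedence 99 so they are never chosen)
    split = min(range(len(tokens)),
                key=lambda k: (PREC.get(tokens[k], 99), -k))
    op = tokens[split]
    # steps 3-5: view the whole thing as  x OP y  and recurse on both sides
    return [op, parse_expr(tokens[:split]), parse_expr(tokens[split + 1:])]
```

Splitting at the leftmost lowest-precedence operator instead would make the same operators right-associative, so the tie-break rule is part of the language definition, not a free choice.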
For 2, I don’t think you can break ties however you like, because this would give you random left or right associativity (https://en.m.wikipedia.org/wiki/Operator_associativity). For example, 2-4-7 would be either (2-4)-7 or 2-(4-7), depending on how you broke the tie.
fuzztester | 8 months ago:
What makes OCaml suited for that?
userbinator | 8 months ago:
It seems to be mainly academics and others interested in parsing theory, and those who like complexity for the sake of complexity.
randomNumber7 | 8 months ago:
It is also written in a badass style and argues that this is superior to parser generators.
https://dl.acm.org/doi/pdf/10.1145/512927.512931
pratt4the_win | 8 months ago:
For those to whom they are new: I found them a little tricky to implement directly from Pratt's paper or even Crockford's javascript that popularized them.
So, through trial and error I figured out how to actually implement them in regular languages (i.e. not in Lisp).
If it helps, examples in C and Go are here:
https://github.com/glycerine/PrattParserInC
https://github.com/glycerine/zygomys/blob/master/zygo/pratt....
I find them easier to work with than the cryptic LALR(1) bison/yacc tools, but then I never really felt like I mastered yacc to begin with.
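For comparison, the core Pratt loop is only a few lines in an ordinary imperative language. A minimal Python sketch with invented binding powers (atoms only, no prefix operators, for brevity):

```python
# Binding powers: higher binds tighter; "^" is right-associative.
BP = {"+": 10, "-": 10, "*": 20, "/": 20, "^": 30}
RIGHT_ASSOC = {"^"}

def parse(tokens, pos=0, min_bp=0):
    lhs, pos = tokens[pos], pos + 1        # "nud": consume an atom
    # keep consuming infix operators that bind tighter than min_bp
    while pos < len(tokens) and BP.get(tokens[pos], 0) > min_bp:
        op, pos = tokens[pos], pos + 1
        # right-associative ops re-enter with a slightly lower bar,
        # so an equal-precedence op on the right still wins
        next_bp = BP[op] - 1 if op in RIGHT_ASSOC else BP[op]
        rhs, pos = parse(tokens, pos, next_bp)
        lhs = (op, lhs, rhs)
    return lhs, pos
```

Adding an operator, or flipping its associativity, is a one-entry table change; no grammar rewriting required.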
zahlman | 8 months ago:
https://pypi.org/project/pybison/ , or its predecessors such as https://pypi.org/project/ply/ ?
But yes, the decidedly non-traditional https://github.com/pyparsing/pyparsing/ is certainly more popular.
WalterBright | 8 months ago:
What a waste of time. I failed miserably.
However, I also realized that the only semantic information needed was to keep track of typedefs. That made recursive descent practical and effective.
fuzztester | 8 months ago:
I am interested in that area, and reading up and learning about it.