I last played around with ANTLR in 2012 (when it was version 3) and I discovered that there's a "bigger picture" to the parser generator universe that most tutorials don't talk about:
1) ANTLR is a good tool for generating "happy path" parsers. With a grammar specification, it easily generates a parser that accepts or rejects a piece of source code. However, it's not easy to use the hooks to generate high quality diagnostic error messages.
2) ANTLR was not good for speculative parsing or probabilistic parsing, which is the basis of today's generation of tools such as "Intellisense": not giving up on parsing when there's an unclosed brace or a missing variable declaration.
The common theme of the two points above is that a high-quality compiler written by hand holds multiple "states" of information, and an ANTLR grammar file doesn't really have an obvious way to express that knowledge. A pathological example is the numerous "broken HTML" pages successfully parsed by browsers. It would be very hard to replicate how Chrome/Firefox/Safari/IE don't choke on broken HTML by using ANTLR to generate an HTML parser.
In short, ANTLR is great for prototyping a parser but any industrial-grade parser released into the wild with programmers' expectations of helpful error messages would require a hand-written parser.
Lastly, lexing (creating the tokens) and parsing (creating the AST) are a very tiny percentage of the total development of a quality compiler. Therefore, ANTLR doesn't save as much time as one might think.
I welcome any comments about v4 that make those findings obsolete.
The vast majority of the time it is totally okay to quit on a single syntax error. Not everybody needs "speculative parsing"; most of us don't, since we are not all designing HTML parsers and IDE assists. ANTLR / Bison are fantastic for parsing grammars, and parsers written in them are a million times more maintainable than a 1000-line hand-rolled parser in C that's ten years old and has been touched by 20 hands. You really can't outperform a parser generator unless you live in an ivory tower where no one will ever modify your code and you have unlimited time to reinvent every wheel that Bison / ANTLR have built in.
> In short, ANTLR is great for prototyping a parser but any industrial-grade parser released into the wild with programmers' expectations of helpful error messages would require a hand-written parser.
What an odd statement, in light of the innumerable deployments of Bison / ANTLR parsers you certainly use at least once a day (if you spend any time at all in a terminal).
I guess ANTLR does have some support for parsing outside of the happy path: as far as I know it's also used as the parsing part of the Xtext framework, which lets you create custom languages and will also generate Eclipse plugins for editing them. Those plugins need to be error tolerant to provide intellisense support during editing, and at least from a user's point of view that seems to work.
> 2) ANTLR was not good for speculative parsing or probabilistic parsing which would be the basis of today's generation of tools such as "Intellisense" not giving up on parsing when there's an unclosed brace or missing variable declaration.
ANTLR does have some generic error recovery capabilities, such as single-token insertion/deletion or re-sync recovery.
These kinds of generic heuristics are not something that would be simple to implement in most hand-built parsers, as they require full knowledge of the entire grammar, and that information is often hard-coded as source code rather than kept as a separate data structure.
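For illustration, here is a sketch of what that kind of re-sync recovery looks like when coded by hand (hypothetical token names, not ANTLR's actual algorithm); note how the sync set is exactly the grammar-wide knowledge that tends to get hard-coded:

```python
# Sketch of panic-mode ("re-sync") error recovery in a hand-written parser.
# The sync set is the kind of whole-grammar knowledge a generator derives
# automatically but a hand-rolled parser has to hard-code per rule.

class Parser:
    # Hypothetical statement-level sync tokens for a C-like grammar.
    STMT_SYNC = {";", "}", "EOF"}

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.errors = []

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "EOF"

    def advance(self):
        tok = self.peek()
        if self.pos < len(self.tokens):
            self.pos += 1
        return tok

    def expect(self, kind):
        if self.peek() == kind:
            return self.advance()
        # Record a diagnostic instead of aborting, then re-sync:
        self.errors.append(f"expected {kind!r}, found {self.peek()!r}")
        while self.peek() not in self.STMT_SYNC:
            self.advance()          # skip tokens until a synchronization point
        if self.peek() == ";":
            self.advance()          # consume the terminator and keep parsing
        return None

parser = Parser(["x", "=", "1", "oops", ";", "y", ";"])
parser.expect("x"); parser.expect("="); parser.expect("1"); parser.expect(";")
# One error recorded for "oops"; parsing resumes cleanly at "y".
```

After the bad token, the parser skips to the next `;` and continues, so later statements still get parsed and diagnosed.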
ANTLR has one of the best error handling strategies of the tools I've used; what do you use that is better in this regard (maybe I need to expand my horizon a bit)?
I tried some of the earlier versions of the JS target (v4). Like you say: the happy path works well. What I disliked is that, in unpredictable cases, there were nulls in the generated output. This made me feel doubtful about using it, so I wrote my own parser by hand. It turned out to be a good decision: it was fun, and I learned a lot.
I switched over to ANTLR 4. It is strictly superior to ANTLR 3. The listener approach, rather than embedding code in the grammar, is very natural; the separation leads to clean grammars and clean action code. The odd thing is that I was stuck on 3 because 4 didn't support C yet, and then I just switched to the Java target in an anti-C pique. Shoulda done that awhile ago.
Terence Parr's The Definitive ANTLR 4 Reference is quite good. And so's this mega tutorial.
I think everyone should manually implement a simple recursive descent parser at least once in their careers. It's surprisingly easy, and really (in my experience) helps to break through the mental barrier of parsers being magical black boxes.
Plus, once you have an understanding of recursive descent parsing, it's a relatively small leap to recursive descent code generation. And once you're there, you have a pretty good high-level understanding of the entire compilation pipeline (minus optimization).
Then all of a sudden, compilers are a whole lot less impenetrable.
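For anyone curious what that first step looks like, here is a self-contained sketch of a recursive descent evaluator for arithmetic, with one function per grammar rule:

```python
# Toy recursive descent parser/evaluator for:
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        cur = self.peek()
        if tok is not None and cur != tok:
            raise SyntaxError(f"expected {tok!r}, got {cur!r}")
        self.pos += 1
        return cur

    def expr(self):                      # expr -> term (('+'|'-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.eat()
            value = value + self.term() if op == "+" else value - self.term()
        return value

    def term(self):                      # term -> factor (('*'|'/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.eat()
            value = value * self.factor() if op == "*" else value / self.factor()
        return value

    def factor(self):                    # factor -> NUMBER | '(' expr ')'
        if self.peek() == "(":
            self.eat("(")
            value = self.expr()
            self.eat(")")
            return value
        return int(self.eat())

print(Parser(tokenize("2 + 3 * (4 - 1)")).expr())  # → 11
```

The grammar's structure maps one-to-one onto the call graph, which is exactly why this style demystifies parsers so quickly.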
On that note, Parr also has a nice little book called "Language Implementation Patterns" [0] which introduces things on a level of "this kind of language motif maps to this kind of parsing code".
Classic regular expressions are level 3 in the Chomsky classification, but most regex implementations have features that aren't a part of "regular languages".
The upside is that they are more expressive, but the downside is that it's no longer O(n) to parse a string.
The worst case for some regexes can be exponential; never run untrusted regexes on your server unless you use a safe implementation like RE2, which was developed for Google Code Search.
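The classic demonstration uses nested quantifiers; a small sketch with the standard `re` module (pattern chosen purely for illustration):

```python
# Classic catastrophic-backtracking example: the nested quantifiers in
# (a+)+$ force the backtracking engine to try roughly 2^n ways to split
# the run of 'a's before it can report failure.
import re

evil = re.compile(r"(a+)+$")

# Matching succeeds quickly:
assert evil.match("a" * 15)

# Failing is the expensive case: each extra 'a' roughly doubles the work.
# 15 chars is still fast; 30+ would hang a backtracking engine.
assert evil.match("a" * 15 + "b") is None
```

RE2 avoids this by construction: it compiles patterns to an automaton and guarantees linear-time matching, at the cost of dropping features like backreferences.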
I used ANTLR to write a fuzz testing tool which parses an ABNF grammar (like in an IETF RFC) and then generates random output which matches the grammar. Worked great!
For .Net projects, I've used Irony. From the CodePlex site:
"Unlike most existing yacc/lex-style solutions Irony does not employ any scanner or parser code generation from grammar specifications written in a specialized meta-language. In Irony the target language grammar is coded directly in c# using operator overloading to express grammar constructs. "
I used Irony in a very successful project too. It worked beautifully and got its job done. The downside is that there is less documentation available than for ANTLR.
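The operator-overloading idea isn't C#-specific. As a sketch (a hypothetical mini-API, not Irony's actual one), the same style in Python, where `|` is ordered choice and `+` is sequencing:

```python
# Grammar-as-host-language-code in the Irony style: operator overloading
# builds the parser directly, with no meta-language or generation step.
# A parser here is a function (s, i) -> new index, or None on failure.

class P:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, s, i=0):
        return self.fn(s, i)

    def __add__(self, other):      # sequence: self then other
        return P(lambda s, i: other(s, j)
                 if (j := self(s, i)) is not None else None)

    def __or__(self, other):       # ordered choice: self, else other
        return P(lambda s, i: j
                 if (j := self(s, i)) is not None else other(s, i))

def lit(ch):
    """Parser for a single literal character."""
    return P(lambda s, i: i + 1 if i < len(s) and s[i] == ch else None)

# digit -> '0' | '1' ; pair -> digit digit  (built with overloaded operators)
digit = lit("0") | lit("1")
pair = digit + digit

assert pair("10") == 2       # consumed both characters
assert pair("1x") is None    # second digit fails, so the sequence fails
```

Real libraries (Irony, parsec-style combinator families) flesh this out with AST construction and error reporting, but the core trick is just this: the grammar is ordinary code.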
I used ANTLR v3 to create an NLP parser for calendar events for iOS and Android. It took longer than expected. The iOS + C runtime was opaque, so I had to write a tool for debugging. The Android + Java runtime overran memory, so I had to break it into separate grammars. Of course, NLP is not a natural fit. I don't know which of these problems are fixed by v4.
> The most obvious is the lack of recursion: you can’t find a (regular) expression inside another one ...
PCRE has some recursion. Here is an example for parsing anything between { }, with counting of inner brackets:
'(?>\{(?:[^{}]|(?R))\})|\w+'
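Python's standard `re` module has no `(?R)`, but the recursion it expresses is easy to write out by hand; a sketch of the same balanced-brace matching:

```python
# What PCRE's (?R) does for balanced braces, hand-rolled: match a {...}
# block, recursing into nested blocks just as the pattern recurses into
# itself.

def match_braces(s, i=0):
    """If s[i] == '{', return the index just past its matching '}', else None."""
    if i >= len(s) or s[i] != "{":
        return None
    i += 1
    while i < len(s) and s[i] != "}":
        if s[i] == "{":
            end = match_braces(s, i)    # recurse, like (?R)
            if end is None:
                return None             # unbalanced inner block
            i = end
        else:
            i += 1
    return i + 1 if i < len(s) else None

print(match_braces("{a{b}c}d"))  # → 7 (index just past the whole block)
```

This is also a nice illustration of why recursion takes you out of regular-language territory: the matcher needs unbounded nesting depth, which a finite automaton can't track.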
A C++11 constexpr can make hand-coded parsers a lot more readable by allowing token names in case statements. For example, search for "str2int" in the following island parser: https://github.com/musesum/par
These types of tutorials always start out explaining the problems with regular expressions and why not to use them... then immediately proceed into lexing via regular expressions.
Perhaps the tutorials should start with the strengths of regular expressions, and how we can harness that for getting started with a lexer.
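That's less contradictory than it sounds: each individual token class is a regular language, so regexes are exactly the right tool at the lexing level even though they can't handle the recursive structure above it. A minimal sketch:

```python
# Minimal regex-driven lexer: one named group per token class, combined
# into a single master pattern.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
    ("ERROR",  r"."),          # catch-all: anything unmatched is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    for m in MASTER.finditer(src):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {m.group()!r}")
        yield (kind, m.group())

print(list(tokenize("x = 40 + 2")))
# → [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```

The lexer's output stream is then what the (non-regular) parser consumes, which is the usual division of labor.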
This tutorial looks great. I picked up Antlr4 a few months ago, and hadn't done any parsing before then. The first week was basically me, The Definitive Antlr4 Reference, and extreme confusion with how different targets worked. Compounding the problem was the fact that a lot of the antlr4 example grammars only work for a specific target. The use of different language implementations as part of this tutorial seems really useful!
I used ANTLR to write a Python parser for the SNOMED expression language last year, and testing it was one of the weirder parts of the experience. I was up and running in a few days, which was largely thanks to the ANTLR book. I love this project. It made doing what I did a lot more fun than I thought it would be. Hand-rolling an ABNF parser from scratch would be a nice hobby project, but not when one has a deadline.
bd82|9 years ago
There is a whole chapter describing the algorithms in the ANTLR book: http://media.pragprog.com/titles/tpantlr/errors.pdf (unfortunately this link does not contain the full content of the chapter).
pjmlp|9 years ago
Also many languages don't survive the prototyping phase into some kind of product, so why bother with premature optimizations?
CalChris|9 years ago
https://pragprog.com/book/tpantlr2/the-definitive-antlr-4-re...
ANTLR is my goto tool for DSLs.
Terr_|9 years ago
[0] https://pragprog.com/book/tpdsl/language-implementation-patt...
raverbashing|9 years ago
REs are level 3 https://en.wikipedia.org/wiki/Chomsky_hierarchy
base698|9 years ago
Similar to: "What's Cascading about Cascading Style Sheets?"
nradov|9 years ago
https://github.com/nradov/abnffuzzer
pjmlp|9 years ago
I wasn't aware it supports JavaScript nowadays.
In any case, good selection of languages.
destructaball|9 years ago
https://github.com/aslatter/parsec
closed|9 years ago
(Antlr4 is awesome :)