I last played around with ANTLR in 2012 (when it was version 3) and I discovered that there's a "bigger picture" to the parser generator universe that most tutorials don't talk about:
1) ANTLR is a good tool for generating "happy path" parsers. With a grammar specification, it easily generates a parser that accepts or rejects a piece of source code. However, it's not easy to use the hooks to generate high quality diagnostic error messages.
2) ANTLR was not good for speculative parsing or probabilistic parsing, which is the basis of today's generation of tools such as "Intellisense": not giving up on parsing when there's an unclosed brace or a missing variable declaration.
The common theme of the two points above is that a high-quality compiler written by hand holds multiple "states" of information, and an ANTLR grammar file doesn't really have an obvious way to express that knowledge. A pathological example is the numerous "broken HTML" pages successfully parsed by browsers. It would be very hard to replicate how Chrome/Firefox/Safari/IE don't choke on broken HTML by using ANTLR to generate an HTML parser.
In short, ANTLR is great for prototyping a parser but any industrial-grade parser released into the wild with programmers' expectations of helpful error messages would require a hand-written parser.
Lastly, lexing (creating the tokens) and parsing (creating the AST) are a very tiny percentage of the total development of a quality compiler. Therefore, ANTLR doesn't save as much time as one might think.
I welcome any comments about v4 that make those findings obsolete.
The vast majority of the time it is totally okay to quit on a single syntax error. Not everybody needs "speculative parsing"; most of us don't, since we are not all designing HTML parsers and IDE assists. ANTLR / Bison are fantastic for parsing grammars, and parsers written in them are a million times more maintainable than a 1000-line hand-rolled parser in C that's ten years old and has been touched by 20 hands. You really can't outperform a parser generator unless you live in an ivory tower where no one will ever modify your code and you have unlimited time to reinvent every wheel that Bison / ANTLR have built in.
> In short, ANTLR is great for prototyping a parser but any industrial-grade parser released into the wild with programmers' expectations of helpful error messages would require a hand-written parser.
What an odd statement, in light of the innumerable deployments of Bison / ANTLR parsers you certainly use at least once a day (if you spend any time at all in a terminal).
I guess ANTLR does have some support for parsing outside of the happy path: as far as I know it's also used as the parsing part of the Xtext framework, which lets you create custom languages and will also generate Eclipse plugins for editing them. Those plugins need to be error tolerant to provide intellisense support during editing, and at least from a user's point of view that seems to work.
> 2) ANTLR was not good for speculative parsing or probabilistic parsing which would be the basis of today's generation of tools such as "Intellisense" not giving up on parsing when there's an unclosed brace or missing variable declaration.
ANTLR does have some generic error recovery capabilities, such as single-token insertion/deletion or re-sync recovery.
These kinds of generic heuristics are not something that would be simple to implement in most hand-built parsers, as they require full knowledge of the entire grammar, and that information is often hard-coded as source code rather than kept as a separate data structure.
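For illustration, here is a sketch of what that kind of re-sync recovery looks like when coded by hand (hypothetical token names, not ANTLR's actual algorithm); note how the sync set is exactly the grammar-wide knowledge that tends to get hard-coded:

```python
# Sketch of panic-mode ("re-sync") error recovery in a hand-written parser.
# The sync set is the kind of whole-grammar knowledge a generator derives
# automatically but a hand-rolled parser has to hard-code per rule.

class Parser:
    # Hypothetical statement-level sync tokens for a C-like grammar.
    STMT_SYNC = {";", "}", "EOF"}

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.errors = []

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "EOF"

    def advance(self):
        tok = self.peek()
        if self.pos < len(self.tokens):
            self.pos += 1
        return tok

    def expect(self, kind):
        if self.peek() == kind:
            return self.advance()
        # Record a diagnostic instead of aborting, then re-sync:
        self.errors.append(f"expected {kind!r}, found {self.peek()!r}")
        while self.peek() not in self.STMT_SYNC:
            self.advance()          # skip tokens until a synchronization point
        if self.peek() == ";":
            self.advance()          # consume the terminator and keep parsing
        return None

parser = Parser(["x", "=", "1", "oops", ";", "y", ";"])
parser.expect("x"); parser.expect("="); parser.expect("1"); parser.expect(";")
# One error recorded for "oops"; parsing resumes cleanly at "y".
```

After the bad token, the parser skips to the next `;` and continues, so later statements still get parsed and diagnosed.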
ANTLR has one of the best error handling strategies of the tools I've used; what do you use that is better in this regard (maybe I need to expand my horizon a bit)?
I tried some of the earlier versions of the JS target (v4). Like you say: the happy path works well. What I disliked is that, in unpredictable cases, there were nulls in the generated output. This made me feel doubtful about using it, so I wrote my own parser by hand. It turned out to be a good decision: it was fun, and I learned a lot.
I switched over to ANTLR 4. It is strictly superior to ANTLR 3. The listener approach, rather than embedding code in the grammar, is very natural; the separation leads to clean grammars and clean action code. The odd thing is that I was stuck on 3 because 4 didn't support C yet, and then I just switched to the Java target in an anti-C pique. Shoulda done that awhile ago.
Terence Parr's The Definitive ANTLR 4 Reference is quite good. And so's this mega tutorial.
I think everyone should manually implement a simple recursive descent parser at least once in their careers. It's surprisingly easy, and really (in my experience) helps to break through the mental barrier of parsers being magical black boxes.
Plus, once you have an understanding of recursive descent parsing, it's a relatively small leap to recursive descent code generation. And once you're there, you have a pretty good high-level understanding of the entire compilation pipeline (minus optimization).
Then all of a sudden, compilers are a whole lot less impenetrable.
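For anyone curious what that first step looks like, here is a self-contained sketch of a recursive descent evaluator for arithmetic, with one function per grammar rule:

```python
# Toy recursive descent parser/evaluator for:
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        cur = self.peek()
        if tok is not None and cur != tok:
            raise SyntaxError(f"expected {tok!r}, got {cur!r}")
        self.pos += 1
        return cur

    def expr(self):                      # expr -> term (('+'|'-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.eat()
            value = value + self.term() if op == "+" else value - self.term()
        return value

    def term(self):                      # term -> factor (('*'|'/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.eat()
            value = value * self.factor() if op == "*" else value / self.factor()
        return value

    def factor(self):                    # factor -> NUMBER | '(' expr ')'
        if self.peek() == "(":
            self.eat("(")
            value = self.expr()
            self.eat(")")
            return value
        return int(self.eat())

print(Parser(tokenize("2 + 3 * (4 - 1)")).expr())  # → 11
```

The grammar's structure maps one-to-one onto the call graph, which is exactly why this style demystifies parsers so quickly.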
On that note, Parr also has a nice little book called "Language Implementation Patterns" [0] which introduces things on a level of "this kind of language motif maps to this kind of parsing code".
Classic regular expressions are level 3 in the Chomsky classification, but most regex implementations have features that aren't a part of "regular languages".
The upside is that they are more expressive, but the downside is that it's no longer O(n) to parse a string.
The worst case for some regexes can be exponential; never run untrusted regexes on your server unless you use a safe implementation like RE2, which was developed for Google Code Search.
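The classic demonstration uses nested quantifiers; a small sketch with the standard `re` module (pattern chosen purely for illustration):

```python
# Classic catastrophic-backtracking example: the nested quantifiers in
# (a+)+$ force the backtracking engine to try roughly 2^n ways to split
# the run of 'a's before it can report failure.
import re

evil = re.compile(r"(a+)+$")

# Matching succeeds quickly:
assert evil.match("a" * 15)

# Failing is the expensive case: each extra 'a' roughly doubles the work.
# 15 chars is still fast; 30+ would hang a backtracking engine.
assert evil.match("a" * 15 + "b") is None
```

RE2 avoids this by construction: it compiles patterns to an automaton and guarantees linear-time matching, at the cost of dropping features like backreferences.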
I used ANTLR to write a fuzz testing tool which parses an ABNF grammar (like in an IETF RFC) and then generates random output which matches the grammar. Worked great!
For .Net projects, I've used Irony. From the CodePlex site:
"Unlike most existing yacc/lex-style solutions Irony does not employ any scanner or parser code generation from grammar specifications written in a specialized meta-language. In Irony the target language grammar is coded directly in c# using operator overloading to express grammar constructs. "
I used Irony in a very successful project too. It worked beautifully and got its job done. The downside is that there is less documentation available than for ANTLR.
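The operator-overloading idea isn't C#-specific. As a sketch (a hypothetical mini-API, not Irony's actual one), the same style in Python, where `|` is ordered choice and `+` is sequencing:

```python
# Grammar-as-host-language-code in the Irony style: operator overloading
# builds the parser directly, with no meta-language or generation step.
# A parser here is a function (s, i) -> new index, or None on failure.

class P:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, s, i=0):
        return self.fn(s, i)

    def __add__(self, other):      # sequence: self then other
        return P(lambda s, i: other(s, j)
                 if (j := self(s, i)) is not None else None)

    def __or__(self, other):       # ordered choice: self, else other
        return P(lambda s, i: j
                 if (j := self(s, i)) is not None else other(s, i))

def lit(ch):
    """Parser for a single literal character."""
    return P(lambda s, i: i + 1 if i < len(s) and s[i] == ch else None)

# digit -> '0' | '1' ; pair -> digit digit  (built with overloaded operators)
digit = lit("0") | lit("1")
pair = digit + digit

assert pair("10") == 2       # consumed both characters
assert pair("1x") is None    # second digit fails, so the sequence fails
```

Real libraries (Irony, parsec-style combinator families) flesh this out with AST construction and error reporting, but the core trick is just this: the grammar is ordinary code.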
I used ANTLR v3 to create an NLP parser for calendar events for iOS and Android. It took longer than expected. The iOS + C runtime was opaque, so I had to write a tool for debugging. The Android + Java runtime overran memory, so I had to break it into separate grammars. Of course, NLP is not a natural fit. I don't know which of these problems are fixed by v4.
> The most obvious is the lack of recursion: you can’t find a (regular) expression inside another one ...
PCRE has some recursion. Here is an example for parsing anything between { }, with counting of inner brackets:
'(?>\{(?:[^{}]|(?R))\})|\w+'
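Python's standard `re` module has no `(?R)`, but the recursion it expresses is easy to write out by hand; a sketch of the same balanced-brace matching:

```python
# What PCRE's (?R) does for balanced braces, hand-rolled: match a {...}
# block, recursing into nested blocks just as the pattern recurses into
# itself.

def match_braces(s, i=0):
    """If s[i] == '{', return the index just past its matching '}', else None."""
    if i >= len(s) or s[i] != "{":
        return None
    i += 1
    while i < len(s) and s[i] != "}":
        if s[i] == "{":
            end = match_braces(s, i)    # recurse, like (?R)
            if end is None:
                return None             # unbalanced inner block
            i = end
        else:
            i += 1
    return i + 1 if i < len(s) else None

print(match_braces("{a{b}c}d"))  # → 7 (index just past the whole block)
```

This is also a nice illustration of why recursion takes you out of regular-language territory: the matcher needs unbounded nesting depth, which a finite automaton can't track.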
A C++11 constexpr can make hand-coded parsers a lot more readable by allowing token names in case statements. For example, search for "str2int" in the following island parser: https://github.com/musesum/par
These types of tutorials always start out explaining the problems with regular expressions and why not to use them... then immediately proceed into lexing via regular expressions.
Perhaps the tutorials should start with the strengths of regular expressions, and how we can harness that for getting started with a lexer.
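That's less contradictory than it sounds: each individual token class is a regular language, so regexes are exactly the right tool at the lexing level even though they can't handle the recursive structure above it. A minimal sketch:

```python
# Minimal regex-driven lexer: one named group per token class, combined
# into a single master pattern.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
    ("ERROR",  r"."),          # catch-all: anything unmatched is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    for m in MASTER.finditer(src):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {m.group()!r}")
        yield (kind, m.group())

print(list(tokenize("x = 40 + 2")))
# → [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```

The lexer's output stream is then what the (non-regular) parser consumes, which is the usual division of labor.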
This tutorial looks great. I picked up Antlr4 a few months ago, and hadn't done any parsing before then. The first week was basically me, The Definitive Antlr4 Reference, and extreme confusion with how different targets worked. Compounding the problem was the fact that a lot of the antlr4 example grammars only work for a specific target. The use of different language implementations as part of this tutorial seems really useful!
I used ANTLR to write a Python parser for the SNOMED expression language last year, and testing it was one of the weirder parts of the experience. I was up and running in a few days, which was largely thanks to the ANTLR book. I love this project. It made doing what I did a lot more fun than I thought it would be. Hand-rolling an ABNF parser from scratch would be a nice hobby project, but not when one has a deadline.
bd82|9 years ago
There is a whole chapter describing the algorithms in the ANTLR book: http://media.pragprog.com/titles/tpantlr/errors.pdf (unfortunately this link does not contain the full content of the chapter).
pjmlp|9 years ago
Also many languages don't survive the prototyping phase into some kind of product, so why bother with premature optimizations?
CalChris|9 years ago
https://pragprog.com/book/tpantlr2/the-definitive-antlr-4-re...
ANTLR is my goto tool for DSLs.
Terr_|9 years ago
[0] https://pragprog.com/book/tpdsl/language-implementation-patt...
raverbashing|9 years ago
REs are level 3 https://en.wikipedia.org/wiki/Chomsky_hierarchy
base698|9 years ago
Similar to: "What's Cascading about Cascading Style Sheets?"
nradov|9 years ago
https://github.com/nradov/abnffuzzer
pjmlp|9 years ago
I wasn't aware it supports JavaScript nowadays.
In any case, good selection of languages.
destructaball|9 years ago
https://github.com/aslatter/parsec
closed|9 years ago
(Antlr4 is awesome :)