top | item 8505382

Parsing: The Solved Problem That Isn't (2011)

97 points | wslh | 11 years ago | tratt.net

70 comments

[+] schoen | 11 years ago
It may also be relevant to mention the language-theoretic security research program ("LANGSEC").

http://langsec.org/

They've pointed out that the difficulty of parsing, and in a sense our overconfidence that we can just code up parsers for random languages and input formats when we need them, is a pretty pervasive source of security bugs.

A lot of those bugs can occur when you have two different parsers that have a different notion of what language they're supposed to recognize, so it's possible to construct an input whose meaning the two parsers disagree on. That can have pretty serious ramifications if, for example, the first parser is deciding whether a requested action is authorized and the second parser is carrying out the action!
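A toy sketch of that parser-differential failure mode (hypothetical parsers, not from any real system): two hand-rolled parsers for the same trivial `key=value` format that disagree on duplicate keys.

```python
# Two parsers for the same "key=value;key=value" format. They disagree
# on duplicate keys: one keeps the first value, the other keeps the
# last. If the first parser authorizes the request and the second one
# executes it, an attacker can slip a forbidden action through.

def parse_first_wins(data):
    result = {}
    for pair in data.split(";"):
        key, _, value = pair.partition("=")
        if key not in result:        # first occurrence wins
            result[key] = value
    return result

def parse_last_wins(data):
    result = {}
    for pair in data.split(";"):
        key, _, value = pair.partition("=")
        result[key] = value          # last occurrence wins
    return result

payload = "action=read;action=delete"
print(parse_first_wins(payload)["action"])   # read   (what gets authorized)
print(parse_last_wins(payload)["action"])    # delete (what gets executed)
```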

I'm kind of sad about this because I love whipping up regular expressions to extract data even from things that regular expressions technically can't handle correctly. But there's a good argument to be made that this habit is playing with fire much of the time, at least in systems that will end up handling untrusted input. And the Shellshock bug is a recent example of the way that your intuitions about whether your software will "handle untrusted input" in some use case can go out of date.

[+] dkarapetyan | 11 years ago
See also this paper: http://www.ieee-security.org/TC/SPW2014/papers/5103a198.PDF. The authors implement a PDF parser and find bugs in pretty much all of the existing implementations. Basically, PDF is a pretty shoddy file format with several ill-defined corner cases. This is one of the reasons PDFs tend to be vectors for security breaches.
[+] girvo | 11 years ago
This is probably a good place to ask: I've wanted to build a language myself -- what's the best place to begin learning about parsers and the like? About a decade ago I asked this question and was told to read the "Dragon book", but I was far too young and lacked experience. Now I really want to get stuck into something outside of my day-to-day web stuff.
[+] dkarapetyan | 11 years ago
Start here: http://nathansuniversity.com/. All the stuff in the dragon book is designed for environments with limited memory and limited computing capability. There is no reason to worry about LL(k) parse table sizes or predict and follow sets when you don't have to. For most practical purposes you can get away with a basic recursive descent or PEG parser. The link I pointed you to starts with peg.js, which is a PEG parser generator in JavaScript.

If you get through PL101 then picking up the stuff in the dragon book or any other book on parsing and compiling technology will be much easier.

Another resource I like is "Compiler Design: Virtual Machines" (http://smile.amazon.com/Compiler-Design-Machines-Reinhard-Wi...). Still going through that one but it is very readable and if you go through PL101 then you'll have all the tools to implement the virtual machines described in that book. It is much easier to write a compiler to target machine code or some other language like C when you've built a few targets yourself and understand the trade-offs involved.

There's also http://www.greatcodeclub.com/. I think one of the projects is a simple virtual machine and another one is a compiler. Well worth the admission price if you're a beginner and want some help getting started.

Hanselman's rule about the finiteness of keystrokes applies, and I recently wrote some notes for budding PL enthusiasts: http://www.scriptcrafty.com/tips-for-the-budding-pl-enthusia....

[+] jbzz | 11 years ago
Parsing Techniques by Dick Grune is THE bible for parsing.
[+] keithrel | 11 years ago
Just start writing the parser/compiler/interpreter for your language, today. You will quickly learn what you need: a tokenizer/lexer, some way to represent the AST, how to represent types (if you have any), etc. Later, start looking at the literature to see how "it's really done". I find this approach perhaps takes more time, but I have a much better appreciation for what I am reading afterwards ("Oh, that's why they did X!").
[+] Dewie | 11 years ago
The dragon book (at least the first edition) has 330 pages on parsing. 330 pages. IMO that is excessive for a beginner, given how relatively easy it is to implement a parser for a fairly standard language.
[+] falsedan | 11 years ago
Jeffrey Kegler's work on Marpa is pretty exciting, but hasn't got much traction (maybe due to the implementation languages: Perl and Knuth's Literate Programming).

https://metacpan.org/pod/Marpa::R2#A-simple-calculator

[+] sklogic | 11 years ago
How exactly are "PEGs rather inexpressive"? It's still the same BNF, with some nice bells and whistles added. As for left recursion, it's not a big deal. You mostly need left recursion for binary expressions, and they are much better served by Pratt parsing (which is trivial to mix with Packrat anyway).

I moved to PEG+Pratt exclusively and have never needed anything beyond that, even for the craziest grammars imaginable.
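A minimal sketch of the Pratt/precedence-climbing idea sklogic mentions -- my own toy code, not his -- showing how left-associative binary expressions are handled without any left recursion in the grammar:

```python
import re

# Toy Pratt-style (precedence-climbing) parser for binary expressions
# over integers. Left-associativity falls out of the while-loop, so
# the grammar needs no left-recursive rules.

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def tokenize(src):
    return re.findall(r"\d+|[+\-*/()]", src)

def parse(tokens, min_prec=1):
    node = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # parse the right operand one level tighter -> left associativity
        rhs = parse(tokens, PRECEDENCE[op] + 1)
        node = (op, node, rhs)
    return node

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        node = parse(tokens)
        tokens.pop(0)  # consume the closing ")"
        return node
    return int(tok)

print(parse(tokenize("1+2*3")))   # ('+', 1, ('*', 2, 3))
print(parse(tokenize("1-2-3")))   # ('-', ('-', 1, 2), 3)
```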

[+] jbangert | 11 years ago
One issue with PEGs (and other parsers) is that they don't address unbounded count fields (or bounded count fields in an elegant manner) or offsets. This means a pure PEG can't express e.g. PDF or ZIP files. To address this, we built a PEG-based parser generator with a few new features, Nail (paper at OSDI '14, github.com/jbangert/nail).
[+] c3d | 11 years ago
The XL programming language (http://xlr.sf.net) has a rather unique approach to parsing. There is a short article about it here: http://grenouille-bouillie.blogspot.fr/2010/06/xl-axioms-rec....

XL features 8 simple node types: 4 leaves (integer, real, text, name/symbol) and 4 inner nodes (infix, prefix, postfix and block). With that, you can use a rather standard-looking syntax, yet have an inner parse tree structure that is practically as simple as Lisp. The scanner and parser together represent 1800 lines of C++ code, comments included. In other words, with such a short parser, you have a language that reads like Python but has an abstract syntax tree that is barely more complicated than Lisp.

It's also the basis for the Tao 3D document description language used at Taodyne, so the approach has demonstrated its ability to work in an industrial project.

[+] breuleux | 11 years ago
I designed the Earl Grey language (http://breuleux.github.io/earl-grey/repl/) on pretty much exactly the same principle; it's pretty interesting to see someone else had the same idea. Operator precedence grammars are surprisingly powerful and pretty simple to write (you can write a working parser in ~100 lines of Python or JavaScript, including support for multifix).

I did simplify the AST further than XL and added a few invariants. I have 3 leaves (literal, symbol, void) and 3 inner nodes (send, multi, data). The inner nodes roughly work like this:

    f x             (send f x)
    [x]             x
    [x, y, ...]     (multi x y ...)
    {x, y, ...}     (data x y ...)
    a + b           (send + (data a b))
    + a             (send + (data (void) a))
    a +             (send + (data a (void)))
[] serve as grouping brackets and are therefore purposefully erased from the parse tree if they only contain one expression. Prefix/postfix operators are interpreted as infix operators that have a blank operand on either side. Infix operators are themselves reduced to send/data, so "a + b" is equivalent to "[+]{a, b}". I find that a bit more robust and easier to manipulate generically.
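The operator desugaring described above can be sketched in a few lines (my own approximation of the stated rules, not Earl Grey's actual implementation; node shapes are represented as plain tuples):

```python
# Approximate Earl Grey's desugaring: every operator application
# becomes (send op (data lhs rhs)), with a void node standing in for
# the missing operand of a prefix or postfix operator.

VOID = ("void",)

def desugar_infix(op, lhs, rhs):
    # a + b  ->  (send + (data a b))
    return ("send", op, ("data", lhs, rhs))

def desugar_prefix(op, operand):
    # + a  ->  (send + (data (void) a))
    return desugar_infix(op, VOID, operand)

def desugar_postfix(op, operand):
    # a +  ->  (send + (data a (void)))
    return desugar_infix(op, operand, VOID)

print(desugar_infix("+", "a", "b"))  # ('send', '+', ('data', 'a', 'b'))
print(desugar_prefix("+", "a"))      # ('send', '+', ('data', ('void',), 'a'))
```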
[+] justinpombrio | 11 years ago
Judging from the article, if you parsed "if 3", you'd get back out

    (prefix if 3)
instead of an error, since `if` is just a regular prefix operator. So, presumably there's some sort of well-formedness checking that goes on after parsing? Does anyone know more? The website says very little.
[+] wslh | 11 years ago
I just found this "old" article while looking for an easy-to-use parser in C++ (in Visual Studio, GCC, and Clang). I think the current state of parsers in C++ is... how to say it... terrible! flex and bison don't look like C++, Boost Spirit is too much ado, ANTLR4 doesn't support it yet, and setting up ANTLR3 in Visual Studio could be explained in a few steps if anyone explained it in a straightforward way.

When you look at the simplicity of OMeta[1] you feel the difference. But I am not aware of any (production ready) OMeta implementation for C++.

[1] http://tinlizzie.org/ometa/

[+] stormbrew | 11 years ago
If you don't mind incredibly long compile times for any moderately complex grammar, PEGTL is pretty straightforward compared to other alternatives.
[+] parrt | 11 years ago
I should add Adaptive LL(*), ALL(*), of ANTLR 4 to the mix here. It handles any grammar you give it and generates a correct parser, with one small caveat: no indirect left recursion. It's the culmination of 25 years of focused effort to take LL-based parsing to the limit of power while maintaining simplicity and efficiency. See the tool shootout in the OOPSLA '14 paper I just presented: http://www.antlr.org/papers/allstar-techreport.pdf. Until we change the requirements of a parser generator, I'm done. :)
[+] logicalshift | 11 years ago
My day job used to involve parsing some fairly ill-formed languages, so I developed this: https://github.com/Logicalshift/TameParse

I noticed with the languages I was working on, the problems could be resolved by being smarter with the lookahead: this parser allows for context-free lookahead matching to resolve (or detect and defer) ambiguities.

That makes it possible to do neat things like parse C snippets without full type information or deal with keywords that aren't always keywords (eg, await in C#).

[+] jblow | 11 years ago
It is nice to see someone summarizing this kind of information. However, this is really a continuation of the academic attitude toward parsers that makes them MUCH harder than they have to be.

If you want to study grammars in an abstract sense, then think of them this way, and that's fine. If you want to build a parser for a programming language, don't use any of this stuff. Just write code to parse the language in a straightforward way. You'll get a lot more done and the resulting system will be much nicer for you and your users.
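For the flavor of what "just write the code" means in practice, here is a toy hand-written recursive-descent reader for s-expressions -- one function per construct, no parser theory required (my own sketch, not jblow's code):

```python
# Hand-rolled recursive-descent reader for nested s-expressions like
# "(a (b c) d)": a trivial tokenizer, then one recursive function that
# handles both atoms and parenthesized lists.

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))   # recurse for each element
        tokens.pop(0)                     # consume the closing ")"
        return items
    return tok                            # bare atom

print(parse(tokenize("(a (b c) d)")))     # ['a', ['b', 'c'], 'd']
```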

[+] chriswarbo | 11 years ago
> Just write code to parse the language in a straightforward way.

This approach is why many consider parsing to be a solved problem, so it's certainly a valid approach. However, it's not the only valid approach.

For example, "straightforward" parsers often give terrible error messages: when the intended branch (eg. if/then/else) fails, the parser will backtrack and try a more general alternative (eg. a function call). Not only does this give an incorrect error (eg. "no such function 'esle'"), but it might actually succeed! In which case, the parser will be in the wrong state to parse the following text, and gives a nonsensical message (eg. "unexpected '('" several lines later).

This is an important problem, since these messages can only be deciphered by those who know enough about the syntax to avoid hitting them very often! Inexperienced users will see error messages over and over again, and have no idea that they're being asked to fix non-existent errors in incorrect positions.
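A toy demonstration of that backtracking failure mode (hypothetical token-level parsers, matching the 'esle' example above): the typo'd input doesn't produce an if-expression error; the general fallback "succeeds" and the real mistake is hidden.

```python
# Ordered-choice parsing over whitespace-split tokens: try the
# specific if-expression rule first, then fall back to a general
# "function call" rule. A typo ("esle") makes the fallback succeed,
# silently parsing a call to a function named "if".

def parse_if(tokens):
    # expects exactly: if COND then X else Y
    if tokens[:1] == ["if"] and "then" in tokens and "else" in tokens:
        return ("if", tokens[1], tokens[3], tokens[5])
    raise SyntaxError("not an if-expression")

def parse_call(tokens):
    # the more general alternative: NAME ARG...
    return ("call", tokens[0], tokens[1:])

def parse_expr(tokens):
    try:
        return parse_if(tokens)
    except SyntaxError:
        return parse_call(tokens)   # backtrack and try the fallback

print(parse_expr("if c then x else y".split()))
# ('if', 'c', 'x', 'y')
print(parse_expr("if c then x esle y".split()))
# ('call', 'if', ['c', 'then', 'x', 'esle', 'y'])  -- typo swallowed
```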

[+] wodenokoto | 11 years ago
That's horrible advice. Some problems are really complex and can't be solved just by working from A to B.

Your parser will probably not end up being able to handle the problems outlined in the article unless you take that theory into account before starting to program.

[+] lvh | 11 years ago
Which tools in particular do you feel are acceptable? "Just writing the code to parse the language" is almost definitional of shotgun parsers, so it sounds like you are advocating an ad-hoc parser (probably something recursive descent-ish, or perhaps with even less structure than that), but I wouldn't want to put words in your mouth.

ADDENDUM: In particular, the article mentions PEGs. Not in an unambiguously positive light, but nonetheless: do you count PEGs as being a useless overly academic tool?

The LANGSEC project has done (in my opinion) a pretty good job arguing that this behavior causes many security issues in software. Perhaps only to some extent for programming languages in general, but in particular for data format languages and network protocols. The issue is slightly mitigated for programming languages because they typically don't have many wildly varying parsers; but (programming) languages without clearly specified grammars certainly have had problems with this in the past.

Are you familiar with their work? Do you just disagree with their conclusions?

[LANGSEC]: http://langsec.org

[+] l_dopa | 11 years ago
This approach might be why several ubiquitous languages have needlessly ambiguous grammars. Decades of writing tools to be bug-compatible with the original implementation is much harder than learning a tiny bit of theory.
[+] sshine | 11 years ago
I've wondered why most existing LALR(1) parser generators have not been replaced with LR parser generators with a look-ahead higher than 1. Our computational power today no longer justifies these restrictions, so this improvement would be Pareto-optimal even while it is being decided which completely alternative strategies (PEGs, boolean grammars, parser combinators, etc.) will take over.
[+] wslh | 11 years ago
ANTLR is LL(*).
[+] jeffreyrogers | 11 years ago
This is a really interesting post. The same author recently posted another article that discusses some of the ideas from the conclusion of this post. You can find that article here: http://tratt.net/laurie/blog/entries/an_editor_for_composed_...
[+] dkarapetyan | 11 years ago
A lot of the stuff he talks about in that post falls under the more generic terms "projectional editors" and "programming language workbenches". JetBrains MPS is one of the tools among many that help with building such editors. Here's a good talk by Markus Völter on using tools like JetBrains MPS.
[+] jdnier | 11 years ago
Instaparse, a parsing library for Clojure, tries to make context-free grammars "as easy to use as regular expressions." It uses an EBNF input format that should be familiar to many programmers. https://github.com/Engelberg/instaparse