I'm delighted to see this get some attention here.
I absolutely love these techniques for parsing, but as my primary research area is static analysis, I haven't had time to revise this paper and resubmit.
As it stands, I may never get the time to do so. :(
I posted it on arXiv so that David could reference it for his Ph.D. school apps.
Since some of you have asked, here are the reviews:
http://matt.might.net/papers/reviews/esop2010-derivatives.tx...
I do have an updated implementation that's much cleaner and faster, and I've been planning to write it up in a blog post. (Alas, no opportunity yet.)
David's also done another implementation in Haskell that's screaming fast and efficient on many restricted classes of grammars (like LL(k)). I'll encourage him to post that as well.
If you're interested in getting your name on a scientific publication and helping this work get the attention of the scientific community, you can help us by implementing either technique in your favorite language and beating on it to find the inefficiencies.
(For instance, the original Scala version linked from this paper has memory leaks from the way it caches derivatives. We got around them by rolling top-level repetition by hand.)
Please email me if that's something you're interested in doing.
"this is not quantum theory, after all..." is one of the funnier comments I've seen on a review. Judging by the rest of it, I'm tempted to guess that it was written by someone from team-PLT :)
Could either you or David put the source for the LL(k) version up somewhere? Comparing it head-to-head with parsec (or its faster cousins) would be something fun to do.
I haven't finished the paper yet, but this type of technique excites me as well. Recently, I did an assignment on the paper "Memoization in Top-Down Parsing" by Mark Johnson. Have you seen this paper, and do you think that there are any similarities between the derivatives of a CFG and continuations of parsing procedures (for variables and their related productions)?
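As I understand Johnson's trick, it's to memoize continuation-passing parsers: the memo table records, per start position, the end positions found so far and the continuations waiting on them, so a left-recursive call at an already-active position just registers a continuation instead of recursing forever. Here's a rough, self-contained Scala sketch of that idea (a recognizer only; the names and simplifications are mine, not Johnson's code):

```scala
import scala.collection.mutable

type K = Int => Unit          // continuation: receives an end position
type P = (Int, K) => Unit     // parser: start position plus continuation

def term(c: Char, input: String): P =
  (i, k) => if (i < input.length && input(i) == c) k(i + 1)

def seq(a: => P, b: => P): P =
  (i, k) => a(i, j => b(j, k))

def alt(a: => P, b: => P): P =
  (i, k) => { a(i, k); b(i, k) }

def memo(p: => P): P = {
  val results = mutable.Map.empty[Int, mutable.Set[Int]]
  val conts   = mutable.Map.empty[Int, mutable.ListBuffer[K]]
  (i, k) => {
    if (conts.contains(i)) {
      conts(i) += k                    // park this continuation
      results(i).toList.foreach(k)     // replay results already found
    } else {
      conts(i) = mutable.ListBuffer(k)
      results(i) = mutable.Set.empty[Int]
      // each *new* end position notifies every parked continuation
      p(i, j => if (results(i).add(j)) conts(i).toList.foreach(_(j)))
    }
  }
}

def parses(g: P, n: Int): Boolean = {
  var ok = false
  g(0, j => if (j == n) ok = true)
  ok
}

// Left-recursive grammar: E -> E '+' E | 'x'
val input = "x+x"
lazy val E: P = memo(alt(seq(E, seq(term('+', input), E)), term('x', input)))
```

Because the recursive call to E at position 0 merely registers a continuation, this terminates on the left-recursive grammar instead of looping.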
I haven't studied either approach well, but they seem similar in that both do a breadth-first traversal of the parser combinators instead of the traditional depth-first (backtracking) traversal. That is, when you have a union A ∪ B, you process it by advancing both sides in tandem, D_c A ∪ D_c B, instead of the traditional approach of first computing all results of A on the entire input and then computing all results of B on the entire input.
Parallel Parsing Processes don't seem to handle left recursion. On the other hand, they parse S-expressions in linear time without a specially coded repetition operation ;)
When I find time I'm going to try to do an F# implementation.
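The tandem-advance reading of unions is easiest to see with plain Brzozowski derivatives for regular languages (no CFG, laziness, or fixpoint machinery yet). A minimal Scala sketch of that classic construction, mine rather than the paper's code:

```scala
// Regular languages as an AST.
sealed trait Lang
case object Empty extends Lang                   // rejects everything
case object Eps extends Lang                     // accepts only ""
case class Chr(c: Char) extends Lang             // a single character
case class Union(a: Lang, b: Lang) extends Lang  // A ∪ B
case class Cat(a: Lang, b: Lang) extends Lang    // A · B
case class Star(a: Lang) extends Lang            // A*

// Does the language accept the empty string?
def nullable(l: Lang): Boolean = l match {
  case Empty | Chr(_) => false
  case Eps | Star(_)  => true
  case Union(a, b)    => nullable(a) || nullable(b)
  case Cat(a, b)      => nullable(a) && nullable(b)
}

// The derivative with respect to c: the language of suffixes after c.
def derive(c: Char, l: Lang): Lang = l match {
  case Empty | Eps => Empty
  case Chr(d)      => if (c == d) Eps else Empty
  case Union(a, b) => Union(derive(c, a), derive(c, b))  // both sides advance in tandem
  case Cat(a, b)   =>
    if (nullable(a)) Union(Cat(derive(c, a), b), derive(c, b))
    else Cat(derive(c, a), b)
  case Star(a)     => Cat(derive(c, a), Star(a))
}

// Match by deriving once per input character, then checking nullability.
def matches(l: Lang, s: String): Boolean =
  nullable(s.foldLeft(l)((lang, c) => derive(c, lang)))
```

The Union case is exactly D_c(A ∪ B) = D_c(A) ∪ D_c(B); the paper's contribution is making this work for context-free grammars, where the recursion is cyclic and needs laziness, memoization, and fixpoints.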
Here's what a grammar actually looks like in their Scala version. This grammar is for arithmetic over the language where x represents 1 and s represents +. It only generates the parse tree, not the final answer. Comments are added by me:
// The Nodes in the parse tree
abstract class Exp
case object One extends Exp
case class Sum(e1 : Exp, e2 : Exp) extends Exp
// Terminals
lazy val S : Parser[Char,Char] = new EqT[Char] ('s')
lazy val X : Parser[Char,Char] = new EqT[Char] ('x')
// Definition of an expression
// I'm pretty sure all the asInstanceOfs are avoidable/unnecessary.
lazy val EXP : Parser[Char,Exp] =
  rule(X)             ==> { case x  => One.asInstanceOf[Exp] } ||
  rule(EXP ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(e1,e2).asInstanceOf[Exp] } ||
  rule(EXP ~ S ~ X)   ==> { case e1 ~ s ~ e2 => Sum(e1,One).asInstanceOf[Exp] } ||
  rule(X ~ S ~ EXP)   ==> { case e1 ~ s ~ e2 => Sum(One,e2).asInstanceOf[Exp] } ||
  rule(X)             ==> { case x  => One.asInstanceOf[Exp] } ||
  rule(EXP)           ==> { case e  => e } ||
  rule(Epsilon[Char]) ==> { case () => One }
// Actually run the rule
val xin = Stream.fromIterator("xsxsxsxsx".elements)
EXP.parseFull(xin)
// return value => Stream(Sum(One,Sum(Sum(One,One),Sum(One,One))), ?)
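The snippet stops at the parse tree ("not the final answer"); folding that tree down to a number is a one-liner. Standalone Scala, re-declaring Exp so it compiles on its own, independent of the parser library:

```scala
// The same parse-tree nodes as above.
sealed abstract class Exp
case object One extends Exp
case class Sum(e1: Exp, e2: Exp) extends Exp

// Fold the tree to its arithmetic value.
def eval(e: Exp): Int = e match {
  case One        => 1
  case Sum(a, b)  => eval(a) + eval(b)
}
```

For the first result in the stream above, eval(Sum(One, Sum(Sum(One, One), Sum(One, One)))) gives 5, matching the five x's in "xsxsxsxsx".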
When I, as an amateur, last looked into it, PEG[1] and extensions like OMeta[2] seemed to be the best options. I've heard good things about Parsec[3], too.
I am not so sure about the actual state of the art (I have read some PEG papers and some OMeta stuff from VPRI, plus used ANTLR, Bison, and Coco/R), but a very interesting comment on HN (http://news.ycombinator.com/item?id=1643715) points out that most production compilers use hand-written recursive-descent parsers, primarily for practical reasons.
I took two compiler construction courses in college: one used an LL(1) grammar with a recursive-descent parser, the other an LR(1) grammar with the flex+bison toolchain. I found the experience from the LL case helped me tremendously when writing the LR-language compiler, and I think recursive-descent experience also helps in understanding attribute grammars. Probably the best introduction to recursive-descent parsers is Niklaus Wirth's compiler construction book (http://www-old.oberon.ethz.ch/WirthPubl/CBEAll.pdf).
There was a really cool parsing paper at POPL this year [1] that goes 'beyond CFGs'... but as far as I know their system YAKKER is still in development / unreleased.
I think from a user's perspective, e.g. someone wanting to design a DSL, it would be nice to be able to write down arbitrary CFGs and not necessarily have to know lots of technical parsing details ("what are all these shift-reduce conflicts!?"). Maybe PEGs aren't the best choice for this (no left recursion)-- MetaBorg's SGLR is probably state-of-the-art then like Zef said.
Though one of the nice things about PEGs is no ambiguity; MetaBorg will do type-based disambiguation but only after parsing finishes. I'm working on an undergraduate thesis on 'extensible syntax' right now, trying to jump off from some of their stuff (but using a variation on Earley parsing) in particular to see if we could use types to help disambiguate incrementally during parsing.
Fascinating. I don't understand it all yet - I'm reading it carefully but it'll be weeks before I really get it.
However ...
I'm getting the feeling that this properly captures an intuition I've had about parsing for some time: that there should be a way of parsing the entire text, with the parse settling onto the whole text all at the same time. I don't know if this is what the paper is saying, but that's the sense I get from it.
Note that the post has 'prerequisites' at the beginning, which you will need to read, but they are actually pretty cool.
Also I am not saying this is exactly what you mean, I'm just suggesting it as a possible connection.
This sort of thing is one of the admittedly rare exceptions where computer science is actually making surprising amounts of progress in relatively practical fields. Parsing has gotten noticeably easier in the past ten years, if you know where to look for the right libraries, and it has been affecting my programming quite a bit. Often a parser is the "correct" solution, but we used to reach for hacked-up crap with regular expressions or worse, because it was ten times easier and did 80% of the job (and ignore the 10% that is a serious security vulnerability, since everybody always does). Now it's maybe twice as hard; or, given how easy it is to underestimate the difficulty of getting the hacked-up crap to actually work everywhere you need it to in the real world, sometimes it's just flat-out easier to do it correctly once you make a full accounting of the costs.
I have since rewritten the Haskell implementation to compute fixed points on cyclic graphs without using pointers or monads. Check it out if it interests you (git repo):
http://david.darais.com/git/research/der-parser-3/
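For comparison, the textbook way to compute one such fixed point, which nonterminals of a cyclic (here left-recursive) grammar are nullable, is plain Kleene iteration. A small Scala sketch of that standard algorithm (not the repo's graph-based approach); an alternative is a list of symbols, and an empty alternative stands for epsilon:

```scala
// Iterate "nt is nullable if some alternative is all-nullable" to a fixed point.
def nullableSet(prods: Map[String, List[List[String]]]): Set[String] = {
  var known = Set.empty[String]   // nonterminals known to be nullable
  var changed = true
  while (changed) {               // repeat until nothing new is learned
    changed = false
    for ((nt, alts) <- prods)
      if (!known(nt) && alts.exists(alt => alt.forall(known))) {
        known += nt               // terminals are never in `known`, so they block an alternative
        changed = true
      }
  }
  known
}
```

For EXP -> x | EXP s EXP this answers "not nullable" and terminates, which is exactly the kind of question that diverges if you compute it by naive recursion on the cyclic structure.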
Possibly because they hand-waved away a lot of the formal specification of their solution. For example, there was no attempt to analyze the complexity of the generated rule trees.
[+] [-] mattmight|15 years ago|reply
http://matt.might.net/papers/reviews/esop2010-derivatives.tx...
Can HN do science? I'd love to find out.
[+] [-] copper|15 years ago|reply
[+] [-] davdar|15 years ago|reply
http://david.darais.com/git/research/der-parser-3/
[+] [-] k4st|15 years ago|reply
[+] [-] jules|15 years ago|reply
http://www.cse.chalmers.se/edu/course/afp/Papers/parser-clae...
[+] [-] timtadh|15 years ago|reply
[+] [-] fizx|15 years ago|reply
[+] [-] johkra|15 years ago|reply
[1](http://en.wikipedia.org/wiki/Parsing_expression_grammar) [2](http://www.tinlizzie.org/ometa/) [3](http://legacy.cs.uu.nl/daan/parsec.html)
[+] [-] sb|15 years ago|reply
[+] [-] dill_day|15 years ago|reply
[1] http://portal.acm.org/citation.cfm?id=1706347
[+] [-] Zef|15 years ago|reply
[+] [-] DanielRibeiro|15 years ago|reply
Despite all of these, adoption of better techniques is really slow. As usual.
[+] [-] sliverstorm|15 years ago|reply
[+] [-] RiderOfGiraffes|15 years ago|reply
But even if it isn't, it looks intriguing.
[+] [-] jerf|15 years ago|reply
[+] [-] antimatter15|15 years ago|reply
[+] [-] davdar|15 years ago|reply
http://david.darais.com/git/research/der-parser-3/
[+] [-] benblack|15 years ago|reply
[+] [-] amichail|15 years ago|reply
[+] [-] stupidsignup|15 years ago|reply
and the response to these reviews here: http://phlegmaticprogrammer.wordpress.com/2010/11/21/respons...
[+] [-] fizx|15 years ago|reply
[+] [-] mahmud|15 years ago|reply
[+] [-] herdrick|15 years ago|reply
[+] [-] kleiba|15 years ago|reply