I'm delighted to see this get some attention here.
I absolutely love these techniques for parsing, but as my primary research area is static analysis, I haven't had time to revise this paper and resubmit.
As it stands, I may never get the time to do so. :(
I posted it on arXiv so that David could reference it for his Ph.D. school apps.
Since some of you have asked, here are the reviews:
http://matt.might.net/papers/reviews/esop2010-derivatives.tx...
I do have an updated implementation that's much cleaner and faster, and I've been planning to write it up in a blog post. (Alas, no opportunity yet.)
David's also done another implementation in Haskell that's screaming fast and efficient on many restricted classes of grammars (like LL(k)). I'll encourage him to post that as well.
If you're interested in getting your name on a scientific publication and helping this work get the attention of the scientific community, you can help us by implementing either technique in your favorite language and beating on it to find the inefficiencies.
(For instance, the original Scala version linked from this paper has memory leaks from the way it caches derivatives. We got around them by rolling top-level repetition by hand.)
Please email me if that's something you're interested in doing.
"this is not quantum theory, after all..." is one of the funnier comments I've seen on a review. Judging by the rest of it, I'm tempted to guess that it was written by someone from team-PLT :)
Could either you or David put the source for the LL(k) version up somewhere? Comparing it head-to-head with parsec (or its faster cousins) would be something fun to do.
I haven't finished the paper yet, but this type of technique excites me as well. Recently, I did an assignment on the paper "Memoization in Top-Down Parsing" by Mark Johnson. Have you seen this paper, and do you think that there are any similarities between the derivatives of a CFG and continuations of parsing procedures (for variables and their related productions)?
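As I understand Johnson's trick, it's to memoize continuation-passing parsers: the memo table records, per start position, the end positions found so far and the continuations waiting on them, so a left-recursive call at an already-active position just registers a continuation instead of recursing forever. Here's a rough, self-contained Scala sketch of that idea (a recognizer only; the names and simplifications are mine, not Johnson's code):

```scala
import scala.collection.mutable

type K = Int => Unit          // continuation: receives an end position
type P = (Int, K) => Unit     // parser: start position plus continuation

def term(c: Char, input: String): P =
  (i, k) => if (i < input.length && input(i) == c) k(i + 1)

def seq(a: => P, b: => P): P =
  (i, k) => a(i, j => b(j, k))

def alt(a: => P, b: => P): P =
  (i, k) => { a(i, k); b(i, k) }

def memo(p: => P): P = {
  val results = mutable.Map.empty[Int, mutable.Set[Int]]
  val conts   = mutable.Map.empty[Int, mutable.ListBuffer[K]]
  (i, k) => {
    if (conts.contains(i)) {
      conts(i) += k                    // park this continuation
      results(i).toList.foreach(k)     // replay results already found
    } else {
      conts(i) = mutable.ListBuffer(k)
      results(i) = mutable.Set.empty[Int]
      // each *new* end position notifies every parked continuation
      p(i, j => if (results(i).add(j)) conts(i).toList.foreach(_(j)))
    }
  }
}

def parses(g: P, n: Int): Boolean = {
  var ok = false
  g(0, j => if (j == n) ok = true)
  ok
}

// Left-recursive grammar: E -> E '+' E | 'x'
val input = "x+x"
lazy val E: P = memo(alt(seq(E, seq(term('+', input), E)), term('x', input)))
```

Because the recursive call to E at position 0 merely registers a continuation, this terminates on the left-recursive grammar instead of looping.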
I haven't studied either approach well, but they seem similar in that both do a breadth-first traversal of the parser combinators instead of the traditional depth-first (backtracking) traversal. That is, when you have a union A ∪ B, you process it by advancing both sides in tandem, D_c A ∪ D_c B, instead of the traditional approach of first computing all results of A on the entire input and then computing all results of B on the entire input.
Parallel Parsing Processes don't seem to handle left recursion. On the other hand, they parse S-expressions in linear time without a specially coded repetition operation ;)
When I find time I'm going to try to do an F# implementation.
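The tandem-advance reading of unions is easiest to see with plain Brzozowski derivatives for regular languages (no CFG, laziness, or fixpoint machinery yet). A minimal Scala sketch of that classic construction, mine rather than the paper's code:

```scala
// Regular languages as an AST.
sealed trait Lang
case object Empty extends Lang                   // rejects everything
case object Eps extends Lang                     // accepts only ""
case class Chr(c: Char) extends Lang             // a single character
case class Union(a: Lang, b: Lang) extends Lang  // A ∪ B
case class Cat(a: Lang, b: Lang) extends Lang    // A · B
case class Star(a: Lang) extends Lang            // A*

// Does the language accept the empty string?
def nullable(l: Lang): Boolean = l match {
  case Empty | Chr(_) => false
  case Eps | Star(_)  => true
  case Union(a, b)    => nullable(a) || nullable(b)
  case Cat(a, b)      => nullable(a) && nullable(b)
}

// The derivative with respect to c: the language of suffixes after c.
def derive(c: Char, l: Lang): Lang = l match {
  case Empty | Eps => Empty
  case Chr(d)      => if (c == d) Eps else Empty
  case Union(a, b) => Union(derive(c, a), derive(c, b))  // both sides advance in tandem
  case Cat(a, b)   =>
    if (nullable(a)) Union(Cat(derive(c, a), b), derive(c, b))
    else Cat(derive(c, a), b)
  case Star(a)     => Cat(derive(c, a), Star(a))
}

// Match by deriving once per input character, then checking nullability.
def matches(l: Lang, s: String): Boolean =
  nullable(s.foldLeft(l)((lang, c) => derive(c, lang)))
```

The Union case is exactly D_c(A ∪ B) = D_c(A) ∪ D_c(B); the paper's contribution is making this work for context-free grammars, where the recursion is cyclic and needs laziness, memoization, and fixpoints.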
Here's what a grammar actually looks like in their Scala version. This grammar is for arithmetic over the language where x represents 1 and s represents +. It only generates the parse tree, not the final answer. Comments are added by me:
// The Nodes in the parse tree
abstract class Exp
case object One extends Exp
case class Sum(e1 : Exp, e2 : Exp) extends Exp
// Terminals
lazy val S : Parser[Char,Char] = new EqT[Char] ('s')
lazy val X : Parser[Char,Char] = new EqT[Char] ('x')
// Definition of an expression
// I'm pretty sure all the asInstanceOfs are avoidable/unnecessary.
lazy val EXP : Parser[Char,Exp] =
  rule(X)             ==> { case x  => One.asInstanceOf[Exp] } ||
  rule(EXP ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(e1,e2).asInstanceOf[Exp] } ||
  rule(EXP ~ S ~ X)   ==> { case e1 ~ s ~ e2 => Sum(e1,One).asInstanceOf[Exp] } ||
  rule(X ~ S ~ EXP)   ==> { case e1 ~ s ~ e2 => Sum(One,e2).asInstanceOf[Exp] } ||
  rule(X)             ==> { case x  => One.asInstanceOf[Exp] } ||
  rule(EXP)           ==> { case e  => e } ||
  rule(Epsilon[Char]) ==> { case () => One }
// Actually run the rule
val xin = Stream.fromIterator("xsxsxsxsx".elements)
EXP.parseFull(xin)
// return value => Stream(Sum(One,Sum(Sum(One,One),Sum(One,One))), ?)
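The snippet stops at the parse tree ("not the final answer"); folding that tree down to a number is a one-liner. Standalone Scala, re-declaring Exp so it compiles on its own, independent of the parser library:

```scala
// The same parse-tree nodes as above.
sealed abstract class Exp
case object One extends Exp
case class Sum(e1: Exp, e2: Exp) extends Exp

// Fold the tree to its arithmetic value.
def eval(e: Exp): Int = e match {
  case One        => 1
  case Sum(a, b)  => eval(a) + eval(b)
}
```

For the first result in the stream above, eval(Sum(One, Sum(Sum(One, One), Sum(One, One)))) gives 5, matching the five x's in "xsxsxsxsx".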
When I, as an amateur, last looked into it, PEG[1] and extensions like OMeta[2] seemed to be the best options. I've heard good things about Parsec[3], too.
I am not so sure about the actual state of the art (I have read some PEG papers and some OMeta stuff from VPRI, plus used ANTLR, Bison, and Coco/R), but a very interesting comment on HN (http://news.ycombinator.com/item?id=1643715) points out that most production compilers use hand-written recursive-descent parsers, primarily for practical reasons.
I took two compiler construction courses in college: one used an LL(1) grammar with a recursive-descent parser, the other an LR(1) grammar with the flex+bison toolchain. I found the experience from the LL case helped me tremendously when writing the LR-language compiler, and I think recursive-descent experience also helps in understanding attribute grammars. Probably the best introduction to recursive-descent parsers is Niklaus Wirth's compiler construction book (http://www-old.oberon.ethz.ch/WirthPubl/CBEAll.pdf).
There was a really cool parsing paper at POPL this year [1] that goes 'beyond CFGs'... but as far as I know their system YAKKER is still in development / unreleased.
I think from a user's perspective, e.g. someone wanting to design a DSL, it would be nice to be able to write down arbitrary CFGs and not necessarily have to know lots of technical parsing details ("what are all these shift-reduce conflicts!?"). Maybe PEGs aren't the best choice for this (no left recursion)-- MetaBorg's SGLR is probably state-of-the-art then like Zef said.
Though one of the nice things about PEGs is no ambiguity; MetaBorg will do type-based disambiguation but only after parsing finishes. I'm working on an undergraduate thesis on 'extensible syntax' right now, trying to jump off from some of their stuff (but using a variation on Earley parsing) in particular to see if we could use types to help disambiguate incrementally during parsing.
Fascinating. I don't understand it all yet - I'm reading it carefully but it'll be weeks before I really get it.
However ...
I'm getting the feeling that this properly captures an intuition I've had about parsing for some time: that there should be a way of parsing the entire text, with the parse settling onto the whole text all at the same time. I don't know if this is what the paper is saying, but that's the sense I get from it.
Note that the post has 'prerequisites' at the beginning, which you will need to read, but they are actually pretty cool.
Also I am not saying this is exactly what you mean, I'm just suggesting it as a possible connection.
This sort of thing is one of the admittedly rare exceptions where computer science is actually making surprising amounts of progress in relatively practical fields. Parsing has gotten noticeably easier in the past ten years, if you know where to look for the right libraries, and it has been affecting my programming quite a bit. Often a parser is the "correct" solution, but we used to reach for hacked-up crap with regular expressions or worse, because it was ten times easier and did 80% of the job (and ignore the 10% that is a serious security vulnerability, since everybody always does). Now it's maybe twice as hard; or, given how easy it is to underestimate the difficulty of getting the hacked-up crap to actually work everywhere you need it to in the real world, sometimes it's just flat-out easier to do it correctly once you make a full accounting of the costs.
I have since rewritten the Haskell implementation to compute fixed points on cyclic graphs without using pointers or monads. Check it out if it interests you (git repo):
http://david.darais.com/git/research/der-parser-3/
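For comparison, the textbook way to compute one such fixed point, which nonterminals of a cyclic (here left-recursive) grammar are nullable, is plain Kleene iteration. A small Scala sketch of that standard algorithm (not the repo's graph-based approach); an alternative is a list of symbols, and an empty alternative stands for epsilon:

```scala
// Iterate "nt is nullable if some alternative is all-nullable" to a fixed point.
def nullableSet(prods: Map[String, List[List[String]]]): Set[String] = {
  var known = Set.empty[String]   // nonterminals known to be nullable
  var changed = true
  while (changed) {               // repeat until nothing new is learned
    changed = false
    for ((nt, alts) <- prods)
      if (!known(nt) && alts.exists(alt => alt.forall(known))) {
        known += nt               // terminals are never in `known`, so they block an alternative
        changed = true
      }
  }
  known
}
```

For EXP -> x | EXP s EXP this answers "not nullable" and terminates, which is exactly the kind of question that diverges if you compute it by naive recursion on the cyclic structure.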
Possibly because they hand-waved away a lot of the formal specification of their solution. For example, there was no attempt to analyze the complexity of the generated rule trees.
[+] [-] mattmight|15 years ago|reply
http://matt.might.net/papers/reviews/esop2010-derivatives.tx...
Can HN do science? I'd love to find out.
[+] [-] copper|15 years ago|reply
[+] [-] davdar|15 years ago|reply
http://david.darais.com/git/research/der-parser-3/
[+] [-] k4st|15 years ago|reply
[+] [-] jules|15 years ago|reply
http://www.cse.chalmers.se/edu/course/afp/Papers/parser-clae...
[+] [-] timtadh|15 years ago|reply
[+] [-] fizx|15 years ago|reply
[+] [-] johkra|15 years ago|reply
[1](http://en.wikipedia.org/wiki/Parsing_expression_grammar) [2](http://www.tinlizzie.org/ometa/) [3](http://legacy.cs.uu.nl/daan/parsec.html)
[+] [-] sb|15 years ago|reply
[+] [-] dill_day|15 years ago|reply
[1] http://portal.acm.org/citation.cfm?id=1706347
[+] [-] Zef|15 years ago|reply
[+] [-] DanielRibeiro|15 years ago|reply
Despite all of these, adoption of better techniques is really slow. As usual.
[+] [-] sliverstorm|15 years ago|reply
[+] [-] RiderOfGiraffes|15 years ago|reply
But even if it isn't, it looks intriguing.
[+] [-] jerf|15 years ago|reply
[+] [-] antimatter15|15 years ago|reply
[+] [-] davdar|15 years ago|reply
http://david.darais.com/git/research/der-parser-3/
[+] [-] benblack|15 years ago|reply
[+] [-] amichail|15 years ago|reply
[+] [-] stupidsignup|15 years ago|reply
and the response to these reviews here: http://phlegmaticprogrammer.wordpress.com/2010/11/21/respons...
[+] [-] fizx|15 years ago|reply
[+] [-] mahmud|15 years ago|reply
[+] [-] herdrick|15 years ago|reply
[+] [-] kleiba|15 years ago|reply