Aycock & Horspool came up with a 'practical' method for implementing Earley parsing (conversion to a state-machine) that has pretty humorously good performance delta over "naive" Earley, and is still reasonable to implement. Joop Leo figured out how to get the worst-case of Earley parsing down to either O(n) (left-recursive, non-ambiguous) or O(n^2) (right-recursive, non-ambiguous). That means the Earley algorithm is only O(n^3) on right-recursive, ambiguous grammars; and, if you're doing that, you're holding your language wrong.
A somewhat breathless description of all of this is in the Marpa parser documentation:
https://jeffreykegler.github.io/Marpa-web-site/
In practice, I've found that computers are so fast, that with just the Joop Leo optimizations, 'naive' Earley parsing is Good Enoughâ„¢:
An extremely layman answer is that most interesting innovation in parsing in relatively modern times has happened seems to be in the context of IDE's. I.e. incremental, high-performance parsing to support syntax highlighting, refactoring, etc. etc.
Actually the most important step of parsers (as even non-incremental, slow (or better: not fast) parsers are fast enough) is error recovery (error resilience) from syntax errors (mostly half written or half deleted code).
What is time consuming is e.g. type-checking. Semantic checking in general, like exhaustiveness checks of pattern matches, syntax checking is fast.
Not sure, but I at least am certainly aware of possibilities that such writeups exclude.
In particular, you can do (a subset of) the following in sequence:
* write your own grammar in whatever bespoke language you want
* compose those grammars into a single grammar
* generate a Bison grammar from that grammar
* run `bison --xml` instead of actually generating code
* read the XML file and implement your own (trivial) runtime so you can easily handle ownership issues
In particular, I am vehemently opposed to the idea of implementing parsers separately using some non-proven tool/theory, since that way leads to subtle grammar incompatibilities later.
I'm not super familiar with the space, but tree-sitter seems to take an interesting approach in that they are an incremental parser. So instead of re-parsing the entire document on change, it only parses the affected text, thereby making it much more efficient for text editors.
I don't know if that's specific to tree-sitter though, I'm sure there are other incremental parsers. I have to say that I've tried ANTLR and tree-sitter, and I absolutely love tree-sitter. It's a joy to work with.
In my experience incremental parsing doesn't really make much sense. Non-incremental parsing can easily parse huge documents in milliseconds.
Also Tree Sitter only does half the parsing job - you get a tree on nodes, but you have to do your own parse of that tree to get useful structures out.
I feel that most of the time the two options are presented as either write a handwritten parser or use a parser generator. A nice third way is to write a custom parser generator for the language you wish to parse. Handwritten parsers do tend to get unwieldy and general purpose parser generators can have inscrutable behavior for any specific language.
Because the grammar for a parser generator is usually much simpler than most general purpose programming languages, it is typically relatively straightforward to handwrite a parser for it.
thechao|2 years ago
A somewhat breathless description of all of this is in the Marpa parser documentation:
In practice, I've found that computers are so fast, that with just the Joop Leo optimizations, 'naive' Earley parsing is Good Enoughâ„¢:marcusf|2 years ago
(I may be talking out of my ass here.)
ReleaseCandidat|2 years ago
o11c|2 years ago
In particular, you can do (a subset of) the following in sequence:
* write your own grammar in whatever bespoke language you want
* compose those grammars into a single grammar
* generate a Bison grammar from that grammar
* run `bison --xml` instead of actually generating code
* read the XML file and implement your own (trivial) runtime so you can easily handle ownership issues
In particular, I am vehemently opposed to the idea of implementing parsers separately using some non-proven tool/theory, since that way leads to subtle grammar incompatibilities later.
sse|2 years ago
https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/
https://drops.dagstuhl.de/storage/00lipics/lipics-vol166-eco...
danielvaughn|2 years ago
I don't know if that's specific to tree-sitter though, I'm sure there are other incremental parsers. I have to say that I've tried ANTLR and tree-sitter, and I absolutely love tree-sitter. It's a joy to work with.
IshKebab|2 years ago
Also Tree Sitter only does half the parsing job - you get a tree on nodes, but you have to do your own parse of that tree to get useful structures out.
I prefer Chumsky or Nom which go all the way.
ReleaseCandidat|2 years ago
No, it isn't. And incremental parsing is older than 2011 too (like at least the 70s).
For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066
troupo|2 years ago
davidkellis|2 years ago
bsder|2 years ago
norir|2 years ago
Because the grammar for a parser generator is usually much simpler than most general purpose programming languages, it is typically relatively straightforward to handwrite a parser for it.
dboreham|2 years ago