top | item 19244950

(no title)

mekaj | 7 years ago

By default I’m skeptical of hand-written parsers, especially if they’re implemented in a language lacking memory safety features. I have seen too many parsers fall victim to (domain specific) fuzzing, and the consequences are serious security vulnerabilities.

Parser combinators are great, but they’re still hand-written in excessively powerful languages. That’s error-prone if the goal is to end up with a trustworthy, portable parser. The most frequently used parser combinator tools I know of are embedded DSLs, which means porting such a parser to another language requires manually converting to another parser combinator’s syntax and semantics and potentially implementing a new DSL which hasn’t yet been battle tested.

Parser generators which are in a position to get battle tested by supporting many parser implementations and integrations seem like the safest and most scalable way to go. I’m not familiar with Semantic Designs’ GLR implementation; thanks for mentioning it. Another worth knowing is tree-sitter (https://tree-sitter.github.io/tree-sitter/). As a GLR-based parser generator it supports ambiguous context-free grammars. At first ambiguity may sound like a downside, but without a computer-verified proof that your grammar doesn’t admit multiple valid parses for any inputs it’s safest to assume ambiguity and proceed accordingly. For example, if at parse-time multiple parses are found you have the option to warn the user or even reject their input with a diff of the ASTs so they can try to patch the input so it’s no longer ambiguous.

Tree-sitter is young, but it has reach. Atom editor packages for languages are encouraged to implement tree-sitter parsers. It’s particularly well-suited for editors because it supports incremental parsing. Parsing the same file from scratch after every edit isn’t necessary.

Recently I implemented a tree-sitter grammar for ABNF (https://github.com/jmitchell/tree-sitter-abnf) and an Atom package (https://atom.io/packages/language-abnf) for it. Now I’m exploring how to generate tree-sitter parsers for grammars specified using ABNF, such as Dhall’s (https://github.com/dhall-lang/dhall-lang/blob/master/standar...). My plan is to automatically generate and fuzz C-friendly parser libraries for many languages to make them more accessible. I also want to help address any tree-sitter bugs discovered along the way that could translate into security vulnerabilities in generated parsers. For now this is a labor of love, and I am not affiliated with the tree-sitter project.

discuss

order

No comments yet.