robertelder | 2 years ago
If I recall correctly, the original historical reason for separating them was due to memory limitations on early mainframe computers. Actual source code would be fed into the 'lexer' and a stream of 'tokens' would come out the other end and then be saved to a file, which would then be fed into the next stage.
Having said this, you can ask: "In practice, is there currently an advantage to separating the lexer and the parser with the tools we have now?" The answer is 'yes', but I claim that this is just a limitation of today's tools. Tokens are usually described with simple regular expressions, whereas the parsing rules are context free, so the worst-case complexity of recognizing the two is not the same. If you pull tokens into the parser and treat them as naive single-character parse items, you end up doing more work than you would have otherwise, since the parser will try all kinds of unlikely and impossible chopped-up token combinations. The other big issue is memory: turning every token into a (small) parse subtree will increase memory usage 10-20x (depending on the length of your tokens).
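To make the point concrete, here is a minimal lexer sketch (the token names and grammar are my own invented examples, not from any particular tool): the token kinds are plain regular expressions, so the whole input can be scanned in one linear pass, and the parser afterwards only ever sees whole tokens.

```python
import re

# Token kinds as simple regular expressions -- recognizable in a
# single linear scan, no context-free machinery needed.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    """Return a list of (kind, text) tokens, dropping whitespace."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(lex("x = 42 + y1"))
# → [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y1')]
```

The parser then works over 5 tokens instead of 11 characters, so it never wastes time exploring "chopped up" splits such as treating the '4' and '2' of '42' as separate items.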
Personally, I think there is a substantial need for more advanced ways of describing language grammars. The ideal model would be to have various kinds of annotations or modifiers that you could attach to grammar rules, and then the parser generator would do all sorts of optimizations based on the constraints that these annotations imply. Technically, we already do this: it's called putting the lexer in one file and the parser in another file. There is no good reason why we can't specify both in a more unified grammar model and let the parser generator figure out how to handle each rule as efficiently as possible.
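What such a unified model might look like, sketched in a purely hypothetical notation (the `@regular` annotation and rule names are invented for illustration): the annotation tells the generator a rule is regular, so it can compile that rule down to a DFA and treat its matches as atomic tokens, while the remaining rules go through the context-free machinery.

```
/* Hypothetical unified grammar -- one file for both "lexing" and
   "parsing". @regular marks rules the generator may compile to a
   DFA and match as atomic tokens. */
@regular number     = [0-9]+ ;
@regular identifier = [a-zA-Z_][a-zA-Z0-9_]* ;

expression = expression "+" term | term ;
term       = number | identifier ;
```

In effect the lexer/parser split we do by hand today becomes an optimization decision the generator makes for us.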
The computer science theory behind grammars seems to have peaked in the 1980s, and there don't seem to have been many new innovations that have made it into daily programming life since then.