samhw|4 years ago
If I have any advice, it's:
- Don't try to write a scannerless parser. It's a fool's errand. Lex first into tokens (e.g. LEFT_PAREN, KEYWORD(for), STRING_LITERAL("foo"), NUM_LITERAL(7), etc.), then parse that.
- By the same, uh, token: keep everything modular and loosely coupled. Don't mix up parsing state into your lexer (unless you're lexing C or Python, famously), don't mix up lexing code into your interpreter, etc.
- For parsing, start with a simple grammar. Parsing mathematical expressions is a classic example. Avoid complex grammars that require lots of backtracking or lookahead. If you want a real language, Go is a good example - a large part of its famous compilation speed is due to its simplicity (due to the fact that its authors are old men who live in a counterfactual version of the 90s imagined in the 70s).
- YMMV, but I find the best approach to parsing is a packrat-inspired bottom-up parser. Iterate over the tokens, and for each token filter your list of rules ('productions') to those which match. 'Reduce' the simpler expressions as you go, and build the more complex expressions out of them (e.g. functions will typically have several statements/expressions, expressions several operations, etc.).
- For the compilation step, unless you specifically want to learn about writing object code, target an existing compiler backend such as LLVM (by emitting its IR) or GCC. You'll benefit from their optimisations, and the vast number of platforms they support.
- Choose a language you're familiar with - ideally a simple one - to write it in. You don't want to be learning a new language as you're doing this, trust me.
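
To make the lex-then-parse split concrete, here's a minimal sketch in Python for the classic arithmetic-expression grammar. Everything in it is an assumption for illustration: the token kinds beyond LEFT_PAREN and NUM_LITERAL, the names `lex`, `parse` and `evaluate`, and the tuple-shaped AST are all made up. The parser is a plain operator-precedence shift-reduce loop (shunting-yard style), which is one simple bottom-up technique, not the packrat-inspired rule filtering described above. It doesn't handle unary minus.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str  # e.g. "NUM_LITERAL", "OP", "LEFT_PAREN", "RIGHT_PAREN"
    text: str

# Hypothetical token spec for a toy arithmetic language.
TOKEN_SPEC = [
    ("NUM_LITERAL", r"\d+(?:\.\d+)?"),
    ("OP", r"[+\-*/]"),
    ("LEFT_PAREN", r"\("),
    ("RIGHT_PAREN", r"\)"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    """Turn source text into a flat list of tokens. No parsing state in here."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace is matched but dropped
            tokens.append(Token(match.lastgroup, match.group()))
        pos = match.end()
    return tokens

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens):
    """Shift tokens onto stacks; reduce (op, left, right) nodes as soon as possible."""
    operands, operators = [], []

    def reduce_top():
        if len(operands) < 2:
            raise SyntaxError("operator is missing an operand")
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for tok in tokens:
        if tok.kind == "NUM_LITERAL":
            operands.append(float(tok.text))
        elif tok.kind == "OP":
            # Reduce anything on the stack that binds at least as tightly
            # (the >= makes binary operators left-associative).
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[tok.text]):
                reduce_top()
            operators.append(tok.text)
        elif tok.kind == "LEFT_PAREN":
            operators.append("(")
        elif tok.kind == "RIGHT_PAREN":
            while operators and operators[-1] != "(":
                reduce_top()
            if not operators:
                raise SyntaxError("unbalanced ')'")
            operators.pop()  # discard the matching "("

    while operators:
        if operators[-1] == "(":
            raise SyntaxError("unbalanced '('")
        reduce_top()
    if len(operands) != 1:
        raise SyntaxError("malformed expression")
    return operands[0]

def evaluate(node):
    """Walk the AST. This only sees tree nodes, never tokens or source text."""
    if isinstance(node, float):
        return node
    op, left, right = node
    lhs, rhs = evaluate(left), evaluate(right)
    if op == "+":
        return lhs + rhs
    if op == "-":
        return lhs - rhs
    if op == "*":
        return lhs * rhs
    return lhs / rhs

print(evaluate(parse(lex("1 + 2 * 3"))))  # 7.0
```

The point of the layering is that each stage only knows about the one before it: `lex` sees characters, `parse` sees tokens, `evaluate` sees tree nodes. You could swap `evaluate` for something that emits IR without touching the other two.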