(no title)
badgar | 12 years ago
The token stream is an array of 32 bit integers, each of which is the token type bitmasked onto an index into the input file of the start of the token. If you need the token text, you reparse. Caches can be implemented as small hash maps from token index to cached value.
The canonical AST is the CFG parse tree with fixups to convert recursion to children of a node type for a variable number of children. It is stored as an array of integers. The node is a sequence of integers, with one integer for the root followed by each child in sequence. Each internal node's value is the index of the CFG rule evaluated to produce the node, and the children of the node correspond to the CFG rule's right hand side (minus keywords). Terminals are stored as integer indexes into the token stream. Nonterminals are the integer index of the child internal node.
Bill has been a big name in compilers for the better part of 5 decades now, and he said he's been using this pattern for almost as long. It's ridiculously fast, which is why he used it decades ago for DEC.
No comments yet.