(no title)
madgar | 9 years ago
His favored approach had one artificial limitation, perhaps a remnant of the age: he capped the size of source files, in bytes, at some power of 2. This let him represent each token (including whitespace, newlines, and comments) as a pair of fixed-size integers indexing into the file's bytes; a cap of 2^16 bytes, for example, means every byte offset fits in a 16-bit integer. The tokenized file is then an array of those pairs, and the concrete syntax tree's leaf nodes are indices into that token array.
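A minimal sketch of what that might look like in C, assuming the 2^16-byte cap; the names (Token, TokenArray, CstNode) and layout are my own illustration, not from the original approach:

    #include <stdint.h>

    /* A token is a half-open [start, end) byte range into the file,
       covering every byte -- whitespace, newlines, and comments
       included. With files capped at 2^16 bytes, both offsets fit in
       fixed-size 16-bit integers. */
    typedef struct {
        uint16_t start;  /* offset of the token's first byte */
        uint16_t end;    /* offset one past its last byte */
    } Token;

    /* The tokenized file: just an array of those pairs. */
    typedef struct {
        Token   *tokens;
        uint32_t count;
    } TokenArray;

    /* A CST node is either an interior node with children or a leaf
       holding an index into the token array. */
    typedef struct CstNode {
        uint32_t kind;                    /* grammar production or token kind */
        uint32_t child_count;             /* 0 marks a leaf */
        union {
            struct CstNode **children;    /* interior node */
            uint32_t         token_index; /* leaf: index into TokenArray */
        };
    } CstNode;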
This is suitable for syntax-directed code generation, control-flow graph construction, static analysis, linting, and pretty-printing. It's super memory-compact and even has a hard upper bound on its memory footprint. The caveat is that the source language does need to support composing a "module" out of multiple files, because of the cap on source file size.
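To make the pretty-printing claim concrete, here's a sketch (same assumptions and hypothetical types as above) of lossless source reconstruction: because the tokens cover every byte of the file, an in-order walk of the CST's leaves reproduces the original text exactly.

    #include <stdint.h>
    #include <stdio.h>

    /* Types repeated from the sketch above for completeness. */
    typedef struct { uint16_t start, end; } Token;
    typedef struct { Token *tokens; uint32_t count; } TokenArray;
    typedef struct CstNode {
        uint32_t kind;
        uint32_t child_count;             /* 0 marks a leaf */
        union {
            struct CstNode **children;
            uint32_t         token_index;
        };
    } CstNode;

    /* In-order walk of the leaves: writing each token's byte range
       in order reproduces the source file verbatim, since the tokens
       retain whitespace and comments. */
    void emit(const CstNode *node, const TokenArray *toks,
              const char *src, FILE *out) {
        if (node->child_count == 0) {
            Token t = toks->tokens[node->token_index];
            fwrite(src + t.start, 1, (size_t)(t.end - t.start), out);
        } else {
            for (uint32_t i = 0; i < node->child_count; i++)
                emit(node->children[i], toks, src, out);
        }
    }

The upper bound on footprint also falls out directly: with a 2^16-byte cap, the worst case is one token per byte, so the token array never exceeds 2^16 entries at 4 bytes each, i.e. 256 KiB.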
chubot | 9 years ago
Do you have a reference for this? I saw this problem mentioned in the write-up on the ZINC Abstract Machine by Xavier Leroy (the author of OCaml), but I don't know of other papers that discuss it.
Also, I believe that most open-source tools do NOT have this functionality. Look at lib2to3: it's bolted on, not exactly a clean design. Most open-source front ends are not designed for tooling the way Clang is.
tacostakohashi | 9 years ago
madgar | 9 years ago
Edit: I'm personally not surprised that open-source codebases don't employ this technique. Lots of great PL work was done at private companies until the 90s, and while much of it was published in papers and books, precious few open-source PL communities historically drew from academia. I'm sure you know the counterexamples.