top | item 18680236

(no title)

morazow | 7 years ago

Off-topic.

Does anyone know how I can convert AST into stream of bytecodes? Are there any good example language implementations to learn?

discuss

carapace|7 years ago

A little tangential, but check out "Prolog as Description and Implementation Language in Computer Science Teaching"by Henning Christiansen

http://www.ep.liu.se/ecp/012/004/ecp012004.pdf

> ... Definitional interpreters, compilers, and other models of computation are defined in a systematic way as Prolog programs, and as a result, formal descriptions become running prototypes that can be tested and modified ... These programs can be extended in straightforward ways into tools such as analyzers, tracers and debuggers.

Also "Logic Programming and Compiler Writing" By David Warren (and the work that followed on.)

chubot|7 years ago

If you want to play with the algorithms at a high level [1], there are two Python AST -> bytecode implementations I've seen, and one that I'm using.

(1) The 'compiler' module from Python 2. I'm using this to build my shell project Oil, so it works.

See Building Oil with the OPy Bytecode Compiler: http://www.oilshell.org/blog/2018/03/04.html . The heavily refactored source is in the opy/compiler2 directory of the repo on Github.

(This module was removed in Python 3 due to its redundancy with the C implementation. But there's no conceptual difference between Python 2 and 3. Some small details have changed.)

(2) tailbiter, which I mentioned here: http://www.oilshell.org/blog/2018/03/27.html

They are written in very different styles. Tailbiter reads like Lisp code, which may or may not be what you want.

The generated bytecode is different in the sense that the values it operates on aren't just ints and floats (which hardware deals with), but they are dynamically typed objects. But this doesn't change the overall algorithm to generate bytecode from an AST.

[1] which I recommend over doing it in C or C++, because ASTs are somewhat annoying in those languages

chrisseaton|7 years ago

Walk the tree post-order and emit an instruction for each node.

stevekemp|7 years ago

https://compilerbook.com/ is a good read ..

kazinator|7 years ago

I cranked out a compiler, assembler and VM for TXR Lisp earlier this year.

The code is very clean, and fairly suitable for study purposes. It is not commented, though.

http://www.kylheku.com/cgit/txr/tree/share/txr/stdlib/compil...

http://www.kylheku.com/cgit/txr/tree/share/txr/stdlib/asm.tl

http://www.kylheku.com/cgit/txr/tree/vm.c

The assembler accepts programs in a kind of Lisp syntax. It allows for symbolic labels and performs all the backpatching. The binary instruction format is written out with the help of the buf data type and some ffi-related functions for reading and writing binary types to and from a buffer.

Because the assembly code a Lisp data structure, the compiler can emit the instruction templates using backquote templates. Pieces of code can be catenated with append since they are just a list and so on.

The compiler comp-if method compiles (if ...) forms. The most general case with all three arguments (if expr then else) uses this code template:

   ^(,*te-frag.code
     (if ,te-frag.oreg ,lelse)
     ,*th-frag.code
     ,*(maybe-mov oreg th-frag.oreg)
     (jmp ,lskip)
     ,lelse
     ,*el-frag.code
     ,*(maybe-mov oreg el-frag.oreg)
     ,lskip)

The first element ,[star]te-frag.code splices in the code part of the fragment obtained from compiling the test expression. That code has to be run first. Then we emit the (if ...) instruction which tests the te-frag.oreg: the output register where the te-frag.code has stored the value. If the test is successful, the instructions continue below, otherwise a branch takes place to the else part. The else part is identified by lelse which holds the label: the ,lelse backquote syntax inserts that label into the template. So after the (if ...) instruction we splice the th-frag.code: the "then" fragment. After that we jump past the else part to whatever follows via (jmp ,lskip): jump to the skip label. Before this, we potentially insert a mov instruction that may be needed. We are expected to produce the result value of the (if ...) expression in an output register held in oreg. But the th-frag.code puts the result into th-frag.oreg: its own output register. Now those two may be the same. If they are not the same register, then a move is needed: the maybe-mov function produces the code when the registers are different, or else an empty list.

In action:

  This is the TXR Lisp interactive listener of TXR 203.
  Quit with :quit or Ctrl-D on empty line. Ctrl-X ? for cheatsheet.
  1> (compile-toplevel '(if x y z))
  ** warning: (expr-1:1) unbound variable x
  ** warning: (expr-1:1) unbound variable y
  ** warning: (expr-1:1) unbound variable z
  #<sys:vm-desc: 9739370>
  2> (disassemble *1)
  data:
  syms:
      0: x
      1: y
      2: z
  code:
      0: A0020000 getlx t002 0
      1: 48000005 if t002 5
      2: 00000002
      3: A0020001 getlx t002 1
      4: 44000006 jmp 6
      5: A0020002 getlx t002 2
      6: 10000002 end t002
  instruction count:
      6
  #<sys:vm-desc: 9739370>

Here, getlx t002 0 means look up the lexical variable named by the symbol at position [0] in the syms table, in other words x. Put the value into register t002. Then if t002 5 means, if t002 is true (non-nil), then continue, else branch to the instruction at offset 5.

end t002 means that the VM is done executing and its result value is in t002.

We can intercept the call to the assembler asm method:

  3> (trace (meth sys:assembler sys:asm))
  nil
  4> (compile-toplevel '(if x y z))
  ** warning: (expr-4:1) unbound variable x
  ** warning: (expr-4:1) unbound variable y
  ** warning: (expr-4:1) unbound variable z
  ((meth sys:assembler
     sys:asm) (#S(sys:assembler buf #b''
                                bstr #<buf-stream b767e90c> sys:max-treg 0 sys:labdef #H(())
                                sys:labref #H(()))
              ((sys:getlx (t 2) 0) (if (t 2) #:l0019) (sys:getlx (t 2) 1) (sys:jmp #:l0020)
               #:l0019 (sys:getlx (t 2) 2) #:l0020 (end (t 2))))
  nil)

#<sys:vm-desc: 972a210>

Here the code looks like:

  ((sys:getlx (t 2) 0)
   (if (t 2) #:l0019)
   (sys:getlx (t 2) 1)
   (sys:jmp #:l0020)
   #:l0019
   (sys:getlx (t 2) 2)
   #:l0020
   (end (t 2)))

There is a (t 2) syntax for the t002 register, labels are uninterned symbols like #:l0019.

The assembler just works with that. It has an object for each opcode which knows how to encode it into the instruction buffer, plus logic for backpatching labels. When a forward jump is assembled, the label is added to a list of places needing backpatches along with a function to do it; later when the label is defined, the matching places are patched with the now known offset. The assembler contains very little stuff: buf is the buffer holding the output code (initially empty so it shows up as #b''). There is a bstr which is a stream over that buffer so we can do file-like I/O on the buffer. labref and labdef are hash tables for the label backpatching, and max-treg keeps track of the maximum T register number seen. The VM description emitted by the assembler will record this. When the VM is run, it just allocate only as many T registers on the stack as the code requires.