Consider this example for dispatching instructions from the article:
while (running)
{
    uint16_t op = mem_read(reg[R_PC]++) >> 12;
    switch (op)
    {
        case OP_ADD:
            /* ADD implementation */
            break;
        case OP_AND:
            /* AND implementation */
            break;
        case OP_NOT:
            /* NOT implementation */
            break;
        case OP_BR:
            /* BR implementation */
            break;
        ...
    }
}
That code has a lot of branching. The switch statement has to jump to the corresponding case, the break statement branches to the bottom, and then there is a third branch to get back to the top of the while loop. Three branches just to hit one instruction.
Now there is only one branch per instruction. The handler for each instruction directly jumps to the next location via the goto. There is no need to be in an explicit loop because the interpreter runs until it hits a halting instruction.
Many VMs now use this technique, including the canonical Ruby and Python interpreters.
For the longest time, I swore by computed goto as well. But it has its share of problems: it's not very portable, it forces the code into a rigid, non-extensible format, and it's not as efficient as commonly assumed. I'm far from the first person to notice [0], so don't shoot the messenger.
My latest project [1] simply calls into a struct via a function pointer for each instruction, with a twist: each operation returns a pointer to the next one, and the program is stopped with longjmp, which means I can get away without lookups and without a loop condition. As an added bonus, the code is much nicer to deal with and more extensible, since the set of operations isn't hardcoded any more.
Comparing language performance is tricky business, but it mostly runs faster than Python3 from my tests so far. My point is that computed goto is not the end of the story.
I'll just add that there are many different ways to write a VM; the one posted here is very low level, the one linked below reasonably high level. Using physical CPUs as inspiration is all good, but there's nothing gained from pretending in itself unless you're compiling to native code at some point.
[0] http://www.emulators.com/docs/nx25_nostradamus.htm

[1] https://gitlab.com/sifoo/snigl
> That code has a lot of branching. The switch statement has to jump to the corresponding case, the break statement branches to the bottom, and then there is a third branch to get back to the top of the while loop. Three branches just to hit one instruction.
That's a bit unfair. Not all branches are equal. Only the instruction fetch branch is going to be often mispredicted. Predicted branches, like that while loop, aren't that expensive. Mispredicted branches cost 10-20x more.
Of course, fewer branches and less code in general is better.
One big issue with writing interpreters in C/C++ is that the compiler's register allocation usually can't follow the data flow, so it keeps unnecessarily loading and storing the same common variables from/to memory.
Interpreters also need to be careful not to exceed the 32 kB L1 code cache limit.
All this means that to write a truly efficient interpreter, you'll need to do it in assembler.
The step after that is to write a simple JIT that does away with data dependent (= VM instruction) branches altogether.
Then you'll notice you don't need to update some VM registers every time, but can coalesce for example program counter updates to certain points.
Eventually you'll find you have a full fledged JIT compiler doing instruction scheduling and register allocation, etc.
Been down that rabbit hole, except for the last step. That's where it becomes a true challenge.
The LuaJIT (http://luajit.org/) project followed through all the way, and studying it is a great resource for anyone interested in the topic. Kudos to Mike Pall.
I like to go one step further and use a first pass to convert bytecodes to a list of labels. Use a pointer in this list as the program counter. Then the dispatch becomes as simple as:
goto *pc++;
Here's an example of the technique for a Brainfuck VM implemented in Ethereum VM assembly (a recent competition entry of mine): https://g.solidity.cc/submissions/wicketh.eth/3bb28842856230...
This technique also makes it easy to do some peephole optimizations and combine multiple bytecodes into a single label. The above example also does a lot of that.
Brainfuck is actually a great language to build a VM for. You have a simple interpreter up in no time, but you can go really far in optimizing it. LC3 looks comparatively messy (more state, more instructions, etc.), but it's more representative of real CPUs.
I googled and found a test[1]. Running this test with a few modifications (like removing unused variables and using PRIu64 instead of %llu) and disregarding the last two prints, I get:
switch = 5589926 goto = 5752079 goto_opt = 5618938

with -O0, and

switch = 2234105 goto = 2013742 goto_opt = 2016200

with -O2.

EDIT:

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516

[1] https://gist.github.com/mmozeiko/7cc858985b57df30eebbf8c6e75...
It'll be even faster without indirecting through dispatch_table at runtime. Exporting your label addresses without introducing a conditional initialization block is tricky but doable. For C++ you can (IIRC) smuggle dispatch_table out of the routine using a constructor on a statically-scoped object. For C code I've successfully used GCC's nested function and __attribute__((constructor, used)) extensions, though that was with 4.7 (may not work with current GCC).
I mostly use computed gotos for implementing poor-man's coroutines in C, where the addresses only need to be visible within the routine. For VMs I use the obvious method (i.e. dispatch_table) to keep the code sane, but it does incur a non-negligible performance cost, which may matter in some contexts.
> the break statement branches to the bottom, and then there is a third branch to get back to the top of the while loop
That suggests your C compiler isn't doing jump threading: replacing the jump to an unconditional jump instruction with a direct jump to that instruction's destination.
That's a very basic optimization I might expect even from a C compiler written in 1980.
Thanks! I was not familiar with this technique. What do you think about the C++ table of function pointers towards the end? I imagine that would be slower with the additional level of indirection, but I haven't profiled or examined the assembly.
I once implemented a VM in Ada and just used a large switch. I extensively benchmarked it and it was blazingly fast at -O3.
However, it was only fast when I used packages very sparingly. In contrast to the usual advice given in the Ada community, splitting the implementation into several packages slowed down the main loop tremendously. I suspect this wouldn't happen with whole-program optimization in C, but I believe the version of GCC I was using didn't support that for Ada. Also, my green threads were slower than a single thread, no matter which tricks I tried.
It's an abandoned project now, since the accompanying assembler was hacked together in Racket and at some point I simply lost track of what was going on where :O
https://stackoverflow.com/a/2548425
I tested this recently (Delphi XE6), and it definitely is as-described: you get straight jump instructions with enough case statement branches, and it is very fast.
Barry Kelly worked on the Delphi compiler, and I believe he comments here on Hacker News occasionally.
I love these kinds of tutorials, and really any tutorial that makes me think of stuff that was previously "magic" and makes me feel stupid for not previously understanding it after I read it (I honestly mean that in a positive way).
This will be a fun weekend project for me... I have had an idea for a lambda-calculus-based VM that I've wanted to build for a few months now, and I think this will be a good start for me to understand it.
I like this tutorial, but it should be mentioned that before implementing a virtual machine one should understand that there are many computing models and many alternative mechanisms within each of them. Implementing a VM for sequential program execution is relatively easy. What is more (conceptually) difficult is concurrency, asynchronous processes, etc.
Author here. Definitely agree. The intent was to write the simplest possible VM that could run real programs. VMs are obviously a deep field of research and there are plenty more projects to learn from.
That's one of the main reasons I implemented my VM in Go.
It's fairly rudimentary (a map from instructions to function pointers), but I can quickly duplicate a subtree of code and execute it within a different goroutine, wrap that section of code in an instruction that sets the output to a channel, and have another goroutine select the different channels from different goroutines.
At each node within the tree of code, I store the variables in a map instead of using a stack-based system; if the data is not in the current node, I look down the tree until I find it.
I agree it's a great exercise. Years ago I added a very simple stack-based VM to my raytracer to allow procedural textures & normal maps. The scene script would e.g. contain a material description like this:
i.e. an expression with 3-vectors and scalars and a few basic functions (noise, sin/cos, etc.). This can be easily "compiled" (during script parsing) for execution on the VM. Then the overhead during actual raytracing was quite small.
> ... Definitional interpreters, compilers, and other models of computation are defined in a systematic way as Prolog programs, and as a result, formal descriptions become running prototypes that can be tested and modified ... These programs can be extended in straightforward ways into tools such as analyzers, tracers and debuggers.
Also "Logic Programming and Compiler Writing" By David Warren (and the work that followed on.)
If you want to play with the algorithms at a high level [1], there are two Python AST -> bytecode implementations I've seen, and one that I'm using.
(1) The 'compiler' module from Python 2. I'm using this to build my shell project Oil, so it works.
See Building Oil with the OPy Bytecode Compiler: http://www.oilshell.org/blog/2018/03/04.html . The heavily refactored source is in the opy/compiler2 directory of the repo on Github.
(This module was removed in Python 3 due to its redundancy with the C implementation. But there's no conceptual difference between Python 2 and 3. Some small details have changed.)
(2) tailbiter, which I mentioned here: http://www.oilshell.org/blog/2018/03/27.html
They are written in very different styles. Tailbiter reads like Lisp code, which may or may not be what you want.
The generated bytecode is different in the sense that the values it operates on aren't just ints and floats (which hardware deals with), but they are dynamically typed objects. But this doesn't change the overall algorithm to generate bytecode from an AST.
[1] which I recommend over doing it in C or C++, because ASTs are somewhat annoying in those languages
The assembler accepts programs in a kind of Lisp syntax. It allows for symbolic labels and performs all the backpatching. The binary instruction format is written out with the help of the buf data type and some ffi-related functions for reading and writing binary types to and from a buffer.
Because the assembly code is a Lisp data structure, the compiler can emit the instruction templates using backquote templates. Pieces of code can be concatenated with append since they are just lists, and so on.
The compiler comp-if method compiles (if ...) forms. The most general case with all three arguments (if expr then else) uses this code template:
The first element ,*te-frag.code splices in the code part of the fragment obtained from compiling the test expression. That code has to be run first. Then we emit the (if ...) instruction which tests te-frag.oreg: the output register where the te-frag.code has stored the value. If the test is successful, the instructions continue below; otherwise a branch takes place to the else part. The else part is identified by lelse, which holds the label: the ,lelse backquote syntax inserts that label into the template.

So after the (if ...) instruction we splice in th-frag.code: the "then" fragment. After that we jump past the else part to whatever follows via (jmp ,lskip): jump to the skip label. Before this, we potentially insert a mov instruction that may be needed. We are expected to produce the result value of the (if ...) expression in an output register held in oreg. But th-frag.code puts the result into th-frag.oreg: its own output register. Now those two may be the same. If they are not the same register, then a move is needed: the maybe-mov function produces the code when the registers are different, or else an empty list.
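The template itself is missing here; judging purely from the description, it plausibly looks something like the following. This is a reconstruction, not the actual TXR source, and el-frag is my guessed name for the else fragment:

```
^(,*te-frag.code
  (if ,te-frag.oreg ,lelse)
  ,*th-frag.code
  ,*(maybe-mov oreg th-frag.oreg)
  (jmp ,lskip)
  ,lelse
  ,*el-frag.code
  ,*(maybe-mov oreg el-frag.oreg)
  ,lskip)
```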
In action:
This is the TXR Lisp interactive listener of TXR 203.
Quit with :quit or Ctrl-D on empty line. Ctrl-X ? for cheatsheet.
1> (compile-toplevel '(if x y z))
** warning: (expr-1:1) unbound variable x
** warning: (expr-1:1) unbound variable y
** warning: (expr-1:1) unbound variable z
#<sys:vm-desc: 9739370>
2> (disassemble *1)
data:
syms:
0: x
1: y
2: z
code:
0: A0020000 getlx t002 0
1: 48000005 if t002 5
2: 00000002
3: A0020001 getlx t002 1
4: 44000006 jmp 6
5: A0020002 getlx t002 2
6: 10000002 end t002
instruction count:
6
#<sys:vm-desc: 9739370>
Here, getlx t002 0 means look up the lexical variable named by the symbol at position [0] in the syms table, in other words x. Put the value into register t002. Then if t002 5 means, if t002 is true (non-nil), then continue, else branch to the instruction at offset 5.
end t002 means that the VM is done executing and its result value is in t002.
We can intercept the call to the assembler asm method:
There is a (t 2) syntax for the t002 register; labels are uninterned symbols like #:l0019.
The assembler just works with that. It has an object for each opcode which knows how to encode it into the instruction buffer, plus logic for backpatching labels. When a forward jump is assembled, the label is added to a list of places needing backpatches along with a function to do it; later when the label is defined, the matching places are patched with the now known offset.
The assembler contains very little stuff: buf is the buffer holding the output code (initially empty, so it shows up as #b''). There is a bstr, which is a stream over that buffer so we can do file-like I/O on it. labref and labdef are hash tables for the label backpatching, and max-treg keeps track of the maximum T register number seen. The VM description emitted by the assembler records this; when the VM is run, it allocates only as many T registers on the stack as the code requires.
chrisaycock|7 years ago
https://eli.thegreenplace.net/2012/07/12/computed-goto-for-e...
sifoobar|7 years ago
vardump|7 years ago
remcob|7 years ago
sebcat|7 years ago
wahern|7 years ago
sgillen|7 years ago
kazinator|7 years ago
justinmeiners|7 years ago
emilfihlman|7 years ago
zeusk|7 years ago
DSingularity|7 years ago
jonathanstrange|7 years ago
henrikeh|7 years ago
Optimizing your project: https://www.pegasoft.ca/resources/boblap/7.html
GNAT: Alphabetical list of all switches: https://gcc.gnu.org/onlinedocs/gcc-8.2.0/gnat_ugn/Alphabetic...
TimJYoung|7 years ago
tombert|7 years ago
techno_modus|7 years ago
JustSomeNobody|7 years ago
Know any good resources for these?
justinmeiners|7 years ago
dana321|7 years ago
gattr|7 years ago
all2|7 years ago
If you're comfortable sharing, that is.
delinka|7 years ago
morazow|7 years ago
Does anyone know how I can convert AST into stream of bytecodes? Are there any good example language implementations to learn?
carapace|7 years ago
http://www.ep.liu.se/ecp/012/004/ecp012004.pdf
chubot|7 years ago
chrisseaton|7 years ago
stevekemp|7 years ago
kazinator|7 years ago
The code is very clean, and fairly suitable for study purposes. It is not commented, though.
http://www.kylheku.com/cgit/txr/tree/share/txr/stdlib/compil...
http://www.kylheku.com/cgit/txr/tree/share/txr/stdlib/asm.tl
http://www.kylheku.com/cgit/txr/tree/vm.c
wener|7 years ago
Writing a VM is very fun and addictive. Implementing one in a different language especially can teach you a lot!
elgfare|7 years ago
ngcc_hk|7 years ago
unknown|7 years ago
[deleted]