> Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code
WASM compiled to a novel bytecode format aimed at efficient interpretation.
> Commonly occurring sequences of operations can also be optimized into a "fused" operation.
Peephole optimizations producing fused opcodes, makes sense.
> In M3/Wasm, the stack machine model is translated into a more direct and efficient "register file" approach
WASM translated to register-based bytecode. That's awesome!
> Since operations all have a standardized signature and arguments are tail-call passed through to the next, the M3 "virtual" machine registers end up mapping directly to real CPU registers.
Sure this is neat stuff, but I don't think any of it is novel. Bochs is a good source for some bytecode vm performance wizardry [1], even if the bytecode in question is the x86 ISA.
Regardless, kudos to the authors and nice to see a fast wasm interpreter done well.
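The "fused operation" idea quoted above is essentially a peephole pass over the interpreter's bytecode. A minimal sketch, with an invented toy opcode set (not wasm3's actual encoding): the common pair `CONST k; ADD` is rewritten into a single `ADD_CONST k`, saving one dispatch per occurrence at runtime.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy opcode set: OP_CONST takes one immediate byte. */
enum { OP_CONST, OP_ADD, OP_ADD_CONST, OP_END };

/* Peephole pass: rewrite "CONST k; ADD" into the fused "ADD_CONST k". */
size_t fuse(const uint8_t *in, size_t n, uint8_t *out) {
    size_t i = 0, o = 0;
    while (i < n) {
        if (in[i] == OP_CONST && i + 2 < n && in[i + 2] == OP_ADD) {
            out[o++] = OP_ADD_CONST;    /* fused opcode */
            out[o++] = in[i + 1];       /* keep the immediate k */
            i += 3;
        } else if (in[i] == OP_CONST && i + 1 < n) {
            out[o++] = in[i++];         /* unfused opcode... */
            out[o++] = in[i++];         /* ...and its immediate */
        } else {
            out[o++] = in[i++];
        }
    }
    return o;   /* length of the fused program */
}
```

Fusion also hands the C compiler a bigger basic block per handler, so it can optimize across what used to be two dispatches.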
IR getting converted into an interpreter-oriented bytecode is pretty common. Mono does it for its interpreter and IIRC, Spidermonkey has historically done that as well. I'm not sure if V8 has ever interpreted from an IR (maybe now?) but you could view their original 'baseline JIT' model as converting into an interpreter-focused IR, where the IR just happened to be extremely unoptimized x86 assembly.
Translating the stack machine into registers was always a core part of the model but it's interesting to me that even interpreters are doing it. The necessity of doing coloring to assign registers efficiently is kind of unfortunate, I feel like the WASM compiler would have been the right place to do this offline.
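A toy illustration of that stack-to-register translation (invented encoding, and deliberately no coloring: each result just gets a fresh virtual register, which is roughly why the point above about doing allocation offline matters). The translator simulates the operand stack at compile time, so each stack slot becomes a register name and the stack disappears from the runtime entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stack ops; real wasm has many more. */
enum { S_LOCAL_GET, S_ADD, S_END };

typedef struct { uint8_t op, dst, a, b; } RegInsn;  /* three-address form */

/* Locals occupy registers 0..nlocals-1; temporaries are allocated above. */
size_t to_registers(const uint8_t *in, size_t n, uint8_t nlocals, RegInsn *out) {
    uint8_t stack[64];          /* compile-time operand stack: holds register names */
    int sp = 0;
    uint8_t next_reg = nlocals;
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        switch (in[i]) {
        case S_LOCAL_GET:       /* push the local's register, emit no code */
            stack[sp++] = in[i + 1];
            i += 2;
            break;
        case S_ADD: {           /* pop two register names, emit one add */
            uint8_t b = stack[--sp], a = stack[--sp];
            uint8_t d = next_reg++;
            out[o++] = (RegInsn){S_ADD, d, a, b};
            stack[sp++] = d;
            i += 1;
            break;
        }
        case S_END:
            return o;
        }
    }
    return o;
}
```

So `local.get 0; local.get 1; i32.add` becomes the single instruction `r2 = add r0, r1`, with no runtime pushes or pops.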
> "WASM translated to register-based bytecode. That's awesome!"
If the hardware executing this code is "stack-based" (or does not offer enough general-purpose registers to accommodate the function call), this will need to be converted back to a stack-based function call, either at runtime or beforehand. Wouldn't this intermediate WASM-to-register-based-bytecode translation be redundant then?
Any JIT needs an interpreter to run code sections that are not compiled (yet), and large sections (I don't have any statistics, but I would guess a very sizable majority) of the code are 'cold' and will never get compiled at all. Compiling bytecode to machine code is a relatively expensive operation in itself, so it only makes sense to do it for 'hot' sections (loops, functions that get called many times, etc).
Besides that, it can help with startup time. If the script that needs to be executed is a one-off thing, the removal of compilation time might save more time than a JIT could save during execution.
From what I know, it's because a JIT uses a lot more memory than an interpreter. Also, LuaJIT uses very little memory, and that's why people have their minds blown all the time.
> Because operations end with a call to the next function, the C compiler will tail-call optimize most operations.
It appears that this relies on tail-call optimization to avoid overflowing the stack. Unfortunately this means you probably can't run it in debug mode.
It's not that bad even in debug mode (or without TCO). Just not optimal. Also, there is a way to rework this part, so it does not rely on compiler TCO.
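One common rework (not necessarily what wasm3 does) is a trampoline: each handler returns to a small driver loop instead of tail-calling the next handler, so stack depth stays constant even in unoptimized debug builds where the compiler performs no TCO. A sketch with an invented op set:

```c
#include <stdint.h>

typedef struct Vm Vm;
typedef int (*TOp)(Vm *vm);            /* handler returns 0 to stop */
struct Vm { const TOp *pc; int64_t r0, r1; };

static int t_add(Vm *vm)  { vm->r0 += vm->r1; return 1; }
static int t_mul(Vm *vm)  { vm->r0 *= vm->r1; return 1; }
static int t_halt(Vm *vm) { (void)vm; return 0; }

/* The driver loop replaces the chain of tail calls: fetch, advance,
   execute. No recursion, so no reliance on compiler TCO. */
int64_t trampoline(Vm *vm) {
    for (;;) {
        TOp op = *vm->pc++;
        if (!op(vm)) return vm->r0;
    }
}
```

The trade-off is an extra call/return and a loop branch per operation compared to the pure tail-call version, so it is slower, but it can't overflow the stack.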
Impressive list of constrained targets for embedded. The ATmega1284 microcontroller, for example, has only 16 KB of RAM, which is a lot for an 8-bit micro but tiny compared to a modern application processor.
ridiculous_fish | 6 years ago
> Since operations all have a standardized signature and arguments are tail-call passed through to the next, the M3 "virtual" machine registers end up mapping directly to real CPU registers.
This is some black magic, if it works!
vnorilo | 6 years ago
1: http://www.emulators.com/docs/nx25_nostradamus.htm
ridiculous_fish | 6 years ago
2. Some platforms prohibit creating new executable pages, which prevents JITing.
3. Memory savings!
setheron | 6 years ago
Tbh, I couldn't get the eureka moment though. Might try to read in the AM ;)
thermals | 6 years ago
https://en.wikipedia.org/wiki/Threaded_code
http://www.complang.tuwien.ac.at/forth/threaded-code.html
You can see an example of this particular implementation style (where each operation is a tail call to a C function, passing the registers as arguments) at the second link above, under "continuation-passing style".
One of the big advantages of a threaded interpreter is relatively good branch prediction. A simple switch-based dispatch loop has a single indirect jump at its core, which is almost entirely unpredictable -- whereas threaded dispatch puts a copy of that indirect jump at the end of each opcode's implementation, giving the branch predictor way more data to work with. Effectively, you're letting it use the current opcode to help predict the next opcode!
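A stripped-down sketch of that continuation-passing style in C (invented op set; the real wasm3 handlers pass more state, such as a memory pointer, and mix immediates into the instruction stream). Every operation shares one signature, each handler ends with a call to the next one in tail position, and each handler contains its own indirect-branch site, which is exactly what gives the branch predictor more to work with:

```c
#include <stdint.h>

typedef struct Slot Slot;
/* One standardized signature: instruction pointer plus the VM's
   "register file" passed as arguments. With optimization the tail
   calls become jumps and r0/r1 stay pinned in real CPU registers. */
typedef int64_t (*Op)(const Slot *pc, int64_t r0, int64_t r1);
struct Slot { Op op; };

#define NEXT(pc, r0, r1) ((pc)->op((pc) + 1, (r0), (r1)))

static int64_t op_add(const Slot *pc, int64_t r0, int64_t r1)  { return NEXT(pc, r0 + r1, r1); }
static int64_t op_mul(const Slot *pc, int64_t r0, int64_t r1)  { return NEXT(pc, r0 * r1, r1); }
static int64_t op_halt(const Slot *pc, int64_t r0, int64_t r1) { (void)pc; (void)r1; return r0; }

int64_t run(const Slot *program, int64_t r0, int64_t r1) {
    return program->op(program + 1, r0, r1);
}
```

On clang, the `musttail` return attribute can turn the tail call from a hoped-for optimization into a hard guarantee.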
29athrowaway | 6 years ago
It would be interesting to see how this was designed with security in mind.
DarthGhandi | 6 years ago
Struggling to see it.
MuffinFlavored | 6 years ago
https://github.com/wasm3/wasm3/blob/master/test/benchmark/co...
59.5x faster than node.js at what? Executing WebAssembly?