
Register-based VMs have a higher performance than stack-based VMs

119 points | frjalex | 9 years ago | arxiv.org

70 comments

[+] lacampbell|9 years ago|reply
I would love to see the code used for some of these instructions. Their description of "swap" for the stack VM jumped out at me:

> pop 1 st and 2nd value out of local stack and swap them

You don't need to pop at all for swap. The stack isn't actually changing size and can be modified in place. You also should only have one pop for 'add', etc. A lot of people don't seem to realise this.

Also, this article may be of interest:

https://blogs.msdn.microsoft.com/ericlippert/2011/11/28/why-...

TL;DR - a stack machine is conceptually simpler than a register machine, and it doesn't matter if it's slower since you are JIT'ing it anyway.

[+] kjksf|9 years ago|reply
It does matter even for jitting.

Pike and Winterbottom (http://www.vitanuova.com/inferno/papers/hotchips.html) used a register-based VM for Plan 9/Inferno because it's much faster and easier to write a high-quality JIT from a register-based VM to a register-based CPU (and all of them are register-based) than from a stack-based byte-code VM to a register-based CPU.

This is probably the same reason Android chose a register-based VM design for their Java-based platform.

Jitting is not free - an increase in JIT complexity is paid for by lower performance of the code being jitted. Java's HotSpot does wonders, but it costs significant latency and memory use to get the really high-quality results.

It makes sense to do as much work as possible where you have time and CPU cycles to spare (during compilation to VM instruction set) and not on the device when every millisecond and every byte matters.

Lower complexity of stack-based VM is a moot point - when you're jitting, you're doing the hard stuff anyway.

[+] naasking|9 years ago|reply
> TL;DR - a stack machine is conceptually simpler than a register machine, and it doesn't matter if it's slower since you are JIT'ing it anyway

It totally matters. See [1] for previous work on this topic. You need to perform expensive data flow optimisations to generate information the stack format doesn't need to encode, but that a register format does need. Better to do this analysis ahead of time in the compiler than at runtime.

[1] https://www.usenix.org/events/vee05/full_papers/p153-yunhe.p...

[+] Merad|9 years ago|reply
IIRC, strictly speaking stack machines are not allowed to modify values in place on the stack. Any decent JIT or interpreter is likely to recognize that it can optimize certain operations by performing them in place, but the bytecode/IL/whatever will always be "pop two values and swap them".
[+] white-flame|9 years ago|reply
Many stack systems boast about the large number of operations they can perform per second. However, a stack "operation" is not equivalent to a register machine operation.

Something like "reg1 = reg2 + 125" is a single instruction in almost every register ABI, while at the very least a stack machine would need two instructions: "push 125; add". If reg2 isn't already sitting at the stack head, and reg1 isn't where the result lands, you also need load/store/swap/roll/etc. instructions wrapping the addition.

That's up to 6 (or even more) stack instructions to perform what a register machine can do in 1.

Also, when talking about interpreters, there is a not-insignificant overhead of fetching and dispatching the next instruction. That overhead is continually compounded when you need multiple instructions to perform what you could do in a single operation otherwise. The simpler the instructions are, the higher percentage of time is "wasted" performing instruction dispatch instead of carrying out core instruction processing. This can be parallelized in hardware instruction dispatch, but is pretty firmly stuck being serial in software interpreters.

Stack machines have very concise source code representations, and their individual instructions are simple and fast. But they can be very bloated in terms of the number of native instructions executed to carry out an overall program's work.

[+] vilda|9 years ago|reply
You seem to be conflating theoretical stack machines with their practical implementations. There's no reason why a stack-based machine cannot have a single instruction to add a constant, for example "addc 125" to add 125 to the top of the stack.

Similarly, there's no reason why register-based VMs wouldn't have push/pop instructions.

[+] Someone|9 years ago|reply
The main result is hardly surprising. https://www.usenix.org/legacy/events/vee05/full_papers/p153-... (2005) measured the speed difference at 26.5-32.3%. Given the subject, that's not significantly different from the 20.39% reported here (and why do they even report this with two decimals? I'm sure that last digit will change if they do so much as sneeze at their code)
[+] throwaway4891a|9 years ago|reply
Yup. Stack-based is great for ease of use as a high-level language VM interface (i.e. JVM languages), but generally sucks at the hardware level because most CPUs are register-based: the mapping doesn't scale as well because of the extra overhead of optimal scheduling. A good register-based VM would have a large number of registers/(pseudo)variables and support sequence points to allow actual instructions to be scheduled more optimally. Atomics are also nice in certain edge cases, but being mostly immutable / lock-free / branch-free is typically a better strategy to avoid contention and pipeline stalls.
[+] dfox|9 years ago|reply
I see a few flaws in their approach:

The stack-based virtual machine represents instructions as two-field structs that are essentially predecoded (and larger in memory), while the register VM uses what boils down to an array of words, which are then presumably decoded at run time. (The paper even contains some kind of argument for why the code is represented as an array of single-field structs, but completely neglects to mention what would happen if the structs contained pre-decoded instructions.)

The function call mechanism of their stack-based VM seems to be essentially modeled on how Forth works, which is not how a typical stack-based VM works for languages that don't directly expose the VM semantics. A typical stack-based VM (e.g. Smalltalk, JVM, CPython...) allocates a new argument stack for each function invocation (often using alloca() or such, i.e. on the native control stack) and does not directly use the VM argument stack to pass parameters and results between functions.

[+] lacampbell|9 years ago|reply
How do most stack-based VMs represent instructions? When I went through a phase of writing VMs, the approach I used was to represent everything with a single byte. If a "push" byte was encountered, the interpreter would scan ahead 8 more bytes and construct an int. So it had the overhead of having to decode and encode actual data, but the instruction stream was very dense.

I imagine in the real world, they use something like

    union instruction_packet {
        int64_t value;
        int8_t instructions[8];
    };
Which would be less space efficient (when you hit a push instruction you'd need to move to the next packet, even if it was the first element of the array) but gets rid of the decoding/encoding problem.
[+] bogomipz|9 years ago|reply
Forth is a register-based VM, right?

Does it use the VM stack as well instead of the native stack?

[+] jamesu|9 years ago|reply
Having implemented both a stack-based and a register-based VM, I'd agree with the sentiment in the title. The register-based implementation (based partly on Lua 5 bytecode) was notably faster, which I attributed to improved use of CPU cache combined with each instruction doing more work per memory fetch.

However, I'd also point out the register-based VM was far more difficult to debug, plus the compiler was much more long-winded to account for register allocation. Plus, by the time you add on function dispatch overhead and real-world usage patterns, the gap where the register machine is technically better (i.e. besides just doing fibonacci) narrows somewhat.

[+] userbinator|9 years ago|reply
Not surprising. Register-based machines in general have higher performance than stack-based machines, because the register set can be addressed randomly like RAM whereas in a 'strict' stack machine this isn't possible.

https://en.wikipedia.org/wiki/Stack_machine#Performance_disa...

Of course the advantages such as easy code generation and high code density make stack machines a good intermediate compilation target, which then gets compiled to a register machine for actual execution.

[+] bogomipz|9 years ago|reply
>"Of course the advantages such as easy code generation and high code density make stack machines a good intermediate compilation target, which then gets compiled to a register machine for actual execution"

Are you referring to a specific language here, or are you saying all register machines always compile to stack machines first?

[+] transfire|9 years ago|reply
I should imagine implementing a register-based VM on a register-based CPU is going to have some advantages over a stack-based VM implemented on a register-based CPU. I wonder how well a register-based VM would fare on a stack-based CPU.
[+] shincert|9 years ago|reply
That's interesting. Are there any actual stack-based CPUs out there today?
[+] bogomipz|9 years ago|reply
>"I should imagine implementing a register-based VM on a register-based CPU is going to have some advantages over a stack-based VM implemented on a register-based CPU."

Can you elaborate why is that? What would those advantages be in practical terms?

[+] Lerc|9 years ago|reply
I remember reading a paper a while ago with similar conclusions.

It concluded that register-based was faster, but stack-based had better code density. The reason seemed to be the branch-prediction cost per instruction: fewer but more powerful instructions reduced the number of times the instruction decode operation occurred.

[+] sitkack|9 years ago|reply
My feeling is that has more to do with memory traffic and the cache hierarchy than the difference between register and stack VMs. Stack VMs are going to generate more read-modify-write cycles than a register VM. Of course experiments are in order (or out of order, ha!).
[+] questerzen|9 years ago|reply
I imagine you are right. But there should be plenty of scope for optimising this in the VM itself. Most stack operations are immediately preceded by a push, so hold the top of the stack in a register and the last push in another and you could avoid a high proportion of memory calls in the most common cases. Does anyone know if JVM implementations do this?
[+] questerzen|9 years ago|reply
From the Wikipedia page on the JVM: "[code verification]...allowing the JIT compiler to transform stack accesses into fixed register accesses. Because of this, that the JVM is a stack architecture does not imply a speed penalty for emulation on register-based architectures when using a JIT compiler."
[+] erikb|9 years ago|reply
This reminds me a little of the "NoSQL is faster than SQL" argument a few years back, or "Firefox is faster than Chrome". Let's just add a "now" and we're fine. Most programs we have now are so complex that none of them is really optimized to the max. Therefore the question is basically who spends more of their resources on optimization.
[+] iopq|9 years ago|reply
But some problems are very tough to solve in some languages. Like in Java, it's very hard to avoid allocating everything on the heap. In C++/Rust it's trivial so you can get that performance boost without crude hacks.
[+] CalChris|9 years ago|reply
To quote another post:

> generally sucks at the hardware level because most CPUs are register-based

Whoa, hold on there folks. This is not a CPU register VM. It's a 'register-based VM' and it isn't even that.

  iconst 1    vs    set t3, t1
  iconst 2          set t4, t2
  iadd              add t3, t4, t5
t1 through t5 are allocated in the stack frame. They are NOT magically somehow mapped to x86_64 registers R8-12.

Actually, calling this a register-based VM is just wrong. It's lazy thinking but really it's just wrong. The value of t1 will be in memory.

Yes, it's standard terminology, but it easily lets people misunderstand what's happening under the hood, as in the quoted case above. The article doesn't even say "virtual registers", which would have helped. It could. It should. It doesn't.

BTW, a good interpreter will cache the top of stack in a CPU register.

[+] hyperpape|9 years ago|reply
I believe this is a standard bit of terminology: it means that the virtual machine is implemented with virtual registers. Google "lua registers", for instance.
[+] frjalex|9 years ago|reply
There's a specific issue with implementing/caching the top of the stack in registers. (I'm assuming you're talking about something like C's `register` keyword here.) CPU registers don't have memory addresses, so it would be basically impossible to achieve that here: in this paper's stack machine implementation, void pointers are stored on the stack.