top | item 20389554

earlz | 6 years ago

> One would think, for example, that it would make sense to do the "instruction decode" pass ahead of time, to end up with an array of pointers-to-instructions plus literal-values... but the resulting representation of the code is usually much larger in memory, and so less of it will fit in cache (and it'll also fight the VM interpreter itself for cache-lines.) You might gain from your instruction impls not having to trampoline back to the interpreter (https://en.wikipedia.org/wiki/Threaded_code, basically), but you'll lose in cache coherence.

As someone currently writing a (subset of an) x86 VM, I feel this pain entirely too much. My subset simplifies things considerably by not supporting segment registers and by (mostly) not having to implement the 16-bit form of ModR/M.

The biggest problem with x86 is the sheer number of addressing forms a single opcode can take through its ModR/M operand. For instance, all of these instructions can share the same primary opcode:

    push eax
    push [eax]
    push [1000]
    push [1000 + eax]
    push [0x11 + (eax * 2 + ecx)]
    push [0x11223344 + (eax * 8 + ecx)]
    push [(eax * 8 + ecx) - 10]

If not for the ModR/M and SIB operands, x86 would be close to a fixed-width instruction set.

I'm building an interpreter that decodes ahead of time into a pipeline (really a "basic block") and then executes the whole pipeline with minimal branching. I'm less afraid of the extra cache use than of the unpredictable-indirect-branch problem. The hope is that building a pipeline and executing it with a branchless unrolled loop will avoid that problem, while also greatly simplifying each opcode's implementation: the actual logic just receives a set of operands it can get or set.
