beariish | 6 months ago

I did experiment with a few different dispatch methods before settling on the one in Bolt now, though not with tail calls specifically. I chose the current approach largely because, in my testing, it competes with computed-goto solutions while also compiling on MSVC, but I'm absolutely open to trying other things out.

UncleEntity|6 months ago

From my research into the subject, the easiest way to implement it would be a 'musttail' macro that falls back to a trampoline on compilers that don't support it. The problem then becomes the function-call overhead on each and every opcode on the unsupported systems (assuming the compiler can't figure out what's going on and do the tail-call optimization anyway), which is probably slower than just a Big Old Switch -- which, apparently, modern compilers are pretty good at optimizing.
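A minimal sketch of that macro-plus-trampoline idea, under stated assumptions: the `NEXT` macro, the `VM` struct, the `handler_fn` type and the three opcodes are all made up for illustration and are not taken from Bolt or any other VM. Where the compiler reports the `musttail` attribute, `NEXT` tail-calls the next handler; elsewhere it returns to a driver loop, which is where the per-opcode call overhead comes from.

```c
#include <assert.h>
#include <stdint.h>

/* Probe for musttail in two steps so compilers without
   __has_attribute (e.g. MSVC) never see the function-like form. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define HAVE_MUSTTAIL 1
#  endif
#endif

typedef struct VM {
    const uint8_t *pc;   /* next instruction */
    int64_t stack[16];   /* tiny fixed stack, no overflow checks in this sketch */
    int sp;              /* number of live stack slots */
} VM;

/* Handlers return 1 to ask the driver loop to continue, 0 to halt. */
typedef int handler_fn(VM *vm);

enum { OP_PUSH, OP_ADD, OP_HALT };   /* hypothetical opcodes */

static handler_fn op_push, op_add, op_halt;
static handler_fn *const handlers[] = { op_push, op_add, op_halt };

#ifdef HAVE_MUSTTAIL
/* Guaranteed tail call: jump straight into the next handler. */
#  define NEXT(vm) __attribute__((musttail)) return handlers[*(vm)->pc](vm)
#else
/* Trampoline fallback: go back to the driver loop in run(). */
#  define NEXT(vm) return 1
#endif

static int op_push(VM *vm) {
    vm->stack[vm->sp++] = vm->pc[1];   /* one-byte immediate operand */
    vm->pc += 2;
    NEXT(vm);
}

static int op_add(VM *vm) {
    vm->sp--;
    vm->stack[vm->sp - 1] += vm->stack[vm->sp];
    vm->pc += 1;
    NEXT(vm);
}

static int op_halt(VM *vm) {
    (void)vm;
    return 0;
}

/* The same driver serves both modes: with musttail the first call
   only returns once op_halt runs, so the loop body executes once. */
static int64_t run(const uint8_t *code) {
    VM vm = { code, {0}, 0 };
    while (handlers[*vm.pc](&vm)) { }
    assert(vm.sp > 0);   /* program must leave a result on the stack */
    return vm.stack[vm.sp - 1];
}
```

Calling `run` on the bytecode `{ OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT }` goes through either path depending on the compiler, so the two can be compared by building the same source with different toolchains.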

The VM I've been poking at is I/O bound so the difference (probably) isn't even measurable over the overhead of reading a file. I went with a pure 'musttail' implementation but didn't do any sort of performance measurements so who knows if it's better or not.

mananaysiempre|6 months ago

There’s one thing that tail calls do that no other approach to interpreters outside assembly really can, and that is decent register allocation. Current compilers only ever try to allocate registers for one function at a time, and somehow that invariably leads them to do a bad job when given a large blob of a single interpreter function. This is especially true if you don’t isolate your cold paths into separate functions marked uninlineable (and preferably preserve_all or the like). Just look at the assembly and you’ll usually find that it sucks.

(Whether the blob uses computed gotos or loop-switch is less important these days, because Clang [but not GCC] is often smart enough to actually replicate your dispatch in the loop-switch case, avoiding the indirect branch prediction problem that in the past meant computed gotos were preferable. You do need to verify that this optimization actually happens, though, because it can be temperamental sometimes[1].)
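For reference, the two blob styles look roughly like this. This is only a sketch on a made-up three-opcode accumulator machine (`OP_INC`, `OP_DEC`, `OP_HALT`, `run_switch` and `run_goto` are all hypothetical names), and the computed-goto version relies on the GNU C labels-as-values extension, so it is guarded and falls back to the switch version elsewhere.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_INC, OP_DEC, OP_HALT };   /* hypothetical opcodes */

/* Loop-switch: a single dispatch point at the top of the loop,
   unless the compiler decides to replicate it per case. */
static int64_t run_switch(const uint8_t *pc) {
    int64_t acc = 0;
    for (;;) {
        switch (*pc++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        default:      return -1;   /* unknown opcode */
        }
    }
}

#if defined(__GNUC__)
/* Computed goto (GCC and Clang): the dispatch is replicated by hand
   after every opcode body. Assumes well-formed bytecode, since an
   unknown opcode would index past the label table. */
static int64_t run_goto(const uint8_t *pc) {
    static void *const labels[] = { &&do_inc, &&do_dec, &&do_halt };
    int64_t acc = 0;
    goto *labels[*pc++];
do_inc:
    acc++;
    goto *labels[*pc++];
do_dec:
    acc--;
    goto *labels[*pc++];
do_halt:
    return acc;
}
#else
/* No labels-as-values (e.g. MSVC): reuse the switch version. */
static int64_t run_goto(const uint8_t *pc) {
    return run_switch(pc);
}
#endif
```

Whether the compiler actually duplicated the dispatch in `run_switch` can only be confirmed by reading the generated assembly, for example by counting the indirect jumps.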

By contrast, tail calls with the most important interpreter variables turned into function arguments (few enough to fit into registers per the ABI; remember to use regparm or fastcall on x86-32) give the compiler the opportunity to allocate registers for each bytecode’s body separately. This usually allows it to do a much better job, even if putting the cold path out of line is still advisable. (Somehow I’ve never thought to check if it would be helpful to also mark those functions preserve_none on Clang. Seems likely that it would be.)
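A rough sketch of that shape, with everything in it assumed for illustration: the accumulator machine, the `VM` struct, the opcode names, and the `MUSTTAIL` and `COLD_NOINLINE` macros are hypothetical. The hot state (`pc`, `acc`) travels as arguments, and the error path lives in a separate non-inlined function. Where `musttail` is unavailable the macro expands to nothing and the calls are ordinary ones, which is only acceptable here because the demo programs are a few instructions long.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL   /* plain call: stack grows per opcode, demo use only */
#endif

#if defined(__GNUC__)
#  define COLD_NOINLINE __attribute__((noinline, cold))
#else
#  define COLD_NOINLINE
#endif

typedef struct VM { int error; } VM;   /* rarely touched state stays in memory */

/* Hot state is passed as arguments so it can live in registers. */
typedef int64_t handler_fn(VM *vm, const uint8_t *pc, int64_t acc);

enum { OP_ADDI, OP_DIVI, OP_HALT };   /* hypothetical opcodes */

static handler_fn op_addi, op_divi, op_halt;
static handler_fn *const handlers[] = { op_addi, op_divi, op_halt };

#define DISPATCH() MUSTTAIL return handlers[*pc](vm, pc, acc)

/* Cold path kept out of line so it does not disturb the register
   allocation of the handler that calls it. */
COLD_NOINLINE static int64_t fail_div_by_zero(VM *vm) {
    vm->error = 1;
    return 0;
}

static int64_t op_addi(VM *vm, const uint8_t *pc, int64_t acc) {
    acc += pc[1];   /* one-byte immediate */
    pc += 2;
    DISPATCH();
}

static int64_t op_divi(VM *vm, const uint8_t *pc, int64_t acc) {
    if (pc[1] == 0)
        return fail_div_by_zero(vm);
    acc /= pc[1];
    pc += 2;
    DISPATCH();
}

static int64_t op_halt(VM *vm, const uint8_t *pc, int64_t acc) {
    (void)vm;
    (void)pc;
    return acc;
}

/* Entry point: starts with an accumulator of zero. */
static int64_t run(VM *vm, const uint8_t *code) {
    vm->error = 0;
    return handlers[*code](vm, code, 0);
}
```

Each handler is now a small function whose live values are exactly its arguments, which is the property the register allocator benefits from; comparing the assembly of `op_addi` against the matching case in a single big interpreter function is the way to check whether it pays off for a given compiler.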

[1] https://blog.nelhage.com/post/cpython-tail-call/

nolist_policy|6 months ago

Take a look at the Nostradamus Distributor:

http://www.emulators.com/docs/nx25_nostradamus.htm