I did experiment with a few different dispatch methods before settling on the one in Bolt now, though not with tail calls specifically. I chose the current approach largely because, in my testing, it competes with computed-goto solutions while also compiling on MSVC, but I'm absolutely open to trying other things out.
UncleEntity|6 months ago
The VM I've been poking at is I/O bound so the difference (probably) isn't even measurable over the overhead of reading a file. I went with a pure 'musttail' implementation but didn't do any sort of performance measurements so who knows if it's better or not.
mananaysiempre|6 months ago
(Whether the blob uses computed gotos or loop-switch is less important these days, because Clang [but not GCC] is often smart enough to actually replicate your dispatch in the loop-switch case, avoiding the indirect branch prediction problem that in the past meant computed gotos were preferable. You do need to verify that this optimization actually happens, though, because it can be temperamental sometimes[1].)
By contrast, tail calls with the most important interpreter variables turned into function arguments (few enough to fit into registers per the ABI; remember to use regparm or fastcall on x86-32) give the compiler the opportunity to allocate registers for each bytecode’s body separately. This usually allows it to do a much better job, even if putting the cold path out of line is still advisable. (Somehow I’ve never thought to check if it would be helpful to also mark those functions preserve_none on Clang. Seems likely that it would be.)
[1] https://blog.nelhage.com/post/cpython-tail-call/
nolist_policy|6 months ago
http://www.emulators.com/docs/nx25_nostradamus.htm