top | item 33717498

(no title)

I agree with your point, but I want to point out that Deegen also provided APIs to hide all the details of inline caching. The bytecode only specifies the body lambda and the effect lambdas, and Deegen lowers it to an efficient implementation automatically, including exotic optimizations that fuses the cached IC effect into the opcode so one indirect branch can be avoided.

If LuaJIT interpreter were to employ IC, it would have to undergo a major rewrite (due to its assembly nature) to have the equally efficient code as LJR (that we generate automatically). This is one advantage of our meta-compiler approach.

Finally, although no experiment is made, my subjective opinion is already in the article:

> LuaJIT’s hand-written assembly interpreter is highly optimized already. Our interpreter generated by Deegen is also highly optimized, and in many cases, slightly better-optimized than LuaJIT. However, the gain from those low-level optimizations are simply not enough to beat LuaJIT by a significant margin, especially on a modern CPU with very good instruction-level parallelism, where having a few more instructions, a few longer instructions, or even a few more L1-hitting loads have negligible impact on performance. The support of inline caching is one of the most important high-level optimizations we employed that contributes to our performance advantage over LuaJIT.

That is, if you compare the assembly between LJR and LuaJIT interpreter, I believe LJR's assembly is slightly better (though I would anticipate only marginal performance difference). But that's also because we employed a different boxing scheme (again...). If you force LJR to use LuaJIT's boxing scheme, I guess the assembly code would also be similar since LJR's hand-written assembly is already optimal or at least very close to optimal.

discuss

No comments yet.