olliej | 1 year ago

I think my favorite introduction to just how insane the pipelines, predictors, and other machinery in "modern" CPUs (more than a decade old now) are was trying to improve `Math.sqrt()` performance in JSC. This was during the first generation of JIT JS engines (no one was inlining functions yet), and I was replacing the host implementation of Math.sqrt with a pure assembly version: calling a host function was significantly more expensive than calling another JS function, i.e. a JIT'd JS function calling another JIT'd JS function was significantly faster than a JIT'd JS function calling host code (C/C++). As part of doing that I was going step by step through each instruction, making sure the overhead was minimal at each step. Think something like (very approximate; again, more than a decade ago):

  v0:
    1. if (input not a number)
       fallback to C++; else
    2. return tagged 0; // Just making sure the numeric check was optimal

  v1:
    1. As above
    2. If integer
       convert to float
    3. return tagged 0

  v2:
    1-2. as above
    3. If negative
       return tagged nan
    4. Return tagged 0

  v3:
    1-3. as above
    4. use the sqrt instruction
    5. return tagged 0

  v4:
    1-4. as above
    5. move <4> back to an integer register
    6. return tagged 0

  v5:
    1-5. as above
    6. tag the result of sqrt
    7. return tagged 0

  v6:
    1-6. as above
    7. Actually return/store the result of <6>
Alas I cannot recall whether at this point return values were going into the heap-allocated VM call stack, or whether the return was via rax, but that's not the bit that was eye-opening to me.
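The steps the versions above build up can be sketched in C. The tagging scheme below is purely hypothetical for illustration (JSC's real encoding, NaN-boxing on 64-bit, differs in detail), and step 1 (the non-number fallback to C++) is assumed to have been handled by the caller:

    #include <math.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical 64-bit tagging scheme, for illustration only:
       low bit set => 63-bit integer, otherwise raw double bits. */
    typedef uint64_t Value;

    static bool    is_int(Value v)       { return v & 1; }
    static int64_t untag_int(Value v)    { return (int64_t)v >> 1; }
    static Value   tag_int(int64_t i)    { return ((uint64_t)i << 1) | 1; }
    static Value   tag_double(double d)  { Value v; memcpy(&v, &d, sizeof v); return v; }
    static double  untag_double(Value v) { double d; memcpy(&d, &v, sizeof d); return d; }

    /* The fast path built up in v1..v6, assuming step 1 (non-numbers
       fall back to the C++ host implementation) already happened:
         2. integer input is converted to a float
         3. negative input returns NaN
         4. the hardware sqrt does the real work
         5-6. the result is moved back and tagged
         7. ...and, crucially, actually returned */
    static Value math_sqrt_fast_path(Value v) {
        double d = is_int(v) ? (double)untag_int(v) : untag_double(v);
        if (d < 0.0)
            return tag_double(NAN);
        return tag_double(sqrt(d));
    }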

I had a benchmark that was something like

    for (var i = 0; i < large number; i++)
       Math.sqrt(i)
Note that while I was working on this there was no meaningful control flow analysis, inlining, etc., so this was an "effective" benchmark for perf work at the time; it would not be today.
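The lesson generalizes beyond JIT'd JS: any benchmark whose result is never consumed can have its work skipped, whether by a compiler's dead-code elimination or, as here, by the CPU retiring work nothing depends on. A minimal C sketch of the usual fix (my own illustration, not the original benchmark) is to feed every result into a sink that the caller actually uses:

    #include <math.h>

    /* Summing into a value the caller consumes keeps each sqrt on a
       live dependency chain, so neither the compiler nor the CPU can
       treat the work as dead. */
    double sqrt_benchmark(long n) {
        double sink = 0.0;
        for (long i = 0; i < n; i++)
            sink += sqrt((double)i);  /* each result feeds the sink */
        return sink;                  /* the result is actually used */
    }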

The performance remained "really good" (read: fake) until the `v6` version that actually stored/returned the result of the work. It was incredibly eye-opening to see just how much code could be "executed" before the CPU actually ended up doing any work, and it significantly impacted my approach to codegen in the future.

My perspective at the time was "I know there's a significant marshaling cost to calling host code, and I know the hardware sqrt is _very_ fast", so a 5-10x perf improvement seemed plausible to me (because marshaling was legitimately very expensive). I can't recall where in the 5-10x range the improvement appeared to be, but once the final store/return was in place it dropped to only 2x faster. That was still a big win, but seeing just how much work the CPU could simply avoid doing while I was building out the code was a significant learning experience.
