Do you have a source for “with very little observer effect”? I don’t know better, it just seems like a big assumption the CPU can emit all this extra stuff without behaving differently.
Trace data are sent through a large/fast port (PCIe or a 60-pin connector) and captured using fast dedicated hardware at something like 10 GB per second. The trace data are usually compressed and often only need to indicate whether a branch is taken or not taken (TNT packets on x86; Arm uses ETM, but the trace path is similar enough), plus a little overhead for timing, exceptions/interrupts, and addresses. The bottleneck is streaming and storing trace data from a hardware debugger (its internal buffer usually holds under half a second at max throughput), although on Intel processors you can filter further by application via CR3 matching. (Regarding the last five years of Apple: I'm not sure you'll find any info on Apple's debuggers and modifications to the Arm architecture. Ever.)
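To make the "one bit per branch" idea concrete, here is a toy sketch in Python. The packet layout is an assumption made up for illustration (a marker bit followed by up to six taken/not-taken bits per byte); it is not the actual Intel PT or Arm ETM encoding, and `pack_branches` / `unpack_branches` are hypothetical names.

```python
# Toy illustration only: NOT the real Intel PT or Arm ETM packet layout.

def pack_branches(outcomes, bits_per_packet=6):
    """Pack taken/not-taken outcomes into one-byte packets.

    Each packet is a leading marker bit (1) followed by up to
    `bits_per_packet` outcome bits, oldest branch first, so the
    decoder can tell how many outcomes the packet carries.
    """
    packets = []
    for i in range(0, len(outcomes), bits_per_packet):
        value = 1  # marker bit
        for taken in outcomes[i:i + bits_per_packet]:
            value = (value << 1) | (1 if taken else 0)
        packets.append(value)
    return bytes(packets)


def unpack_branches(packets):
    """Recover the outcome list from packets made by pack_branches."""
    outcomes = []
    for value in packets:
        count = value.bit_length() - 1  # outcome bits below the marker
        for shift in range(count - 1, -1, -1):
            outcomes.append(bool((value >> shift) & 1))
    return outcomes


branches = [True, False, True, True, False, False, True, False, True, True]
trace = pack_branches(branches)
assert unpack_branches(trace) == branches
print(len(branches), "branches ->", len(trace), "bytes")
```

The point is only that conditional branches cost roughly a bit each in the stream, which is why the timing, exception, and address packets make up most of the remaining overhead.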
If you encounter a slowdown using RTIT or IPT (the old and new names for hardware trace), it's usually a single-digit percentage. (The sources here are Intel's vague documentation claims plus anecdotes: Magic Trace, Hagen Paul Pfeifer, Andi Kleen, Prelude Research.)
Decoding happens later and is significantly slower, and this is where the article's focus, JIT compilation, might be problematic using hardware trace (as instruction data might change/disappear, plus mapping machine code output to each Java instruction can be tricky).
It's not an assumption; this is based on claims made by CPU manufacturers. It's possible to get it down to within 1-2% overhead.
Intuitively this works because the hardware can just spend some extra area to stream the info off on the side of the datapath -- it doesn't need to be in the critical path.
PennRobotics|4 months ago
scottgg|4 months ago
achierius|4 months ago