Trace data leave the chip through a wide, fast port (PCIe or a 60-pin connector) and are captured by fast dedicated hardware at something like 10 GB per second. The stream is usually compressed and often only needs to record whether each branch was taken or not taken (TNT packets on x86; Arm uses ETM, but the trace path is similar enough), plus a little timing, exception/interrupt, and address overhead. The bottleneck is streaming trace data off a hardware debugger and storing it, since the debugger's internal buffer usually holds under half a second of trace at maximum throughput. On Intel processors you can reduce the volume further by filtering per application via CR3 matching. (Regarding the last five years of Apple: I'm not sure you'll find any info on Apple's debuggers and modifications to the Arm architecture. Ever.)
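To make the "taken or not taken" point concrete, here is a minimal sketch of unpacking one short TNT byte. It assumes the layout as I understand it from Intel's packet description (bit 0 is a 0 header bit, the highest set bit is a stop marker, and the 1 to 6 bits in between are branch outcomes, oldest first); `decode_short_tnt` is a made-up name, and a real decoder such as libipt handles many more packet types.

```python
def decode_short_tnt(byte: int) -> list[bool]:
    """Return branch outcomes (True = taken), oldest first, from one short TNT byte.

    Assumed layout (bit 7 .. bit 0): leading zeros, a 1 stop bit,
    1 to 6 TNT bits, then a 0 header bit.
    """
    # Odd bytes are other packet types; 0x00 and 0x02 carry no TNT bits
    # (they are treated here as "not a short TNT packet").
    if not 0 <= byte <= 0xFF or byte & 1 or byte < 4:
        raise ValueError("not a short TNT packet")
    payload = byte >> 1            # drop the header bit
    n = payload.bit_length() - 1   # bits below the stop bit are the TNT bits
    # Walk from just under the stop bit (oldest branch) down to bit 0 (youngest).
    return [bool((payload >> i) & 1) for i in range(n - 1, -1, -1)]


if __name__ == "__main__":
    print(decode_short_tnt(0b10101010))  # six branch outcomes from a single byte
```

Up to six conditional branches fit in one byte this way, which is why the raw trace stays small relative to the instruction stream it describes.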

If you encounter a slowdown using RTIT or IPT (the old and new names for Intel's hardware trace), it's usually a single-digit percentage. (The sources here are Intel's vague documentation claims plus anecdotes: Magic Trace, Hagen Paul Pfeifer, Andi Kleen, Prelude Research.)

Decoding happens later and is significantly slower. This is where the article's focus, JIT compilation, might be a problem for hardware trace: the decoder has to walk the exact machine code that ran, and with a JIT that code might have changed or disappeared by decode time. Mapping the machine code output back to each Java instruction can also be tricky.
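A toy model of why the code image matters at decode time: the trace only holds taken/not-taken bits, so the decoder must replay them against the instructions themselves. Everything here (the tuple "instruction" format, `replay`, the two images) is invented for illustration and is not how any real decoder represents code.

```python
def replay(image, entry, tnt_bits):
    """Rebuild the executed path by walking `image` and consuming TNT bits.

    image maps an address to one of:
      ("cond", taken_target, fallthrough)  consumes one TNT bit
      ("jmp", target)                      leaves nothing in the trace
      ("ret",)                             ends the replay
    """
    bits = iter(tnt_bits)
    pc, path = entry, []
    while True:
        path.append(pc)
        insn = image[pc]
        if insn[0] == "cond":
            pc = insn[1] if next(bits) else insn[2]
        elif insn[0] == "jmp":
            pc = insn[1]
        else:
            return path


# The code as it was while the trace was being recorded.
image_at_runtime = {0: ("cond", 2, 1), 1: ("jmp", 3), 2: ("jmp", 3), 3: ("ret",)}
# The same addresses after a hypothetical JIT recompile moved things around.
image_at_decode = {0: ("cond", 1, 2), 1: ("jmp", 3), 2: ("ret",), 3: ("ret",)}

trace = [True]  # one conditional branch, taken
path_correct = replay(image_at_runtime, 0, trace)
path_stale = replay(image_at_decode, 0, trace)
```

The same trace bit yields two different paths, so a decoder for JIT-compiled code needs a snapshot of the generated code as it was when the trace was recorded.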


scottgg|4 months ago

Thanks! I didn’t realise it’s common for CPUs to rock dedicated hardware for this.