You can go even faster, to about 8ns (almost another 10x improvement), by using software perf events: PERF_COUNT_SW_TASK_CLOCK is thread CPU time. It can be read through a shared page (so no syscall; see perf_event_mmap_page), and you then add the delta since the last context switch with a single rdtsc inside a seqlock read loop.
This is not well documented unfortunately, and I'm not aware of open-source implementations of this.
EDIT: Or maybe not; I'm not sure whether PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. It definitely works for overall thread CPU time, though.
That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!
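For anyone wanting to picture the retry loop: the real reader snapshots time_offset / time_mult / time_shift from perf_event_mmap_page and mixes in rdtsc, but the seqlock discipline itself can be modeled with no perf plumbing at all. A toy Python sketch (all names are mine, and the GIL makes real tearing rare, but the retry logic is the same): the writer bumps a sequence counter before and after its update, and the reader discards any snapshot taken while the counter was odd or changed underneath it.

```python
import threading

class SeqLockPage:
    """Toy stand-in for perf_event_mmap_page: two fields that must be
    read as a consistent pair, guarded by a sequence counter."""
    def __init__(self):
        self.seq = 0          # odd while a write is in flight
        self.time_offset = 0
        self.time_mult = 0

    def write(self, offset, mult):
        self.seq += 1         # now odd: readers must not trust the fields
        self.time_offset = offset
        self.time_mult = mult
        self.seq += 1         # even again: update is complete

    def read(self):
        while True:
            s = self.seq
            if s & 1:
                continue      # writer mid-update, spin
            offset, mult = self.time_offset, self.time_mult
            if self.seq == s:
                return offset, mult   # nothing moved under us
            # counter changed: discard the torn snapshot and retry

page = SeqLockPage()
writer = threading.Thread(
    target=lambda: [page.write(i, i) for i in range(10_000)])
writer.start()
reads = [page.read() for _ in range(10_000)]
writer.join()

# by construction, every consistent snapshot has offset == mult
assert all(o == m for o, m in reads)
print(f"all {len(reads)} snapshots consistent")
```

In the real perf page the counter is pc->lock, and compiler barriers replace what the GIL incidentally provides here.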
Me: looks at my code. "sure, ok, looks alright."
Me: looks at the resulting flamegraph. "what the hell is this?!?!?"
I've found all kinds of crazy stuff in codebases this way. Static initializers that aren't static, one-line logger calls that trigger expensive serialization, heavy string-parsing calls that don't memoize patterns, etc. Unfortunately some of those are my fault.
I also like icicle graphs for this. They're flamegraphs, but aggregated in the reverse order. (I.e. if you have calls A->B->C and D->E->C, then both calls to C are aggregated together, rather than being stacked on top of B and E respectively. It can make it easier to see what's wrong when you have a bunch of distinct codepaths that all invoke a common library where you're spending too much time.)
Regular flamegraphs are good too, icicle graphs are just another tool in the toolbox.
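To make the reverse aggregation concrete, here is a small sketch (function and variable names are mine, not from any particular profiler): the same two stacks from the example, folded root-first for a flamegraph and leaf-first for an icicle view.

```python
from collections import defaultdict

def aggregate(stacks, reverse=False):
    """Fold (stack, sample-count) pairs into prefix -> total samples.
    reverse=True flips each stack so leaves aggregate first (icicle view)."""
    counts = defaultdict(int)
    for stack, samples in stacks:
        frames = list(reversed(stack)) if reverse else list(stack)
        for i in range(1, len(frames) + 1):
            counts[tuple(frames[:i])] += samples
    return dict(counts)

stacks = [(["A", "B", "C"], 5), (["D", "E", "C"], 7)]

flame = aggregate(stacks)                  # C stays split under B and E
icicle = aggregate(stacks, reverse=True)   # both C samples merge at the root

print(flame.get(("C",)))    # -> None: no single C entry in the flame view
print(icicle[("C",)])       # -> 12: the icicle view merges 5 + 7
```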
Also cool that when you open it in a new tab, the svg [0] is interactive! You can zoom in by clicking on sections, and there's a button to reset the zoom level.
I always found profiling performance critical code and experimenting with optimisations to be one of the most enjoyable parts of development - probably because of the number of surprises that I encountered ("Why on Earth is that so slow?").
I might be very wrong in every way, but string parsing and/or manipulation plus memoization sounds like a super strange combo. With the former you know you're already doing expensive allocations, and the latter isn't a pattern I really see outside of JS codebases. Could you provide more context on how this actually bit you in the behind? Memoizing strings seems like complicated and error-prone "welp, it feels better now" territory in my mind, so I'm genuinely curious.
Author here. After my last post about kernel bugs, I spent some time looking at how the JVM reports its own thread activity. It turns out that "What is the CPU time of this thread?" is/was a much more expensive question than it should be.
I don't think it's possible to talk about fractions of nanoseconds without an extremely good idea of the stability and accuracy of your clock. At best you could claim some kind of reduction, but it's very hard to make absolute claims without a massive amount of prep work to ensure the measured times are themselves accurate. You could be off by a large fraction and never know the difference. So unless there's a hidden atomic clock involved somewhere in these measurements, I think they should be qualified somehow.
Thanks for the write-up Jaromir :) For those interested, I explored memory overhead when reading /proc—including eBPF profiling and the history behind the poorly documented user-space ABI.
Hi Jonas, thanks for the work on OpenJDK and the post! I swear I hadn't seen your blog :) I finished my draft around Christmas and it’s been in the queue since. Great minds think alike, I guess.
edit: I just read your blog in full and I have to say I like it more than mine. You put a lot more rigor into it. I’m just peeking into things.
Why do you suppose it was originally written the way it was? To my eyes, that seems like a horrible approach. Doing file IO and parsing strings in every call? What?! And yet I assume the original author was a smart person who had a reason why this made sense to them, and my inability to guess why is my own limitation and not theirs.
> Click to zoom, open in a new tab for interactivity
I admit I did not expect "Open Image in New Tab" to do what it said on the tin. I guess I was aware that it was possible with SVG but I don't think I've ever seen it done and was really not expecting it.
Normally, I use the generator included in async-profiler. It produces interactive HTML. But for this post, I used Brendan’s tool specifically to have a single, interactive SVG.
Only for some clocks (CLOCK_MONOTONIC, etc) and some clock sources. For VIRT/SCHED, the vDSO shim still has to invoke the actual syscall. You can't avoid the kernel transition when you need per-thread accounting.
If you look below the vDSO frame, there is still a syscall. I think that the vDSO implementation is missing a fast path for this particular clock id (it could be implemented though).
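That fallback is easy to observe from userspace. A rough sketch (absolute numbers vary by machine, and Python's call overhead inflates both sides equally): average the cost of a vDSO-served clock against CLOCK_THREAD_CPUTIME_ID, which has to cross into the kernel.

```python
import time

def ns_per_call(clock_id: int, iters: int = 200_000) -> float:
    """Average wall-clock cost of one clock_gettime_ns call, in ns."""
    start = time.perf_counter_ns()
    for _ in range(iters):
        time.clock_gettime_ns(clock_id)
    return (time.perf_counter_ns() - start) / iters

mono = ns_per_call(time.CLOCK_MONOTONIC)           # vDSO fast path
cpu = ns_per_call(time.CLOCK_THREAD_CPUTIME_ID)    # falls back to a syscall

print(f"CLOCK_MONOTONIC:         {mono:6.0f} ns/call")
print(f"CLOCK_THREAD_CPUTIME_ID: {cpu:6.0f} ns/call")
```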
I really wish™ there were an API/ABI for userland- and kernelland-defined individual virtual files at arbitrary locations, backed by processes and kernel modules respectively. I've tried pipes, overlays, and FUSE to no avail. It would greatly simplify configuration-management implementations while maintaining compatibility with the convention of plain-text files, and there's often no need for an actual file on any media or the expense of IOPS.
While I don't particularly like the IO overhead and churn consequences of real files for performance metrics, I get the 9p-like appeal of treating the virtual fs as a DBMS/API/ABI.
It took seven years to address this concern after the initial bug report (2018). That seems like a long time, considering that reading CPU time can sit in the hot path of profiled code.
"look, I'm sorry, but the rule is simple:
if you made something 2x faster, you might have done something smart
if you made something 100x faster, you definitely just stopped doing something stupid"
Does anyone knowledgeable know whether it's possible to drastically reduce the overhead of reading from procfs? IIUC everything in it is in-memory, so there's no real reason reading some data should take on the order of 10µs.
Obviously a vdso read is going to be significantly faster than a syscall switching to the kernel, writing serialized data to a buffer, switching back to userland, and parsing that data.
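To put rough numbers on that (machine-dependent; helper names are mine): time a bare open/read/close of /proc/self/stat, with no parsing at all, against the equivalent clock_gettime. The gap is three extra syscalls plus the kernel rendering the text.

```python
import os
import time

def ns_per(fn, iters: int = 2_000) -> float:
    """Average cost of one fn() call in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iters):
        fn()
    return (time.perf_counter_ns() - start) / iters

def read_proc_stat():
    # open + read + close: three syscalls, plus the kernel formatting text
    fd = os.open("/proc/self/stat", os.O_RDONLY)
    try:
        os.read(fd, 4096)
    finally:
        os.close(fd)

def read_clock():
    time.clock_gettime_ns(time.CLOCK_THREAD_CPUTIME_ID)  # one syscall

proc_ns = ns_per(read_proc_stat)
clock_ns = ns_per(read_clock)
print(f"/proc/self/stat: {proc_ns:8.0f} ns   clock_gettime: {clock_ns:8.0f} ns")
```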
I don't think I've ever seen less than 10x speedup after putting some effort into improving performance of "organic"/legacy code. It's always shocking how slow code can be before anyone complains.
nly|1 month ago
Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?
Tbh I thought clock_gettime was a vdso based virtual syscall anyway
tempaccsoz5|1 month ago
[0]: https://questdb.com/images/blog/2026-01-13/before.svg
6r17|1 month ago
edit: I had an afterthought about this because it ended up being a low-quality comment. A TL;DR like this adds a lot of value to reading content, especially on HN, as it builds momentum and lets you focus on the reading itself. This short form felt like that cool friend who gives you a heads-up.
jonasn|1 month ago
Full details in my write-up: https://norlinder.nu/posts/User-CPU-Time-JVM/
jerrinot|1 month ago
edit2: I linked your article from my post.
kstrauser|1 month ago
So, why do you reckon they did that?
pjmlp|1 month ago
Very interesting read.
a-dub|1 month ago
here it gets the task struct: https://elixir.bootlin.com/linux/v6.18.5/source/kernel/time/... and here https://elixir.bootlin.com/linux/v6.18.5/source/kernel/time/... to here where it actually pulls the value out: https://elixir.bootlin.com/linux/v6.18.5/source/kernel/sched...
where here is the vdso clock pick logic https://elixir.bootlin.com/linux/v6.18.5/source/lib/vdso/get... and here is the fallback to the syscall if it's not a vdso clock https://elixir.bootlin.com/linux/v6.18.5/source/lib/vdso/get...
goodroot|1 month ago
Love the people and their software.
Great blog Jaromir!
Ono-Sendai|1 month ago
https://x.com/rygorous/status/1271296834439282690