top | item 44917042

sa46 | 6 months ago

Isn't gettimeofday implemented with vDSO to avoid kernel context switching (and therefore, most of the overhead)?

My understanding is that using the TSC directly is tricky: the rate might not be constant, and it can differ across cores. [1]

[1]: https://www.pingcap.com/blog/how-we-trace-a-kv-database-with...

toast0|6 months ago

I think most current systems have an invariant TSC. I skimmed your article and was surprised (though not totally shocked) to see an offset, but the rate looked the same.
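For anyone who wants to check their own box: on Linux/x86 the kernel reports TSC properties as flags in /proc/cpuinfo, and `constant_tsc` plus `nonstop_tsc` together are what is usually meant by "invariant". A minimal sketch, assuming a Linux-style cpuinfo layout (the function names are mine, purely for illustration):

```python
def tsc_flags(cpuinfo_text):
    """Return the TSC-related flags found in /proc/cpuinfo-style text."""
    wanted = {"tsc", "constant_tsc", "nonstop_tsc", "rdtscp"}
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Only the first CPU's flags line is inspected.
            return wanted & set(value.split())
    return set()


def has_invariant_tsc(cpuinfo_text):
    """True if both constant_tsc and nonstop_tsc are reported."""
    return {"constant_tsc", "nonstop_tsc"} <= tsc_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("invariant TSC:", has_invariant_tsc(f.read()))
    except OSError:
        print("no /proc/cpuinfo on this system")
```

This only says what the kernel believes about the hardware. It says nothing about per-core offsets.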

You could pin the thread that's reading the TSC to a single CPU, except you can't pin threads in OpenBSD :p
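On Linux the pinning part would look roughly like this. It's a sketch only: `pin_to_cpu` is a made-up helper name, and `os.sched_setaffinity` simply isn't available on platforms without that call, so the helper returns None there.

```python
import os


def pin_to_cpu(cpu=None):
    """Pin the caller to one CPU.

    Returns the resulting affinity set, or None where pinning is
    unsupported or refused.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform without the affinity calls
    try:
        allowed = os.sched_getaffinity(0)  # 0 means "the caller"
        target = min(allowed) if cpu is None else cpu
        os.sched_setaffinity(0, {target})
        return os.sched_getaffinity(0)
    except OSError:
        return None  # e.g. a sandbox that forbids changing affinity


if __name__ == "__main__":
    print("now restricted to:", pin_to_cpu())
```

With the reader stuck on one core, any offset between cores stops mattering for that thread.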

wahern|6 months ago

But just to be clear (for others), you don't need to do that because using RDTSC/RDTSCP is exactly how gettimeofday and clock_gettime work these days, even on OpenBSD. Where using the TSC is practical and reliable, the optimization is already there.
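On Linux you can see whether the kernel has actually picked the TSC as its time source by reading sysfs. A minimal sketch, assuming the usual `clocksource0` directory (the helper name and its `base` parameter are mine, added so it can be pointed at a test directory):

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources, or (None, []) if absent."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksource()
    print("current:", current, "available:", available)
```

If the current source is something like `hpet` or `acpi_pm` instead of `tsc`, the kernel decided the TSC wasn't trustworthy on that machine, which is a decent hint not to trust it by hand either.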

OpenBSD actually only implemented this optimization relatively recently. Though most TSCs will be invariant, they still need to be trained across cores, and there are other minutiae (sleep states?) that made it a PITA to implement reliably, and OpenBSD doesn't have as much manpower as Linux. Some of those non-obvious issues would be relevant to someone trying to do this manually, unless they could rely on their specific hardware's behavior.

quotemstr|6 months ago

Wizardly workarounds for broken APIs persist long after those APIs are fixed. People still avoid things like flock(2) because at one time NFS didn't handle file locking well. CLOCK_MONOTONIC_RAW is fine these days with the vDSO.
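A rough way to sanity-check the cost of the plain API is to call it in a loop and average. This is only a sketch: the helper names are mine, it falls back to `time.monotonic_ns()` where `CLOCK_MONOTONIC_RAW` isn't exposed, and in Python the interpreter overhead will dominate whatever the underlying call costs, so treat the number as an upper bound.

```python
import time

# CLOCK_MONOTONIC_RAW is not exposed on every platform.
CLOCK_ID = getattr(time, "CLOCK_MONOTONIC_RAW", None)


def now_ns():
    """Monotonic timestamp in nanoseconds."""
    if CLOCK_ID is not None:
        return time.clock_gettime_ns(CLOCK_ID)
    return time.monotonic_ns()


def mean_call_cost_ns(calls=100_000):
    """Average wall time per now_ns() call, including loop overhead."""
    start = now_ns()
    for _ in range(calls):
        now_ns()
    return (now_ns() - start) / calls


if __name__ == "__main__":
    print(f"~{mean_call_cost_ns():.0f} ns per call (Python overhead included)")
```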

denotational|6 months ago

Sadly GPFS still doesn’t support flock(2), so I still avoid it.

tonyarkles|6 months ago

It was a while ago (2009-10ish) but I ran into an exceptionally interesting performance issue that was partly identified with RDTSC. For a course project in grad school I was measuring the effects of the Python GIL when running multi-threaded Python code on multi-core processors. I expected the overhead/lock contention to get worse as I added threads/cores but the performance fell off a cliff in a way that I hadn't expected. Great outcome for a course project, it made the presentation way more interesting.

The issue ended up being that my multi-threaded code, when running on a single core, pinned that core at 100% CPU usage as expected, but when running across 4 cores it ran each of the 4 cores at 25% usage. This resulted in the clock governor turning the core frequency down from ~2GHz to 900MHz, which made execution slow down far more than the expected lock contention alone would explain. It was a fun mystery to dig into for a while.
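These days you can watch this happen through the Linux cpufreq sysfs files. A minimal sketch, assuming the standard `cpuN/cpufreq` layout (helper names and the `root` parameter are mine, not what I used back then):

```python
from pathlib import Path


def cpufreq_state(cpu=0, root="/sys/devices/system/cpu"):
    """Return (governor, current MHz) for one CPU, or None if unavailable."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    try:
        governor = (base / "scaling_governor").read_text().strip()
        khz = int((base / "scaling_cur_freq").read_text())  # reported in kHz
    except (OSError, ValueError):
        return None
    return governor, khz / 1000.0


def slowdown(nominal_mhz, actual_mhz):
    """How much longer a fixed number of cycles takes at the lower clock."""
    return nominal_mhz / actual_mhz


if __name__ == "__main__":
    print(cpufreq_state())
    print(f"2000 MHz -> 900 MHz: {slowdown(2000, 900):.2f}x slower per cycle")
```

Dropping from 2000MHz to 900MHz alone is roughly a 2.2x slowdown for CPU-bound work, before any lock contention is counted.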

Dylan16807|6 months ago

If you have something newer than a Pentium 4, the rate will be constant.

I'm not sure of the details for when cores end up with different numbers.

triknomeister|6 months ago

TSC is about cycles consumed by a core, not about actual time. So for microbenchmarking it actually makes sense, because there you are often much more interested in CPU benchmarks than in network benchmarks.

ainiriand|6 months ago

You have to benchmark the TSC against a fixed CPU speed, say 1000 MHz; then you have a reliable comparison.
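The arithmetic is simple once the rate is known. A sketch with made-up helper names: at 1000 MHz one tick is exactly one nanosecond, and for any other rate you calibrate by counting ticks over a known wall-clock interval.

```python
NS_PER_SECOND = 1_000_000_000


def ticks_to_ns(ticks, tsc_hz):
    """Convert a TSC tick delta to nanoseconds at a known tick rate."""
    return ticks * NS_PER_SECOND / tsc_hz


def calibrate_hz(ticks_elapsed, ns_elapsed):
    """Estimate the tick rate from ticks counted over a known interval."""
    return ticks_elapsed * NS_PER_SECOND / ns_elapsed


if __name__ == "__main__":
    # At 1000 MHz, 2500 ticks is 2500 ns.
    print(ticks_to_ns(2500, 1_000_000_000))
    # 3,000,000 ticks in 1 ms implies a 3 GHz tick rate.
    print(calibrate_hz(3_000_000, 1_000_000))
```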