Yes, this is why we (Netflix) default to tsc over the xen clocksource. I found the xen clocksource had become a problem a few years ago, quantified using flame graphs, and investigated using my own microbenchmark.
Honestly everyone should be defaulting to the TSC on modern x86. Timekeeping on a single OS image over the short term[1] is a hardware feature available at the ISA level now. It's not something to which the OS can add value, and as we see in circumstances like this it tends to muck things up trying to abstract it.
[1] Long term issues like inter-clock drift and global synchronization are a rather different problem area, and the OS has tools to help there.
Can you share whether you needed to do anything to deal with time drift when using tsc? On my own systems, incorrect timestamps would cause a lot of issues.
Another option is to reduce usage of gettimeofday() when possible. It is not always free.
Roughly 10 years ago, when I was the driver author for one of the first full-speed 10GbE NICs, we'd get complaints from customers that were sure our NIC could not do 10Gb/s, as iperf showed it was limited to 3Gb/s or less. I would ask them to re-try with netperf, and they'd see full bandwidth. I eventually figured out that the complaints were coming from customers running distros without the vdso stuff, and/or running other OSes which (at the time) didn't support that (Mac OS, FreeBSD). It turns out that the difference was that iperf would call gettimeofday() around every socket write to measure bandwidth, but netperf would just issue gettimeofday calls at the start and the end of the benchmark — so iperf was effectively gettimeofday bound. Ugh.
Syscalls in general far too often get treated as if they're as cheap as function calls, with people never profiling to see just how much they can affect throughput.
Apart from gettimeofday() other "favourites" of mine that people are often blind to include apps that do lots of unnecessary stat-ing of files, as well as tiny read()/write()'s instead of buffering in userspace.
The title is misleading. 77% slower sounds like the system calls take 1.77x as long on EC2. In fact, the results indicate that the normal calls are 77% faster - in other words, gettimeofday and clock_gettime take nearly 4.5x as long to run on EC2 as they do on ordinary systems.
This is a big speed hit. Some programs can use gettimeofday extremely frequently - for example, many programs call timing functions when logging, performing sleeps, or even constantly during computations (e.g. to implement a poor-man's computation timeout).
The article suggests changing the time source to tsc as a workaround, but also warns that it could cause unwanted backwards time warps - making it dangerous to use in production. I'd be curious to hear from those who are using it in production how they avoided the "time warp" issue.
> Some programs can use gettimeofday extremely frequently
This is what's usually considered the "root cause" of this problem, though. It's easy enough, if it's your own program, to wrap the OS time APIs to cache the evaluated timestamp for one event-loop (or for a given length of realtime by checking with the TSC.) Most modern interpreters/VM runtimes also do this.
Yeah, like the PHP Xdebug extension. In 5.3 at least, even with it just loaded and nothing enabled, it called gettimeofday() thousands of times and added seconds to web app render times for me (also on Xen with slow gettimeofday).
1) first, by eliminating the need for a context switch for libc calls such as gettimeofday(), gethrtime(), etc. (there is no public/supported interface on Solaris for syscalls, so libc would be used)
2) by providing additional, specific interfaces with certain guarantees: https://docs.oracle.com/cd/E53394_01/html/E54766/get-sec-fro...
This was accomplished by creating a shared page during system startup in which the kernel keeps the time updated. At process exec time that page is mapped into every process address space.
Solaris' libc was of course updated to simply read directly from this memory page. Of course, this is more practical on Solaris because libc and the kernel are tightly integrated, and because system calls are not public interfaces, but this seems greatly preferable to the VDSO mechanism.
This is precisely what the vDSO does. The clocksources mentioned explicitly list themselves as not supporting this action, hence the fallback to a regular system call.
For those actually curious about the implementation on solaris/illumos, heres a quick rundown (from looking at current illumos source):
- comm_page (usr/src/uts/i86pc/ml/comm_page.s) is literally a page in kernel memory with specific variables that is mapped (usr/src/uts/intel/ia32/os/comm_page_util.c) as user|read-only (to be passed to userspace, kernel mapping is normal data, AFAICT)
- the mapped comm_page is inserted into the aux vector at AT_SUN_COMMPAGE (usr/src/uts/common/exec/elf/elf.c)
- libc scans auxv for this entry, and stashes the pointer it contains (usr/src/lib/libc/port/threads/thr.c)
- When clock_gettime is called, it looks at the values in the COMMPAGE (structure is in usr/src/uts/i86pc/sys/comm_page.h, probing in usr/src/lib/commpage/common/cp_main.c) to determine if TSC can be used.
- If TSC is usable, libc uses the information there (a bunch of values) to use tsc to read time (monotonic or realtime)
Variables within comm_page are treated like normal variables and used/updated within the kernel's internal timekeeping.
Essentially, rather than having the kernel provide an entry point & have the kernel know what the (in the linux case) internal data structures look like, here libc provides the code and reads the exported data structure from the kernel.
So it isn't reading the time from this memory page, it's using TSC. In the case of CLOCK_REALTIME, corrections that are applied to TSC are read from this memory page (comm_page).
Nitpick - `77 percent faster` is not the inverse of `77 percent slower`. The line that says `The results of this microbenchmark show that the vDSO method is about 77% faster` should read `446% faster`.
For anyone looking at the mentions of KVM "under some circumstances" having the same issue and wondering how to avoid it with KVM: KVM appears to support fast vDSO-based time calls as long as:
- You have a stable hardware TSC (you can check this in /proc/cpuinfo on the host, but all reasonably recent hardware should support this).
- The host has the host-side bits of the KVM pvclock enabled.
As long as you meet those two conditions, KVM should support fast vDSO-based time calls.
So… it's not that the syscalls themselves are slower, it's that the Linux-specific mechanism the kernel uses to bypass actually performing these calls does not currently work on Xen (and thus EC2).
Interesting way to find out the version of the hypervisor kernel. If the gtod call returns faster than the direct syscall for it, then you know the kernel version is prior to that of the patch fixing the issue in xen.
I expect there are many such patches that you could use to narrow down the version range of the host kernel. Once you have that information, you may be in a better position to exploit it, knowing which bugs are and are not patched.
Does anyone have any intuition around how this affects a variety of typical workflows? I imagine that these two syscalls are disproportionally likely to affect benchmarks more than real-world usage. How many times is this syscall happening on a system doing things like serving HTTP, or running batch jobs, or hosting a database, etc?
Go to your staging environment, use `strace -f -c -p $PID -e trace=clock_gettime` (or don't use -p and just launch the binary directly), replay a bit of production traffic against it, and then interrupt it and check the summary.
HTTP servers typically return a date header, often internally dates are used to figure out expiration and caching, and logging almost always includes dates.
It's incredibly easy to check the numbers of syscalls with strace, so you really should be able to get an intuition fairly easily by just playing around in staging.
Is this just an EC2 problem, or does it affect any Xen/KVM guest?
I ran the test program on a Hyper-V VM running CentOS 7 and got the same result: 100 calls to the gettimeofday syscall. Conversely, I tested a vSphere guest (also running CentOS 7), which didn't call gettimeofday at all.
It depends on a number of things. My DigitalOcean instance has this problem. The virtual machine I spun up on Oracle's Bare Metal Cloud platform doesn't (disclaimer, I work for the team).
You are not mistaken in that the topics are (somewhat) related, they all have to do with time. But setting the TZ environment variable doesn't mean your programs don't execute the syscalls discussed in this article.
This is about the speed of execution of the mentioned syscalls, which will be called regardless of the TZ environment variable, and how vDSO changes that. However, by setting the TZ environment variable you can avoid an additional call to stat as libc tries to determine whether /etc/localtime exists.
How common are get-time calls, really, that they would actually be an issue?
I've worked on quite a few systems and can't think of a time where an API for getting the time would have been called so much that it would affect performance.
Any application code that includes logging of any sort is going to grab time. All of my code (quite a private set and scientific in nature) calls time() at critical points with identifiers so I can easily investigate issues.
brendangregg | 9 years ago
Summarized details here:
https://www.slideshare.net/brendangregg/performance-tuning-e...
tedunangst | 9 years ago
haha, it's amazing how much software is written that basically does something ridiculous like "while (gettimeofday()) clock_gettime();".
klodolph | 9 years ago
4.5x longer = 350% slower.
jdamato | 9 years ago
[1]: https://blog.packagecloud.io/eng/2016/04/05/the-definitive-g...
amluto | 9 years ago
There are patches floating around to support vDSO timing on Xen.
But isn't AWS moving away from Xen or are they just moving away from Xen PV?
octo_t | 9 years ago
This will very likely be calling time-related system calls, especially clock_gettime with CLOCK_MONOTONIC.
officelineback | 9 years ago
Looks like it's how the Xen hypervisor works.
MayeulC | 9 years ago
https://news.ycombinator.com/item?id=13697555
It seems very closely related, unless I am mistaken.
bandrami | 9 years ago
https://lwn.net/Articles/209101/
tyingq | 9 years ago
Apache and nginx for example, both call gettimeofday() a lot.
Edit: Quick google searches indicate software like redis and memcached also call it quite often.