It appears to be fixed in Linux 3.4 [1]. According to the original commit [2] it's been broken since 7dffa3c673fbcf835cd7be80bb4aec8ad3f51168 [3], which appeared in 2.6.26.
So, kernels between 2.6.26 and 3.3 (inclusive) are vulnerable.
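As a rough sketch of that range check (not an authoritative test: distribution kernels often carry backported fixes, and the helper names here are made up for illustration), a release string can be compared against the 2.6.26 to 3.3 window quoted above:

```python
import re

# Range quoted in the comment above: broken since 2.6.26, fixed in 3.4.
FIRST_AFFECTED = (2, 6, 26)
LAST_AFFECTED_SERIES = (3, 3)


def parse_kernel_version(release):
    """Return the leading dotted numeric part of a release string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def in_affected_range(release):
    """True if the upstream version falls between 2.6.26 and 3.3.x inclusive."""
    version = parse_kernel_version(release)
    return FIRST_AFFECTED <= version and version[:2] <= LAST_AFFECTED_SERIES
```

On a running machine the string could come from `platform.release()`; a `True` result only means the upstream version is inside the quoted window, not that the bug will actually trigger.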
Google uses a "leap smear" and slowly accounts for the leap second before it happens.[1] As long as you are not doing any astronomical calculations or constrained by regulatory requirements, I think Google has the right idea.
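To make the smearing idea concrete, here is a minimal sketch of how a time server might spread one extra second over a window instead of inserting it at once. The cosine-shaped ramp and the 20-hour window are assumptions chosen for illustration, not a description of Google's production setup.

```python
import math

SMEAR_WINDOW_SECONDS = 20 * 3600  # assumed window length, for illustration only


def smear_offset(elapsed, window=SMEAR_WINDOW_SECONDS):
    """Fraction of the leap second already absorbed `elapsed` seconds into the window."""
    if elapsed <= 0:
        return 0.0
    if elapsed >= window:
        return 1.0
    # Cosine ramp: starts and ends gently, steepest in the middle of the window.
    return (1.0 - math.cos(math.pi * elapsed / window)) / 2.0
```

A server using this would report its true time adjusted by `smear_offset(...)` seconds, so clients never see a 23:59:60 and never see time step; they only see a clock that runs very slightly slow for the duration of the window.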
As part of Google Compute Engine we provide an NTP server to the guest which is based on Google Production time. As such our VMs get to take advantage of this leap second smearing implementation. I was going to mention this at my talk at IO but forgot.
Not surprising. In spite of all the press saying that Y2K was just a silly waste of money, it's events like these that make me suspect it would have been a much bigger deal if everyone had ignored it and fixed it only after things were shown to break.
A lot of engineers[1] spent a lot of time successfully fixing Y2K bugs.
Because nothing well known blew up, many people wrongly assumed that Y2K was never a real problem to begin with.
[1] I moved a Fortune 100 manufacturing company's database off an ancient mainframe that would've been disastrous come Y2K. It went smoothly and was thus a thankless job. They paid well though (mid six figures - those were the days).
Why does everyone always say Y2K wasn't an issue? I'm sure there were a lot of consultants making too much money for little work; however, _a lot_ of bugs were fixed that would otherwise have caused problems. Because it was taken seriously, stuff was fixed, and that is why the issues didn't happen.
Personally, I fixed 3 Y2K bugs back then; 2 of them would have caused a rather critical business support system to simply crash every time new data arrived.
It seems to be a unique class of bug: not only is it easy to forget to test, and won't ever show up until a particular date... but then it affects everyone!
I can't think of any other kind of bug that never shows up ever, but then affects everyone. Rare bugs tend to stay rare, common bugs tend to get caught before they affect everyone... this is the exception.
From discussion of this same issue in prior threads, my takeaway was
(a) it's really not at all difficult to handle leap seconds, but
(b) the POSIX standard specifically disallows them, by specifying that a day must contain exactly 86400 seconds. (Analogously, imagine if leap days occurred as normal, but a "year" by definition contained exactly 365 days.)
The existence of leap seconds means that it's not possible to simultaneously have (1) system time representing the number of seconds since the epoch, and (2) system time equal to (86400 * number_of_days_since_epoch) + seconds_elapsed_today, and all the proposed methods of dealing with the problem involve preserving (2), which seems worthless to me, and throwing away (1), which I would have thought was a better model.
edit: actual system times may be in units other than seconds, but the point remains
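A small sketch of point (2), assuming the plain 86400-seconds-per-day formula given above (the function name is made up for illustration). It shows why the two properties cannot both hold: the inserted second 23:59:60 gets the same number as the following midnight.

```python
from datetime import date

SECONDS_PER_DAY = 86400
EPOCH_ORDINAL = date(1970, 1, 1).toordinal()


def posix_time(year, month, day, hour, minute, second):
    """(86400 * number_of_days_since_epoch) + seconds_elapsed_today, for a UTC time."""
    days_since_epoch = date(year, month, day).toordinal() - EPOCH_ORDINAL
    seconds_elapsed_today = hour * 3600 + minute * 60 + second
    return SECONDS_PER_DAY * days_since_epoch + seconds_elapsed_today


# The leap second inserted at the end of June 2012, and the midnight after it.
leap_second = posix_time(2012, 6, 30, 23, 59, 60)
next_midnight = posix_time(2012, 7, 1, 0, 0, 0)
```

Both names end up holding the same value, so under model (2) two distinct real seconds share one timestamp, and the count no longer represents elapsed seconds since the epoch.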
In this case, I believe we created a problem we did not have. Leap seconds are a dubious construct from the start, problematic with computers or space travel. We have added only 25 since 1972. Their unpredictability means they will forever be a problem in computing. We should either drop the whole idea or, in the worst case, allow them only every 25 years or so.
Fear the Unix 32-bit time-becomes-negative bugs, in 2037.
We have 25 years to get ready. I still think we'll be patching at the last minute.
(Yeah, lots of systems will be 64-bit by then, but there will still be a lot of embedded crackerbox systems running 32-bit timestamps. It's all the embedded stuff I'm worried about).
It's 2038, not 2037.[1] (Specifically, January 19th, 2038 at 3:14:08am.) And while lots of systems will be 64-bit, many programs still won't be -- and it seems highly likely that this will be a significantly more serious and widespread problem than, say, Y2K or DST. (And certainly more serious than leap seconds, which happen relatively frequently.) Then again, I might be biased: perhaps I'm secretly hoping to spend the years leading up to 2038 paying for my retirement with high-priced consulting gigs to fix it...
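For anyone who wants to check the date, a minimal sketch of the arithmetic, assuming a signed 32-bit count of seconds since 1970-01-01 UTC (the helper names are illustrative):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Reduce an integer to the range of a signed 32-bit two's-complement value."""
    return (value + 2**31) % 2**32 - 2**31


def utc_from_seconds(seconds):
    """Convert seconds since the epoch to an aware UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


last_representable = utc_from_seconds(INT32_MAX)
after_overflow = utc_from_seconds(wrap_int32(INT32_MAX + 1))
```

`last_representable` is 2038-01-19 03:14:07 UTC; one second later the counter wraps negative and `after_overflow` lands on 1901-12-13 20:45:52 UTC.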
If you think that being 64-bit protects you, then you do not understand the problem.
The problem is that 32-bit time is embedded in filesystem representations and related protocols (e.g. the POSIX specification for file times in tar). Therefore even if your machine is 64-bit, it still needs to use 32-bit time for many purposes.
To name a random example, the POSIX specification for times in the tar format is 32-bit. GNU tar has a non-standard extension that already takes care of it. But will everything else that expects to read/write the tar files that a GNU tar program produces implement the same non-standard extension to the format in the same non-standard way? Almost certainly not. And there will be no sign of disaster until the second that we need to start relying on that more precise representation.
SLE9 (kernel 2.6.5-7.325): NOT AFFECTED
SLE10-SP1 (kernel 2.6.16.54-0.2.12): NOT AFFECTED
SLE10-SP2 (kernel 2.6.16.60-0.42.54.1): NOT AFFECTED
SLE10-SP3 (kernel 2.6.16.60-0.83.2): NOT AFFECTED
SLE10-SP4 (kernel 2.6.16.60-0.97.1): NOT AFFECTED
SLE11-GA (kernel 2.6.27.54-0.2.1): VERY UNLIKELY
SLE11-SP1 (kernel 2.6.32.59-0.3.1): VERY UNLIKELY
SLE11-SP2 (kernel 3.0.31-0.9.1): VERY UNLIKELY
Update (06/26/2012): after thorough code review -> SLE9 and SLE10 not affected at all.
Pardon the ignorance if this is a stupid question. I've been looking at some of my hosts and have noticed a message "Clock: inserting leap second 23:59:60 UTC" in dmesg output, but each of the hosts is in the EDT timezone, so I was under the impression that the leap second hadn't been applied yet. So what does that mean? That the systems have applied the leap second successfully, or that they have only received it from their NTP servers?
I was logged on to a couple of CentOS 6 servers when I saw this happen, and on each one the Java processes went absolutely haywire. Everything else seemed to work fine.
I attempted to fix with adjtimex and the script in the linked question, but to no avail, in the end having to restart them all instead. After that, all was good again.
Two days ago while booting, the BIOS time on my eeepc was suddenly reset, with an error message on boot to adjust the time manually. Was just thinking that it may be related?
Our Linux instances running on Amazon EC2 had no issues since we are not running ntpd on these servers and adjtimex returns status as 64 (clock unsynchronized).
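For reading a status value like the 64 above, here is a minimal sketch that decodes the number into flag names. The bit values are the ones I understand Linux's `<sys/timex.h>` to define; only a subset is listed, and the code just decodes integers rather than calling adjtimex itself.

```python
# Subset of adjtimex status bits (values assumed from Linux <sys/timex.h>).
STATUS_FLAGS = {
    0x0001: "STA_PLL",     # PLL updates enabled
    0x0010: "STA_INS",     # insert a leap second at the end of the day
    0x0020: "STA_DEL",     # delete a leap second at the end of the day
    0x0040: "STA_UNSYNC",  # clock unsynchronized
    0x2000: "STA_NANO",    # nanosecond resolution
}
STA_INS = 0x0010


def decode_status(status):
    """Return the names of the known flags set in a status word, lowest bit first."""
    return [name for bit, name in sorted(STATUS_FLAGS.items()) if status & bit]


def clear_insert_flag(status):
    """Return the status word with the pending leap-second insertion bit cleared."""
    return status & ~STA_INS
```

With these definitions `decode_status(64)` gives `['STA_UNSYNC']`, and the 8209 quoted elsewhere in this thread decodes to `['STA_PLL', 'STA_INS', 'STA_NANO']`, with `clear_insert_flag(8209)` returning 8193.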
Stupid question: Why was this not caught? Seems pretty easy to test. Just set the clock to today (or any day with a leap second), and watch what happens.
> Just set the clock to today (or any day with a leap second), and watch what happens.
That won't work. The bug is only triggered when an upstream NTP server reports that a leap second was scheduled. Since leap seconds aren't predictable (and aren't even scheduled very far in advance), just setting the time back to the date of a previous leap second won't do anything.
If really all Linux systems were affected, more than half of the Internet would still be down by now. It could be only a specific combination of kernel/userspace bugs that exists only on some systems.
What sucks a bit is that my VPN (openvpn) was affected too, causing my computer to do a poweroff.
I replaced the poweroff with
ip route add to 192.168.1.0/24 dev lo
hope that saves me when the next leap second occurs.
[+] [-] __david__|13 years ago|reply
[1] https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....
[2] https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....
[3] https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....
[+] [-] moe|13 years ago|reply
Spent the last two hours recovering servers, tomorrow will be another interesting day.
Whoever figured it'd be a good idea to INSERT[1] the leap-second instead of just slowing/accelerating time... <censored>
[1] Clock: inserting leap second 23:59:60 UTC
[+] [-] glawatscheck|13 years ago|reply
[+] [-] doki_pen|13 years ago|reply
[+] [-] dfc|13 years ago|reply
[1] http://googleblog.blogspot.com/2011/09/time-technology-and-l...
[+] [-] jbeda|13 years ago|reply
[+] [-] ChuckMcM|13 years ago|reply
[+] [-] pud|13 years ago|reply
[+] [-] noselasd|13 years ago|reply
[+] [-] duiker101|13 years ago|reply
P.S. for people wanting to know more this video is simple to understand but really amazing http://www.youtube.com/watch?v=xX96xng7sAE
[+] [-] crazygringo|13 years ago|reply
[+] [-] thaumasiotes|13 years ago|reply
[+] [-] mbq|13 years ago|reply
[+] [-] spindritf|13 years ago|reply
[+] [-] zerostar07|13 years ago|reply
Edit: In fact there is strong indication that they may be abolished: http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish...
[+] [-] andreasvc|13 years ago|reply
[+] [-] kabdib|13 years ago|reply
[+] [-] bcantrill|13 years ago|reply
[1] http://en.wikipedia.org/wiki/Year_2038_problem
[+] [-] btilly|13 years ago|reply
[+] [-] kzk_mover|13 years ago|reply
First, you can check the status flag like this.
8209 in binary is "100000000[1]0001"; the bracketed bit (5th LSB) is the INS bit, so it is set. 8193 is the value after clearing the INS bit. Then set that as the current value. Please make sure your ntpd is not running.
[+] [-] MrUnderhill|13 years ago|reply
[+] [-] brongondwana|13 years ago|reply
[+] [-] shaggy|13 years ago|reply
[+] [-] DEinspanjer|13 years ago|reply
[+] [-] piggity|13 years ago|reply
Running on a 3.2 kernel
Rebooted them all and they're fine.
[+] [-] sehugg|13 years ago|reply
[+] [-] kristopher|13 years ago|reply
[+] [-] wiredfool|13 years ago|reply
This fixed it:
[+] [-] politician|13 years ago|reply
[+] [-] mootothemax|13 years ago|reply
[+] [-] cagenut|13 years ago|reply
[+] [-] scottbruin|13 years ago|reply
[+] [-] glawatscheck|13 years ago|reply
stop ntpd, run ntpdate or sntp, start ntpd
/etc/init.d/ntp stop; sntp -s <ntpserver>; /etc/init.d/ntp start
Unfortunately the sntp / ntpdate wrapper is not shipped with squeeze, for example. I've used the binary from SuSE 11.4 just fine on squeeze.
[+] [-] glawatscheck|13 years ago|reply
apt-get install ntpdate; /etc/init.d/ntp stop; ntpdate pool.ntp.org; /etc/init.d/ntp start
[+] [-] raverbashing|13 years ago|reply
My Debian GNU/Linux 6.0 is still standing
Oh well, reading the issue, the machine date is Sat Jun 30 16:11:31 EDT 2012
Stopped ntpd just in case
[+] [-] rbanffy|13 years ago|reply
[+] [-] yaix|13 years ago|reply
[+] [-] sayeed|13 years ago|reply
I think the Xen host takes care of the synchronization and we need not do it in the guest OS. (see http://serverfault.com/questions/100978/do-i-need-to-run-ntp...).
Is this fine or should we run ntpd for better accuracy?
[+] [-] csarva|13 years ago|reply
[+] [-] arohner|13 years ago|reply
[+] [-] duskwuff|13 years ago|reply
[+] [-] cullenking|13 years ago|reply
/etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date;
[+] [-] chmod775|13 years ago|reply