Leap second causing Linux server crashes?

253 points| sathyabhat | 13 years ago |serverfault.com | reply

114 comments

[+] __david__|13 years ago|reply
It appears to be fixed in Linux 3.4 [1]. According to the original commit [2] it's been broken since 7dffa3c673fbcf835cd7be80bb4aec8ad3f51168 [3], which appeared in 2.6.26.

So, kernels between 2.6.26 and 3.3 (inclusive) are vulnerable.

[1] https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....

[2] https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....

[3] https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....

[+] moe|13 years ago|reply
Which, in summary, is pretty much every production kernel out there.

Spent the last two hours recovering servers, tomorrow will be another interesting day.

Whoever figured it'd be a good idea to INSERT[1] the leap-second instead of just slowing/accelerating time... <censored>

[1] Clock: inserting leap second 23:59:60 UTC

[+] glawatscheck|13 years ago|reply
I have 2.6.27 kernels here (SuSE 11.1) which seem unaffected, so the breakage might have been introduced a little later.
[+] dfc|13 years ago|reply
Google uses a "leap smear" and slowly accounts for the leap second before it happens.[1] As long as you are not doing any astronomical calculations or constrained by regulatory requirements I think google has the right idea.

[1] http://googleblog.blogspot.com/2011/09/time-technology-and-l...
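
The general idea of a smear can be sketched in a few lines. This is only an illustration, assuming a simple linear ramp over a hypothetical 20-hour window; the function name, the window length and the ramp shape are assumptions made for this example and are not taken from Google's implementation.

```python
def smear_offset(elapsed_s: float, window_s: float = 20 * 3600.0) -> float:
    """Fraction of the leap second (0.0 to 1.0) already absorbed.

    elapsed_s is the time since the smear window opened; window_s is a
    hypothetical window length, not a value from any real deployment.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    # Clamp so the offset is 0 before the window and 1 after it.
    progress = min(max(elapsed_s / window_s, 0.0), 1.0)
    return progress  # linear ramp: the extra second is spread evenly


if __name__ == "__main__":
    for hours in (0, 5, 10, 20):
        print(hours, smear_offset(hours * 3600.0))
```

A time server using such a scheme would add this offset to the time it reports, so clients never see a 23:59:60 timestamp or a backwards step.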

[+] jbeda|13 years ago|reply
As part of Google Compute Engine we provide an NTP server to the guest which is based on Google Production time. As such our VMs get to take advantage of this leap second smearing implementation. I was going to mention this at my talk at IO but forgot.
[+] ChuckMcM|13 years ago|reply
Not surprising. In spite of all the press saying that Y2K was just a silly waste of money, it's events like these that make me suspect it would have been a much bigger deal if everyone had ignored it and fixed it after things were shown to break.
[+] pud|13 years ago|reply
A lot of engineers[1] spent a lot of time successfully fixing Y2K bugs.

Because nothing well known blew up, many people wrongly assumed that Y2K was never a real problem to begin with.

[1] I moved a Fortune 100 manufacturing company's database off an ancient mainframe that would've been disastrous come Y2K. It went smoothly and was thus a thankless job. They paid well though (mid six figures - those were the days).

[+] noselasd|13 years ago|reply
Why does everyone always say Y2K wasn't an issue? I'm sure there were a lot of consultants making too much money for little work. However, a _lot_ of bug fixes were done that would otherwise have caused problems. Because it was taken seriously, stuff was fixed, and the issues didn't happen because of that.

Personally, I fixed 3 Y2K bugs back then; 2 of them would have caused a rather critical business support system to simply crash every time new data arrived.

[+] duiker101|13 years ago|reply
2012, and we still have problems keeping track of time. This is both fascinating and scary.

P.S. For people wanting to know more, this video is simple to understand but really amazing: http://www.youtube.com/watch?v=xX96xng7sAE

[+] crazygringo|13 years ago|reply
Maybe we always will have problems?

It seems to be a unique class of bug: not only is it easy to forget to test, and it won't ever show up until a particular date... but then it affects everyone!

I can't think of any other kind of bug that never shows up ever, but then affects everyone. Rare bugs tend to stay rare, common bugs tend to get caught before they affect everyone... this is the exception.

[+] thaumasiotes|13 years ago|reply
From discussion of this same issue in prior threads, my takeaway was

(a) it's really not at all difficult to handle leap seconds, but

(b) the POSIX standard specifically disallows them, by specifying that a day must contain exactly 86400 seconds. (Analogously, imagine if leap days occurred as normal, but a "year" by definition contained exactly 365 days.)

The existence of leap seconds means that it's not possible to simultaneously have (1) system time representing the number of seconds since the epoch, and (2) system time equal to (86400 * number_of_days_since_epoch) + seconds_elapsed_today, and all the proposed methods of dealing with the problem involve preserving (2), which seems worthless to me, and throwing away (1), which I would have thought was a better model.

edit: actual system times may be in units other than seconds, but the point remains
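
To make the conflict concrete, here is a small sketch of formula (2). The helper name `posix_time` is made up for this example; it simply counts every day as 86400 seconds, which is the POSIX rule described above.

```python
import datetime

SECONDS_PER_DAY = 86400
_EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()


def posix_time(year, month, day, hour, minute, second):
    """POSIX-style timestamp: every day counts as exactly 86400 seconds."""
    days = datetime.date(year, month, day).toordinal() - _EPOCH_ORDINAL
    return days * SECONDS_PER_DAY + hour * 3600 + minute * 60 + second


leap = posix_time(2012, 6, 30, 23, 59, 60)   # the inserted leap second
after = posix_time(2012, 7, 1, 0, 0, 0)      # the following midnight
print(leap, after, leap == after)
```

Both calls yield the same number, so under rule (2) the leap second and the midnight after it are indistinguishable, and the count of elapsed seconds since the epoch (1) is off by one from then on.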

[+] mbq|13 years ago|reply
It's... worse. We can track time so easily and so well that we decided to screw it up.
[+] spindritf|13 years ago|reply
2012 and we still don't really know what time is. The fact that we can keep track of it, even with all the problems, is quite amazing.
[+] zerostar07|13 years ago|reply
In this case, I believe we created a problem we did not have. Leap seconds are a dubious construct from the start, problematic for computers and space travel. We have added only 25 since 1972. Their unpredictability means they will forever be a problem in computing. We should either drop the whole idea or, at worst, allow them only every 25 years or so.

Edit: In fact there is strong indication that they may be abolished: http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish...

[+] andreasvc|13 years ago|reply
Nice video, but what if we used sidereal time? (i.e., star time, which measures the earth's rotation relative to the stars rather than the sun).
[+] kabdib|13 years ago|reply
Fear the Unix 32-bit time-becomes-negative bugs, in 2037.

We have 25 years to get ready. I still think we'll be patching at the last minute.

(Yeah, lots of systems will be 64-bit by then, but there will still be a lot of embedded crackerbox systems running 32-bit timestamps. It's all the embedded stuff I'm worried about).

[+] bcantrill|13 years ago|reply
It's 2038, not 2037.[1] (Specifically, January 19th, 2038 at 3:14:08am.) And while lots of systems will be 64-bit, many programs still won't be -- and it seems highly likely that this will be a significantly more serious and widespread problem than, say, Y2K or DST. (And certainly more serious than leap seconds, which happen relatively frequently.) Then again, I might be biased: perhaps I'm secretly hoping to spend the years leading up to 2038 paying for my retirement with high-priced consulting gigs to fix it...

[1] http://en.wikipedia.org/wiki/Year_2038_problem
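
The date can be checked with a short sketch, assuming a signed 32-bit counter of seconds since 1970-01-01 UTC. The helper `as_int32` is a name invented for this example; it mimics the two's complement wrap of such a counter.

```python
import ctypes
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_int32(value: int) -> int:
    """Interpret value as a signed 32-bit integer (two's complement wrap)."""
    return ctypes.c_int32(value).value


last_ok = 2**31 - 1               # largest value a signed 32-bit counter holds
wrapped = as_int32(last_ok + 1)   # one second later the counter goes negative

print(EPOCH + timedelta(seconds=last_ok))
print(EPOCH + timedelta(seconds=last_ok + 1))
print(wrapped, EPOCH + timedelta(seconds=wrapped))
```

The second line printed is the moment a wider counter would show; the third is where a wrapped 32-bit counter lands instead, back in December 1901.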

[+] btilly|13 years ago|reply
If you think that being 64-bit protects you, then you do not understand the problem.

The problem is that 32-bit time is embedded in filesystem representations and related protocols. (eg the POSIX specification for file times in tar.) Therefore even if your machine is 64-bit, it still needs to use 32-bit time for many purposes.

To name a random example, the POSIX specification for times in the tar format is 32-bit. GNU tar has a non-standard extension that already takes care of it. But will everything else that expects to read/write tar files that a GNU tar program produces implement the same non-standard extension to the format in the same non-standard way? Almost certainly not. And there will be no sign of disaster until the second that we need to start relying on that more precise representation.

[+] kzk_mover|13 years ago|reply
Now facing this issue... Using the 'adjtimex' command, you can clear the problematic INS bit.

First, you can check the status flags like this.

    $ ./adjtimex --print | grep status
    status: 8209

8209's binary representation looks like this. It does have the INS bit set: "100000000[1]0001" (5th LSB).

    $ ruby -e 'p 8209.to_s(2)'
    "10000000010001"

8193 is the value after clearing the INS bit.

    $ ruby -e 'p 8193.to_s(2)'
    "10000000000001"

Then set it as the current value. Please ensure your ntpd is not running.

    $ adjtimex --status 8193
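
The arithmetic behind 8209 and 8193 can be sketched as a bit mask. The flag values below are the ones defined in the Linux `<sys/timex.h>` header; the helper name `clear_leap_flags` is made up for this example, and the sketch only computes the new value, it does not touch the kernel clock.

```python
# Status-flag values as defined in the Linux <sys/timex.h> header.
STA_PLL = 0x0001   # phase-locked loop updates enabled
STA_INS = 0x0010   # insert a leap second at the end of the day
STA_DEL = 0x0020   # delete a leap second at the end of the day


def clear_leap_flags(status: int) -> int:
    """Return status with the pending leap-second flags masked out."""
    return status & ~(STA_INS | STA_DEL)


status = 8209
print(bin(status), bool(status & STA_INS))  # INS is set in 8209
print(clear_leap_flags(status))             # value to pass to adjtimex --status
```

Computing the value this way avoids hard-coding 8193, which is only correct when the other flags happen to match the example above.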
[+] MrUnderhill|13 years ago|reply
Novell kb: http://www.novell.com/support/kb/doc.php?id=7001865

  SLE9 (kernel 2.6.5-7.325): NOT AFFECTED
  SLE10-SP1 (kernel 2.6.16.54-0.2.12): NOT AFFECTED
  SLE10-SP2 (kernel 2.6.16.60-0.42.54.1): NOT AFFECTED
  SLE10-SP3 (kernel 2.6.16.60-0.83.2): NOT AFFECTED
  SLE10-SP4 (kernel 2.6.16.60-0.97.1): NOT AFFECTED
  SLE11-GA (kernel 2.6.27.54-0.2.1): VERY UNLIKELY
  SLE11-SP1 (kernel 2.6.32.59-0.3.1): VERY UNLIKELY
  SLE11-SP2 (kernel 3.0.31-0.9.1): VERY UNLIKELY

  Update (06/26/2012): after thorough code review -> SLE9 and SLE10 not affected at all.
[+] brongondwana|13 years ago|reply
FYI: I've updated the post with details of the workaround as implemented on our servers.
[+] shaggy|13 years ago|reply
Pardon the ignorance if this is a stupid question. I've been looking at some of my hosts and have noticed a message "Clock: inserting leap second 23:59:60 UTC" in dmesg output, but each of the hosts is in the EDT timezone, so I was under the impression that the leap second hadn't been applied yet. So what does that mean? That the systems have applied the leap second successfully, or have only received it from their NTP servers?
[+] DEinspanjer|13 years ago|reply
The leap second is applied at midnight UTC time, regardless of what timezone the server is in.
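
For a host on EDT, the local wall-clock time of the event can be worked out with a short sketch. It assumes a fixed UTC-4 offset for EDT, and it converts 23:59:59 UTC because Python's `datetime` cannot represent a second value of 60.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for EDT (UTC-4); an assumption for this example.
EDT = timezone(timedelta(hours=-4), "EDT")

# The last ordinary second before the inserted 23:59:60 UTC.
last_normal_second = datetime(2012, 6, 30, 23, 59, 59, tzinfo=timezone.utc)
local = last_normal_second.astimezone(EDT)

print(local.isoformat())
```

So on an EDT machine the extra second falls just before 8 pm local time on June 30, not at local midnight.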
[+] piggity|13 years ago|reply
We just had 100s of EC2 instances generate high (alleged) load. Instances had load averages of 90+ but were responsive.

Running on a 3.2 kernel

Rebooted them all and they're fine.

[+] kristopher|13 years ago|reply
FYI: Our Debian servers did not kernel panic, but system CPU load went through the roof; a quick restart brought levels back to normal.
[+] wiredfool|13 years ago|reply
My Ubuntu 10.04 desktop went to 100% proc and load avg of 20, none of my 10.04 servers or Debian stable servers were affected.

This fixed it:

  date; sudo date `date +"%m%d%H%M%C%y.%S"`; date;
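
The inner `date +"%m%d%H%M%C%y.%S"` prints the current time in the MMDDhhmmCCYY.ss form that `date` accepts for setting the clock, so the command sets the clock to the time it already shows. A sketch of the same string built in Python, where the function name is made up for this example:

```python
from datetime import datetime


def date_set_argument(now: datetime) -> str:
    """Format a time as MMDDhhmmCCYY.ss, the layout used in the command above."""
    # %Y (four-digit year) stands in for the shell's %C%y (century + year).
    return now.strftime("%m%d%H%M%Y.%S")


print(date_set_argument(datetime(2012, 7, 1, 0, 0, 5)))
```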
[+] politician|13 years ago|reply
After reading these tales of woe, all I can say is that I hope the criminal element doesn't start assaulting NTP servers.
[+] mootothemax|13 years ago|reply
I was logged on to a couple of CentOS 6 servers when I saw this happen, and on each one the Java processes went absolutely haywire. Everything else seemed to work fine.

I attempted to fix with adjtimex and the script in the linked question, but to no avail, in the end having to restart them all instead. After that, all was good again.

[+] cagenut|13 years ago|reply
I just had the exact same experience.
[+] scottbruin|13 years ago|reply
Had the same issue across all our VMs running Java/Tomcat applications.
[+] glawatscheck|13 years ago|reply
POSTMORTEM fix for CPU-eating ksoftirqd threads without rebooting:

stop ntpd, run ntpdate or sntp, start ntpd

/etc/init.d/ntp stop; sntp -s <ntpserver>; /etc/init.d/ntp start

Unfortunately the sntp / ntpdate wrapper is not shipped with squeeze, for example. I've used the binary from SuSE 11.4 just fine on squeeze.

[+] glawatscheck|13 years ago|reply
OK this is how it works on squeeze etc.:

apt-get install ntpdate; /etc/init.d/ntp stop; ntpdate pool.ntp.org; /etc/init.d/ntp start

[+] raverbashing|13 years ago|reply
Ouch!

My Debian GNU/Linux 6.0 is still standing

Oh well, reading the issue, the machine date is Sat Jun 30 16:11:31 EDT 2012

Stopped ntpd just in case

[+] rbanffy|13 years ago|reply
Same here. Set ntp to restart in 12 hours.
[+] yaix|13 years ago|reply
Two days ago while booting, the BIOS time on my eeepc was suddenly reset, with an error message on boot to adjust the time manually. Was just thinking that it may be related?
[+] sayeed|13 years ago|reply
Our Linux instances running on Amazon EC2 had no issues since we are not running ntpd on these servers and adjtimex returns status as 64 (clock unsynchronized).

I think the Xen host takes care of the synchronization and we need not do it in the guest OS. (see http://serverfault.com/questions/100978/do-i-need-to-run-ntp...).

Is this fine or should we run ntpd for better accuracy?

[+] csarva|13 years ago|reply
Yes. This issue notwithstanding, you should be running ntpd.
[+] arohner|13 years ago|reply
Stupid question: Why was this not caught? Seems pretty easy to test. Just set the clock to today (or any day with a leap second), and watch what happens.
[+] duskwuff|13 years ago|reply
> Just set the clock to today (or any day with a leap second), and watch what happens.

That won't work. The bug is only triggered when an upstream NTP server reports that a leap second was scheduled. Since leap seconds aren't predictable (and aren't even scheduled very far in advance), just setting the time back to the date of a previous leap second won't do anything.
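
The announcement travels in the NTP packet header: the top two bits of the first byte are the leap indicator (LI) field defined in RFC 5905. A minimal sketch of decoding it, with function and table names invented for this example:

```python
# Leap indicator values as defined in RFC 5905.
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock unsynchronized)",
}


def leap_indicator(first_byte: int) -> int:
    """Extract the 2-bit LI field from the first byte of an NTP header."""
    return (first_byte >> 6) & 0b11


# 0x64 = LI 1, version 4, mode 4 (server); 0x24 is the same with LI 0.
for header_byte in (0x24, 0x64):
    li = leap_indicator(header_byte)
    print(hex(header_byte), li, LEAP_MEANINGS[li])
```

A test that only changes the local date never produces LI = 1, which is why the affected code path stays idle until a real announcement arrives.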

[+] cullenking|13 years ago|reply
On Debian, I was able to fix the issue (the load issue specifically) with this command:

/etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date;

[+] chmod775|13 years ago|reply
If really all Linux systems were affected, more than half of the Internet would still be down by now. It could be only a specific combination of kernel/userspace bugs that exists only on some systems.

What sucks a bit is that my VPN (openvpn) was affected too, causing my computer to power off. I replaced the poweroff with

ip route add to 192.168.1.0/24 dev lo

hope that saves me when the next leap second occurs.