
CPU reliability – Linus Torvalds (2007)

203 points | semicolondev | 12 years ago | yarchive.net

95 comments

[+] zxcdw|12 years ago|reply
I don't work in an environment where I get to deal with hardware failures, so pardon my ignorance, but has anyone actually seen a CPU fail during normal operation? I am under the impression that it is very rare for a CPU itself to fail badly enough that it needs to be replaced.

The only times I've even heard about failing CPUs have been when they've been overclocked or insufficiently cooled (add in overvolting and you get both :)), or physically damaged while mounting/unmounting or otherwise handling the hardware. And even then the failure has usually been somewhere other than the CPU itself.

Of course I am not saying it'd be unheard of, but for me, frankly, right now it is.

[+] ChuckMcM|12 years ago|reply
" has anyone seen a failed CPU piece which has failed during normal operation?"

Several. But I'm lucky in that I worked for NetApp for 5 years, which has several million filers in the field that all call home when they have issues, and for Google, which has a very large number of CPUs all around the planet doing its bidding. With visibility into a population like that, you see once-in-a-billion faults happen about once a month :-).

There are two general kinds of failures, though. The more common one is a system machine check (the internal logic detects a fault condition and puts the CPU into the machine-check state), which happens when 3 or more bits go sideways in the various RAMs inside the mesh of execution units. Those RAMs are nominally ECC protected, so it can detect, but not correct, multi-bit errors. Power it off, power it on, restart from a known condition, and it's good as new.

The more rare occurrence is that something in the CPU fails which results in the CPU not coming out of RESET even, or immediately going into a machine check state. When you find those Intel often wants you to send it back to them so they can do failure analysis on it. The most common root cause analysis for those is some moderate electrostatic damage which took a while to finally finish the process of failing.

Some of the more interesting papers at ISSCC are on the lifetime expectancy of small-geometry transistors, which are a lot more susceptible to damage and disruption from cosmic rays and other environmental agents.
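
To make the "detect but not correct" behaviour concrete, here is a toy SECDED (single-error-correct, double-error-detect) sketch in Python. It is the same idea, at tiny scale, as the ECC words protecting CPU-internal RAMs; real designs use much wider words and stronger codes, and nothing below is any vendor's actual circuit:

    # Hamming(7,4) plus an overall parity bit: one flipped bit is fixed
    # silently, two flipped bits raise an uncorrectable error, which is
    # machine-check territory. Positions are 1-indexed; Hamming parity
    # bits sit at positions 1, 2 and 4.

    def encode(nibble):
        """nibble: list of 4 data bits -> 8-bit codeword."""
        code = [0] * 8              # code[1..7] = Hamming(7,4); code[0] = overall parity
        for pos, bit in zip((3, 5, 6, 7), nibble):
            code[pos] = bit         # data bits live at non-power-of-two positions
        for p in (1, 2, 4):         # parity bit p covers every position with bit p set
            code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
        code[0] = sum(code[1:]) % 2 # overall parity over the 7 Hamming bits
        return code

    def decode(code):
        syndrome = 0
        for p in (1, 2, 4):
            if sum(code[i] for i in range(1, 8) if i & p) % 2:
                syndrome |= p       # a failing check names the error position
        parity_ok = sum(code) % 2 == 0
        if syndrome == 0:
            return "ok" if parity_ok else "error in overall parity bit only"
        if not parity_ok:           # single-bit error: syndrome points at it
            code[syndrome] ^= 1
            return "corrected"
        return "UNCORRECTABLE"      # two flips: detectable, not correctable

    word = encode([1, 0, 1, 1])
    one_flip = list(word); one_flip[3] ^= 1
    two_flips = list(word); two_flips[3] ^= 1; two_flips[6] ^= 1
    print(decode(one_flip))         # corrected
    print(decode(two_flips))        # UNCORRECTABLE -> machine check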

[+] rodgerd|12 years ago|reply
From one of my personal machines:

    [14865975.000023] Machine check events logged
    MCE 0
    CPU 0 BANK 2
    ADDR 1438280
    TIME 1384859595 Wed Nov 20 00:13:15 2013
    STATUS d40040000000011a MCGSTATUS 0
    MCGCAP 104 APICID 0 SOCKETID 0
    CPUID Vendor AMD Family 6 Model 8

This will, of course, be getting replaced shortly, preferably before it does any real damage. Given it's a 2002-era chip, 11 years of service isn't exactly terrible.

I've seen quite a few UltraSPARC chips (especially IIIs) go over the years at work, and often had a shit of a time trying to get Sun to accept them as faulty and replace them.

[+] tonyarkles|12 years ago|reply
I had a Celeron that had its cache go bad once. It would work "fine" in Windows, but Linux would report that the CPU was throwing some kind of exception. If you went into the BIOS and disabled the cache it would be stable, but with the cache on it would crash after a day or two. Swapped out the CPU and the machine lived a great life for a long time.
[+] dntrkv|12 years ago|reply
What a crazy coincidence. I was actually just helping a friend troubleshoot his dead PC, and I told him to test his RAM, video card, motherboard, and power supply, in that order, and not to bother with the CPU since I had never had one fail on me. It ended up being the CPU that went out. I go on Hacker News a couple hours later and see this post. Heh.
[+] aimhb|12 years ago|reply
It was posted here on HN a few days ago, but you might find the DEFCON talk on single-bit errors in domain names relevant: https://www.youtube.com/watch?v=ZPbyDSvGasw

In summary, some data centers are run hotter than recommended, which leads to a lot of mostly-ignored single-bit errors in domain resolution, which in turn is a security risk.

[+] vanderZwan|12 years ago|reply
My dad had a laptop which would not boot unless he put it in the fridge first for half an hour. As long as he didn't reboot everything then worked "fine". Does that count?
[+] valarauca1|12 years ago|reply
I had this problem once when overclocking an AMD Phenom. The short story (I don't know the full cause) is that the on-board crypto units stopped being random.

That wasn't a real problem for 'some' day-to-day use. This was in the mid-to-late 00s, so HTTPS wasn't quite everywhere yet.

The problem manifested slowly. Whenever I'd connect over HTTPS, my browser would crash. My sound card would phone home for an update, and my computer would crash. Certain games would randomly crash whenever the anti-cheat software attempted to run.

It was just odd, and took a few days of hunting to find out what was actually going wrong.

[+] naner|12 years ago|reply
Yes, CPUs can fail just like any other hardware component. On desktop systems the most common case is that you'll try to boot the system and just be presented with blank video or a beep code. On server systems with multiple CPUs there will usually be an error reported via a blinking light or the little info LCD on the front of the system. In some cases the damage is actually visible on the CPU (e.g. discoloration of some of the gold contact points on the bottom). In my experience, CPUs under normal operation fail less frequently than most other components.
[+] zhemao|12 years ago|reply
I think the MTBF is generally longer than people would normally go without replacing their CPU. Also, CPUs are generally designed to degrade more gracefully. For instance, they may have circuitry that scales the frequency down as delays get longer. And in multicore CPUs there are generally some spare cores that can be swapped in if a previously in-use core breaks.
[+] cnvogel|12 years ago|reply
A machine here at work logged cache-related machine check exceptions at a rate of roughly one per day, but not regularly or deterministically. They were not related to load or temperature, and persisted even after clocking lower than spec'ed. Changing the CPU fixed it.

Those were correctable errors; prime95 and memtest did not detect anything.

[+] grumps|12 years ago|reply
I used to work in situations where we had to account for failures. We had lab equipment that would just run all the time, with terrible wiring, and we were horrible to it too. We left covers off, piled stuff on top of it, and just Frankensteined the hell out of all of it. I even managed to flash a new OS/app at the same time as a power failure and it still lived...

Failures were more prominent in memory... but they did happen. We also sent equipment through environmental testing that would force failures. I don't recall hearing of any CPU failures, although most of our equipment was DSP and FPGA based, with only some tiny li'l CPUs in there.

[+] gnoway|12 years ago|reply
I have seen it happen one time that I can remember, where I was sure it was the CPU. We had 8 dual-socket 5400-era Xeon servers in a VMware cluster. Whenever 64-bit Windows 2008+ virtual machines were started on, or vMotioned to, one of the hosts, they would bluescreen. We had not experienced this behavior at all, and then one day we did. I replaced both CPUs and the problem disappeared. I have to assume it was one of the CPUs.

It's entirely possible that they overheated, but if they did it was due to poor cooling; we did not overvolt or overclock these machines.

[+] zurn|12 years ago|reply
It's not easy to get a CPU chip failure reliably diagnosed in the field. Even if you manage to do the trial-and-error component-swapping dance and it points to the CPU, you don't get very good confidence. It might be that the new CPU taxes the power feed less, or there was misapplied thermal paste, or a bad contact in the pins, etc.
[+] cafard|12 years ago|reply
A couple of times: Sparc chips that did some notable damage on their way out. One was pretty old, one not so much.
[+] Shorel|12 years ago|reply
I saw a 386 let the magic blue smoke out.
[+] Taniwha|12 years ago|reply
Not even mentioned here is metastability: signals that cross clock domains within traditional clocked logic, where the clocks are not carefully organized to be multiples of each other, can end up being sampled just as they change. The result is a value inside a flip-flop that's neither a 1 nor a 0 - sometimes an analog value somewhere in between, sometimes an oscillating mess at some unknown frequency. Worst case, this bad value propagates into the chip causing havoc, a buzzing mess of chaos.

In the real world this doesn't happen very often, and there are techniques to mitigate it when it does (usually at a performance or latency cost). CPU cores are probably safe - they're all on one clock - but display controllers, networking, anything that touches the real world has to synchronize with it.

For example, I was involved in designing a PC graphics chip in the mid '90s. We did the calculations around metastability (we had 3 clock domains and 2 crossings) and found that our chip would suffer a metastability event (which might be as simple as a burble on one frame of a screen, or a complete breakdown) about once every 70 years. We decided we could live with that, as they were running on Win95 systems - no one would ever notice.

Everyone who designs real-world systems should be doing that math. More than one clock domain is a no-no in life-support-rated systems - your pacemaker, for example.
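
The math in question is usually the standard first-order synchronizer model, MTBF = exp(t_slack / tau) / (Tw * f_clk * f_data). A sketch with invented device parameters (not the actual numbers from the chip above), chosen so the result lands near the 70-year figure:

    from math import exp

    tau     = 50e-12    # metastability resolution time constant (s) - assumed
    Tw      = 100e-12   # window in which sampling can go metastable (s) - assumed
    f_clk   = 100e6     # sampling clock (Hz) - assumed
    f_data  = 10e6      # rate of asynchronous input transitions (Hz) - assumed
    t_slack = 1.65e-9   # settling time the design leaves the flip-flop (s)

    mtbf_s = exp(t_slack / tau) / (Tw * f_clk * f_data)
    print(f"MTBF ~ {mtbf_s / 3.156e7:.0f} years")   # ~68 years with these numbers

The exponential is the key design lever: every extra tau of settling slack multiplies the MTBF by e, which is why adding one more synchronizer stage buys so much.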

[+] caf|12 years ago|reply
If a failure mode is likely to happen once every 70 chip-years of operation, then if you sold a few hundred thousand chips wouldn't you expect several instances of that failure mode to occur across the population every day?
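
Spelled out, taking 300,000 chips as an invented round number for "a few hundred thousand":

    chips = 300_000
    mtbf_years = 70
    per_day = chips / (mtbf_years * 365)        # population failure rate = N / MTBF
    print(f"{per_day:.1f} expected failures per day")   # ~11.7
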
[+] elwell|12 years ago|reply
This field really interests me.
[+] pedrocr|12 years ago|reply
It would be awesome if companies like Google would calculate MTBF statistics on components. They've done it for disks and it would be great to extend it to CPUs and memory modules. They're probably in a better position than even Intel to calculate these things with precision.
[+] bcoates|12 years ago|reply
Here's one for RAM in servers, from Google: http://news.cnet.com/8301-30685_3-10370026-264.html

Found it here, which also goes into some testing the Guild Wars guys did on their population of gamer PCs: http://www.codeofhonor.com/blog/whose-bug-is-this-anyway (scroll down to "Your computer is broken"; around 1% of the systems they tested failed a CPU-to-RAM consistency stress test).

Both of them indicate intermittently defective components in running systems are way more common than anybody assumes.
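
For flavor, here is a toy version of that kind of consistency test (the real Guild Wars test ran compiled code in a tight loop; this sketch only exercises the RAM write/read-back path, and all names are made up):

    import random

    def stress(mib=64, rounds=4, seed=1234):
        n = mib * 1024 * 1024
        buf = bytearray(n)
        for r in range(rounds):
            rng = random.Random(seed + r)
            pattern = bytes(rng.randrange(256) for _ in range(4096))
            for off in range(0, n, 4096):
                buf[off:off + 4096] = pattern       # fill with known data
            for off in range(0, n, 4096):
                if buf[off:off + 4096] != pattern:  # read back and verify
                    return f"mismatch at offset {off}, round {r}"
        return "no errors detected"

    print(stress())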

[+] rkangel|12 years ago|reply
They'd have to be careful with how they quoted the numbers though. As Linus accurately points out, MTBF varies wildly depending on the usage pattern. If you want to quote it in a unit of time, e.g. "years", then you have to specify the usage the part has been under, which will be very different for a server part compared to a desktop part. You could quote it per instruction or equivalent, I suppose, taking into account how hard the component is used, but even that isn't perfect.
[+] zebra|12 years ago|reply
I'm almost sure that components without moving parts will become technologically obsolete long before they start to fail. When I buy a used laptop I always replace the HDD and the DVD drive, and its reliability jumps up sharply.
[+] wging|12 years ago|reply
They might very well do this already. But it'd seem they're disincentivized to make this information public... do they publish the disk information?
[+] rdtsc|12 years ago|reply
There is an interesting anecdote Joe Armstrong likes to tell about people who claim they've built a reliable or fault-tolerant service. They say "This is fault tolerant, there are multiple hard drives in there, I have done formal verification of my code, and so on..." and then someone trips over the power cord, and that's the end of the fault tolerance. It is a silly example - of course they'd provide proper power to an important rack of hardware - but the point is that, in the simplest case, the system is only as fault tolerant as its weakest component. It is that one bad capacitor from Taiwan that might take the whole thing down, or just a silly cosmic ray.

One needs redundant hardware to provide real guarantees about the service being up. This means load balancers, multiple CPUs running the same code in parallel and comparing results, running on separate power buses, in different data centers, in different parts of the world.
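
A toy sketch of the "run the same code in parallel and compare" idea - triple modular redundancy with majority voting. The workload function is a made-up stand-in; a real deployment would also diversify hardware, power, and location:

    from collections import Counter
    from multiprocessing import Pool

    def compute(x):
        return x * x + 1            # stand-in for the replicated workload

    def vote(results):
        value, count = Counter(results).most_common(1)[0]
        if count < 2:               # no majority: fail loudly
            raise RuntimeError("all replicas disagree")
        return value                # a single bad replica is outvoted

    if __name__ == "__main__":
        with Pool(3) as pool:                       # three replicas
            results = pool.map(compute, [42] * 3)   # same input to each
        print(vote(results))                        # 1765 when they agree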

[+] shurcooL|12 years ago|reply

  > different parts of the world.
Still takes just one asteroid.
[+] sytelus|12 years ago|reply
If MTBF is such a big issue, would it ever be possible to build a spacecraft that travels between the stars and retains the ability to communicate? I guess hats off to the designers of Voyager and the other spacecraft whose MTBF has crossed 36+ years for many components, including the CPU and power supply. But for interstellar craft that MTBF seems VERY low. And, seriously, an MTBF of 5 years seems like a joke for a desktop when a lot of mechanical components with moving parts actually last longer.
[+] nwh|12 years ago|reply
Spacecraft and rovers use ridiculously armoured, redundant systems to get past the fact that they would otherwise fail quite regularly in such a hostile environment. The Curiosity rover uses what would normally be quite an outdated 132 MHz CPU that's been specially shielded to achieve the reliability the program needs; even then there are two redundant systems that do health checks on one another to catch bit flips. Even with all of that, they're running on only one CPU and trying to diagnose why the first one failed.

It's probably not fair to compare the MTBF of specialised hardware to the $35 CPU I bought at the retailer down the street, either; the RAD750 processors in Curiosity cost almost a quarter of a million dollars each.

http://en.wikipedia.org/wiki/Comparison_of_embedded_computer...

http://en.wikipedia.org/wiki/Curiosity_rover#Specifications

http://en.wikipedia.org/wiki/Radiation_hardening#Radiation-h...

Though that said, Voyager is still happily running on its 8064 words of 16-bit RAM, which is something.

[+] eck|12 years ago|reply
He's talking about desktop/server CPUs where people care about performance. If you don't care so much about performance, you can increase the transistor sizes, reduce the clock speed, and achieve totally insane MTBF... as space-rated hardware tends to do. Kind of like how server CPUs are underclocked to increase MTBF, but more so.
[+] hkmurakami|12 years ago|reply
That makes me wonder... Linus refers to this as well, but how much of the 36+ years can be attributed to the components actually being turned off?

Also, I'd imagine that spacecraft components are of an entirely different category than the off-the-shelf computing variety.

[+] greenyoda|12 years ago|reply
Spacecraft are not built out of the same grade of components as consumer and commercial hardware.
[+] seiji|12 years ago|reply
You can fake whole system reliability by incorporating redundant internal systems.
[+] raverbashing|12 years ago|reply
(Conventional) solid-state devices are very hard to make fail - the exception being flash memory.

Apart from electron migration issues and failures from excess voltage or temperature, they're pretty long-lasting.

It's much easier to have a failure because of something else: capacitors failing, oxidation, or mechanical failure (for example, from thermal expansion/contraction).

I've seen people complaining about a dead CPU but I can't find it right now

[+] soundsop|12 years ago|reply
You are correct. I want to clarify that the failure process is electromigration, not electron migration: it is caused by electrons, but it is the ions in the metal that migrate. Wikipedia has a good description: https://en.wikipedia.org/wiki/Electromigration.

I design integrated circuits, and one of the constraints in selecting the width of wires is making sure that the maximum current density stays below the electromigration threshold.
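
The lifetime model usually used for this is Black's equation. A sketch with illustrative constants (the prefactor, exponent, and activation energy below are ballpark assumptions, not from any real process):

    from math import exp

    # Black's equation for electromigration lifetime:
    #   MTTF = A * J**(-n) * exp(Ea / (k * T))
    k  = 8.617e-5   # Boltzmann constant, eV/K
    A  = 1.0        # process-dependent prefactor (arbitrary units) - assumed
    n  = 2.0        # current-density exponent, typically ~1-2 - assumed
    Ea = 0.9        # activation energy, eV (ballpark for Cu interconnect)

    def mttf(J, T):
        """J: current density (arbitrary units), T: temperature (K)."""
        return A * J ** (-n) * exp(Ea / (k * T))

    # Halving wire width at fixed current doubles J; with n = 2 the
    # expected lifetime drops to a quarter:
    print(mttf(2.0, 358.0) / mttf(1.0, 358.0))   # 0.25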

[+] AnonNo15|12 years ago|reply
I'd like to throw in my experience: I was in charge of 300+ x86 rack servers and around 50 desktops for 3 years and never saw a single CPU fail, not even old Pentium 4s with dusty fans.

Disk failures are very common, followed by much rarer RAM and motherboard failures.

I suspect server chips are rated for a 10-15 year average lifespan.

[+] synthos|12 years ago|reply
Soft errors are a very real property of low-voltage digital electronics. I personally observed what could only realistically be explained as a soft error in a unit running customer hardware in the field. A single bit had flipped in the program memory of the embedded application and was causing the system to malfunction in an obvious and repeatable manner. We've since added CRC checking of the program memory and some of the static data sections to flag this and reset in the future.
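
A minimal sketch of that mitigation, assuming a CRC stored at build/flash time (all names here are illustrative, not the actual firmware):

    import zlib

    def crc_of(image: bytes) -> int:
        return zlib.crc32(image) & 0xFFFFFFFF

    flash_image = bytes(range(256)) * 1024   # stand-in for program memory
    stored_crc = crc_of(flash_image)         # computed when flashing

    def periodic_check(memory: bytes) -> None:
        if crc_of(memory) != stored_crc:     # a flipped bit changes the CRC
            raise SystemExit("program memory corrupted - resetting")

    corrupted = bytearray(flash_image)
    corrupted[12345] ^= 0x04                 # simulate a single-bit flip
    periodic_check(flash_image)              # passes silently
    periodic_check(bytes(corrupted))         # trips the reset path
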
[+] dspeyer|12 years ago|reply
It doesn't seem worth it for Intel to measure MTBF. By the time they got good numbers for a specific chip, they'd be trying to sell its successor.
[+] gilgoomesh|12 years ago|reply
Long-term failure rates are not usually measured in real time but in deliberately heat-elevated environments, which simulate many years of stress in a few months. This work is essential to ensure that design decisions don't accidentally cause chips to fail after 2 years (which might be outside the warranty lifetime but would still result in class-action lawsuits and horrible publicity).
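
The usual scaling rule behind those oven tests is the Arrhenius acceleration factor. A sketch with an assumed activation energy and temperatures:

    from math import exp

    # AF = exp((Ea / k) * (1/T_use - 1/T_stress)), temperatures in kelvin
    k  = 8.617e-5                    # Boltzmann constant, eV/K
    Ea = 0.7                         # assumed activation energy, eV
    T_use, T_stress = 328.0, 398.0   # 55 C in the field vs 125 C in the oven

    af = exp((Ea / k) * (1 / T_use - 1 / T_stress))
    print(f"~{af:.0f}x acceleration")   # ~78x: 3 oven-months ~ 19 field-years
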
[+] Zardoz84|12 years ago|reply
I can say that the Z80 in my ZX Spectrum has kept working since 1984... and an old K6-2 300 was still working as of last year...
[+] mvanveen|12 years ago|reply
My immediate reaction is to ask how this reliability characteristic of CPUs affects critical software applications. Certainly some space missions and medical devices out in the field must have surpassed the MTBF mark for their given CPU deployment.
[+] jokoon|12 years ago|reply
I've always wondered about this: do transistors wear out over time?

Does that mean a CPU/RAM/GPU will not perform as well as when it was brand new?

[+] csmuk|12 years ago|reply
Never had a CPU go on me.

RAM yes, PROMs yes, CMOS batteries yes, PSUs yes, drives yes.

They're probably the most reliable bit of a computer.

[+] leokun|12 years ago|reply
Nice thing about the cloud is that someone else is worrying about this for you.