top | item 10598629

To ECC or Not to ECC

96 points | shritesh | 10 years ago | blog.codinghorror.com

52 comments

[+] mehrdada|10 years ago|reply
Please note that the main purpose of ECC is not to reduce the RAM error rate and make memory look more reliable, but to help the system stop the process when an unrecoverable memory error occurs, as opposed to propagating it and producing unpredictable outcomes. The change in the effect of failure is what matters most, not its probability. Without ECC, there's often no clear way to tell whether the result of a computation is valid or garbage that should be discarded.

(Of course, in extreme scenarios, like at Google scale, even ECC can fail to fail-stop due to multi-bit errors, but in almost all non-pathological scenarios, SECDED[1] is enough to catch the erroneous cases.)

[1]: http://cr.yp.to/hardware/ecc.html

[+] sireat|10 years ago|reply
Exactly, you want to know when the error is due to memory.

Intel deciding that consumers (including those buying Haswell-E CPUs) do not need ECC really irks me. Textbook market segmentation from a near-monopoly.

Currently you cannot have your cake and eat it too:

You cannot have the best single-thread performance (offered by overclocking Haswell-E series or Skylake 6700k) and have ECC.

So if you're building the ultimate workstation, you have a hard choice: do you go with the X99 chipset (no ECC, but you can overclock), or do you go with server motherboards built on the C610 chipset, which are quite limited as far as consumer interests go?

The Intel mobile Xeons are interesting, as they now provide an avenue for ECC on a laptop.

[+] graycat|10 years ago|reply
Good thoughts. Thanks.

Two questions:

(1) IIRC, some operating systems, on seeing ECC errors (maybe just the uncorrectable ones, or maybe the correctable ones too), would mark the associated memory, or block of memory, as faulty, perhaps stopping the application using that memory, and continue on. Is this done by current operating systems?

(2) What would Windows Server do with a thread, process, address space or whatever the heck that encountered a memory error detected by ECC, especially one that was uncorrectable?

I'm eager to know since I'm eager to build a server, with ECC memory, and run Windows Server in production.

[+] trav4225|10 years ago|reply
Yeah, agreed. I found myself wondering why he kept referring to "reliability". Perhaps he defines it to include data integrity, but it didn't come across that way to me. I kept thinking to myself "nobody I know uses ECC just to increase reliability".
[+] acqq|10 years ago|reply
I believe that Google, from the beginning, implemented their own checksum codes in their software to regularly verify the data processed or communicated on their non-ECC computers. I doubt the open-source software the article's author uses does the same.
[+] PhantomGremlin|10 years ago|reply
I know that Jeff is a demigod to some people, but I interpret this article as: "As a software guy, I don't really understand why I need this fancy hardware, so this can't be important". IMO he's wrong.

The margins between working and non-working DRAM these days are extremely small. E.g. Rowhammer demonstrated that even user-space programs could readily obliterate main memory, without even trying very hard to do so.[1]

But, maybe in this case he's right. It's not like "open source Internet forum software" is anything that's mission critical. If there's an occasional garble in a character or two, will the latte-swilling hipsters even notice? :-)

Just like the original Google servers he points to. Who cared if they occasionally screwed up search results because they didn't have ECC memory? Overall the experience was still 100x better than using something like Altavista.

[1] https://en.wikipedia.org/wiki/Row_hammer

[+] theandrewbailey|10 years ago|reply
What Jeff is trying to say is: if ECC is so desperately needed to prevent memory errors that are supposedly happening all the time, why isn't ECC in every computer everywhere?
[+] dogma1138|10 years ago|reply
Just to be clear, SECDED ECC doesn't protect you against Rowhammer and similar memory-disturbance attacks.

DDR4 implemented some mitigations against such attacks, as well as some additional soft ECC mechanisms, but as these types of attacks are fairly new, it's not quite clear yet how effective they are.

[+] teddyh|10 years ago|reply
We once had a new server, all new hardware, which had weird problems and kept crashing mysteriously. Memory tests showed no errors, so we were all tearing our hair out. We took the server offline and set it to test continuously – still no errors. After running Memtest86 on nothing but test #4 for about a day or so, a few memory errors finally showed up. We replaced the memory, the problem was gone, and the server started working.

Memory errors are especially insidious given how common they are. ECC is worth it.

[+] rwmj|10 years ago|reply
I remember circa 1999 having a database server which had a stuck bit in memory. The bit happened to be placed in the page cache, so it subtly corrupted disk writes resulting in the database throwing checksum errors. It took an insane amount of time to even diagnose where the problem could be. We of course thought it was the disks themselves and tried many variations of disks and external RAID cards. Finally, one run of memtest86 found the real problem, and I threw away the memory and motherboard and replaced it with one capable of ECC RAM.

I forget now why we even thought to build a server without ECC RAM, but I sure learned my lesson after that.

[+] beachstartup|10 years ago|reply
i wouldn't even call a machine without ecc a server or workstation. more like a consumer device that's been given a job it can't do.
[+] tzs|10 years ago|reply
I tried to catch soft errors for about a year on a couple of Linux boxes I had. They were both desktop form factor machines, one being used as a home server and one as a desktop at work.

I had a background process [1] on each that simply allocated a 128 MB buffer, filled it with a known data pattern, and then went into an infinite loop that slept a while, woke up and checked the integrity of the buffer, and if any of the data had changed logged the change and restored the data pattern.
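The linked pastebin [1] has the actual program; a rough Python sketch of that scheme (the pattern value, buffer size, and check interval here are illustrative, not taken from the original) might look like:

```python
import time

PATTERN = 0xA5  # alternating-bit fill pattern (an assumed choice, for illustration)

def scrub_once(buf, pattern=PATTERN):
    """Compare every byte of buf to the pattern; repair and report any flips."""
    flips = []
    for i, b in enumerate(buf):
        if b != pattern:
            flips.append((i, b))
            buf[i] = pattern         # restore so the same flip isn't counted twice
    return flips

def monitor(size_mb=128, interval_s=60):
    """Allocate the test buffer, then sleep/wake/verify forever."""
    buf = bytearray([PATTERN]) * (size_mb * 1024 * 1024)
    while True:
        time.sleep(interval_s)
        for offset, value in scrub_once(buf):
            print(f"bit flip at offset {offset}: found 0x{value:02x}")
```

In a compiled language the same check loop can be optimised away entirely, so the buffer usually has to be read through a `volatile` pointer.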

Based on the error rates I'd seen published, I expected to catch a few errors. For example, using the rate that Tomte's comment [2] cites I think I'd expect about 6 errors a year.

I never caught an error.

I also have two desktops with ECC (a 2008 Mac Pro and a 2009 Mac Pro). I've used the 2008 Mac Pro every working day since I bought it in 2008, and the 2009 Mac Pro every day since I bought it in 2009. Neither of them has ever reported correcting an error.

I have no idea why I have not been able to see an error.

[1] http://pastebin.com/Bv56kVwC

[2] https://news.ycombinator.com/item?id=10600308

[+] Ono-Sendai|10 years ago|reply
Did you check the resulting (dis)assembly? If you compile with optimisations, the reads from (and maybe the writes to) the RAM buffer may be optimised away.
[+] marcosdumay|10 years ago|reply
As soon as you have a power fluctuation, an air-conditioning malfunction, or a few dirt-caused short circuits, you'll get enough errors to converge on the published average.

Just wait, and relax. You'll get there eventually.

[+] yuhong|10 years ago|reply
That is normal, of course, and the published error rates are measured over large amounts of RAM, I think.
[+] Tomte|10 years ago|reply
IEC 61508 documents an estimate of 700 to 1200 FIT/Mbit (FIT = "failures in time", i.e. failures per 10^9 hours of operation) and gives the following sources:

a) Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130 nm SRAM. J-L. Autran, P. Roche, C. Sudre et al. Nuclear Science, IEEE Transactions on Volume 54, Issue 4, Aug. 2007 Page(s):1002 - 1009

b) Radiation-Induced Soft Errors in Advanced Semiconductor Technologies, Robert C. Baumann, Fellow, IEEE, IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY, VOL. 5, NO. 3, SEPTEMBER 2005

c) Soft errors' impact on system reliability, Ritesh Mastipuram and Edwin C Wee, Cypress Semiconductor, 2004

d) Trends And Challenges In VLSI Circuit Reliability, C. Constantinescu, Intel, 2003, IEEE Computer Society

e) Basic mechanisms and modeling of single-event upset in digital microelectronics, P. E. Dodd and L. W. Massengill, IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 583–602, Jun. 2003.

f) Destructive single-event effects in semiconductor devices and ICs, F. W. Sexton, IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 603–621, Jun. 2003.

g) Coming Challenges in Microarchitecture and Architecture, Ronen, Mendelson, Proceedings of the IEEE, Volume 89, Issue 3, Mar 2001 Page(s):325 – 340

h) Scaling and Technology Issues for Soft Error Rates, A Johnston, 4th Annual Research Conference on Reliability Stanford University, October 2000

i) International Technology Roadmap for Semiconductors (ITRS), several papers.

If that's correct, the math is simple: you have bit flips in your PC about once a day.

It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
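As a back-of-the-envelope check on that claim, here's a small Python sketch (8 GB is an assumed desktop RAM size; the FIT range is the IEC 61508 estimate quoted above):

```python
def flips_per_day(ram_gb, fit_per_mbit):
    """Expected bit flips per day for a given RAM size and FIT/Mbit soft-error rate.
    1 FIT = 1 failure per 10**9 device-hours."""
    ram_mbit = ram_gb * 1024 * 8          # GB -> Mbit
    return ram_mbit * fit_per_mbit / 1e9 * 24

# IEC 61508's 700-1200 FIT/Mbit range, applied to 8 GB of RAM:
for rate in (700, 1200):
    print(f"{rate} FIT/Mbit -> {flips_per_day(8, rate):.1f} flips/day")
```

That works out to roughly 1.1 flips per day at the low end of the range and about 1.9 at the high end – consistent with "about once a day".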

[+] mehrdada|10 years ago|reply
> It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.

Also, most modern processors use ECC for their caches (even when the main memory is non-ECC) and they serve the vast majority of memory requests, so it is unlikely that intermediate values in a tight computation are affected by non-ECC RAM. That adds to the "silentness" aspect of the bit flip in consumer systems.

[+] cushychicken|10 years ago|reply
These things do happen with reasonable frequency. I used to work at a division of a major memory manufacturer that dealt with writing tests to find DIMMs exhibiting these sorts of failures - the semiconductor industry calls them "variable retention time" failures. (Aside: numerous PhDs in the field of semiconductor physics have built prosperous careers trying to understand why these soft failures happen. Short answer: we have some theories, but we don't really know.) It was provably worth millions of dollars to be able to screen for this phenomenon, because a Google or an Apple or an IBM would return a whole manufacturing lot of your bleeding-edge, high-margin DIMMs if they found one bit error in one chip of one lot. Each lot was shipping for millions and millions of dollars.
[+] CrLf|10 years ago|reply
Anyone who has managed even a modest number of servers with ECC RAM for a reasonable amount of time has surely seen ECC events in their hardware logs. Most of these are one-time errors that never happen again on the same server, ever.

Without ECC these errors would have unknown consequences. They could happen in some unused region of memory, or they could happen in a dirty page in the filesystem cache. It's not fun to discover that your filesystem has been silently corrupted an unknown time after the fact.

Maybe Google doesn't need ECC. Their data is duplicated across several machines and it's extremely unlikely that a few corrupt servers would lead to any data loss.

However, on a smaller scale (and just like RAID) it's cheaper to have ECC than add more servers for extra redundancy.

[+] sebcat|10 years ago|reply
What he's saying is essentially "The code I write/the platform I choose scales poorly over multiple cores. Therefore I decide to blame the hardware, and skip features that are good for me"

People need to adapt to a world where we have more cores instead of faster execution per core. You can't compare late 90's growth in execution speed per core with the situation we have today.

Write software for an environment where the number of cores scale, instead of an environment where the execution speed of a single core is more important.

[+] ketralnis|10 years ago|reply
> What he's saying is essentially "The code I write/the platform I choose scales poorly over multiple cores. Therefore I decide to blame the hardware, and skip features that are good for me"

Is that so bad? He's writing and hosting the code, and he's paying the bill to do it. Seems to me he should be able to pick how to do it.

[+] vox_mollis|10 years ago|reply
This cannot possibly be right. There was a DEF CON 21 talk on DNS request misfires due to bit flips in non-ECC DRAM, and the researcher was able to collect a surprisingly large number of misdirected requests on this basis.

Edit: found it: https://www.youtube.com/watch?v=ZPbyDSvGasw

[+] ketralnis|10 years ago|reply
Importantly, those DNS packets go through a number of systems that are not clients or servers. Wifi, microwave antennae, undersea cables, consumer routers, unpowered hubs, you name it. It's hard to know whether these bit flips are actually coming from cosmic rays or EM interference or rare decompression bugs.
[+] Animats|10 years ago|reply
If soft errors are rare, parity checking, without correction, might be more useful. It's better to have a server fail hard than make errors. In a "cloud" service, the systems are already in place to handle a hard failure and move the load to another machine. Unambiguous hardware failure detection is exactly what you want.
[+] mehrdada|10 years ago|reply
In practice, you basically get one-bit error correction "for free" when you have enough redundancy to detect two-bit soft errors. Simple parity can only detect one bit flip, so if you want to catch two-bit errors, you might as well correct the one-bit errors you find along the way at no extra cost.
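To make the "free" correction concrete, here is a minimal extended-Hamming(8,4) SECDED sketch in Python (a toy over 4 data bits; real DRAM ECC uses wider codes such as (72,64), but the syndrome logic is the same idea):

```python
def hamming84_encode(data):
    """Encode 4 data bits into an extended Hamming(8,4) SECDED codeword.
    Layout: [p0, p1, p2, d1, p3, d2, d3, d4], where p0 is overall parity."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4            # covers Hamming positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4            # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4            # covers positions 4,5,6,7
    word = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for b in word:
        p0 ^= b                  # overall parity over the 7-bit Hamming word
    return [p0] + word

def hamming84_decode(word):
    """Decode a SECDED codeword. Returns (status, data_bits)."""
    w = list(word)
    s1 = w[1] ^ w[3] ^ w[5] ^ w[7]
    s2 = w[2] ^ w[3] ^ w[6] ^ w[7]
    s3 = w[4] ^ w[5] ^ w[6] ^ w[7]
    syndrome = s1 + 2 * s2 + 4 * s3   # points at the flipped Hamming position
    overall = 0
    for b in w:
        overall ^= b                  # 0 if overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                # odd number of flips: single error, fixable
        w[syndrome] ^= 1              # syndrome 0 means p0 itself flipped
        status = "corrected"
    else:                             # syndrome set but parity holds: two flips
        return "double_error", None
    return status, [w[3], w[5], w[6], w[7]]
```

A non-zero syndrome with failed overall parity pinpoints (and fixes) a single flipped bit; a non-zero syndrome with intact overall parity can only mean two flips, which is reported as uncorrectable – exactly the fail-stop behaviour discussed upthread.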
[+] scurvy|10 years ago|reply
I don't think that data corruption was a huge issue for Google back then (really early on). Corrupt data? Big whoop. Re-index the internet in another X hours, and it's gone. I doubt they had much persistent storage as most of their data was transient and well, the Internet.

Also, I still see "fire hazard" when I look at the early Google racks. No idea how Equinix let them get away with it. Too much ivory tower going on there. Not enough "you know we're liable if we burn down the colo with that crap, right?"

[+] upofadown|10 years ago|reply
There is no extra chance of a short circuit upstream of the power supply. Downstream of the power supply, the power is limited, either by explicit current limiting or simply because they are switching power supplies, where transformer saturation limits the power.

So you could have a PCB fire, but PCBs are made to be flame retardant. You could have a wire insulation fire, but the amount of material would be so low that it wouldn't be able to start a fire anywhere else.

So I am basically saying there isn't really anything there that could sustain a fire and that there isn't a lot of energy to start ignition in the first place.

[+] devit|10 years ago|reply
The article is wrong.

The Xeon E3-1270 v5 runs at 3.6-4.0 GHz and costs only about 10% more than the i7-6700 (3.4-4.0 GHz).

Also, the Xeon E3-1230 v5 runs at 3.4-3.8 GHz (same base clock) and costs less than the Core i7-6700.

In general, you should never buy non-Xeon CPUs if you have the choice, both for desktop and for servers, since ECC memory is essential if you don't want to have a significant chance of having to replace your RAM after discovering mysterious problems with your system.

[+] venomsnake|10 years ago|reply
Isn't it a simple enough calculation:

Will someone die if the data gets corrupted? No? Then non-ECC should be enough. And you should have checksums everywhere anyway.

[+] jo909|10 years ago|reply
Where do you create that checksum? If it's on a computer without ECC, you will just checksum the data including the error, then write that data, error and all, to disk.

What happens to the data after you have read it into memory and successfully verified the checksum? You'll probably process it in memory, and have no way of knowing afterwards whether the changes are due to your code or due to errors.

Of course you could now propose to also checksum and verify the data while it is in memory – which is basically what ECC does, in hardware, for cheap, requiring no CPU cycles.
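A tiny Python illustration of the first point (the record contents and the flipped bit are made up):

```python
import zlib

# A record is corrupted in RAM *before* the checksum is computed...
record = bytearray(b"important record")
record[3] ^= 0x40                      # simulated bit flip: 'o' becomes '/'

# ...so the checksum faithfully covers the already-corrupt data,
checksum = zlib.crc32(bytes(record))

# ...and later verification passes without complaint.
assert zlib.crc32(bytes(record)) == checksum
```

The CRC is computed after the flip, so it "validates" the corrupt bytes; only a check that spans the moment of corruption – which is what hardware ECC effectively provides continuously – could have caught it.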