Why use ECC? (2015)

159 points | vsgherzi | 1 year ago | danluu.com

137 comments

[+] kimixa|1 year ago|reply
I wish there was a hard requirement for ECC. As a developer working on GPU drivers, I see a huge number of reported issues that just... don't make sense? One-offs with slightly different symptoms, memory dumps of nonsense, just nowhere to start rooting out the cause of an issue. Even on "widely reported" issues that make it to reddit and similar.

Probably not surprising, there's a naturally antagonistic relationship between Performance and Reliability here, and it's clear which way many of those "enthusiast" forums lean.

I haven't got actual numbers, but I feel that most [0] of the issues I start looking at just can never be reproduced, or don't even make sense from the backtrace or similar. I can't say it's 100% hardware issues, as many games are a little... loose... with reliability if it works "well enough", and they interact heavily with the code and data we work on, so they might also be a source of "impossible" issues. But even on straightforward code paths, with no weird OS interaction, no allocation, nothing async etc., "impossible" states happen pretty regularly.

I would love there to be enough ECC-using gamers out there to statistically see if it makes a difference.

[0] Most in terms of number of different issues, not total reports of the same issue. That's dominated by one or two things, normally around the latest game or update doing something dumb :P

[+] jrockway|1 year ago|reply
Not having ECC is the biggest scam in computing. Ever hear of "bitrot"? That's memory errors that have been saved to disk. We have made millions of people lose their data so that servers can be artificially more expensive.

Intel was responsible for most of this. It is hard to be sad when seeing how they've lost the market lead.

[+] hi-v-rocknroll|1 year ago|reply
I would wager 99.99% of bitrot is silent corruption that goes unnoticed until it affects something particularly important. Without integrity and error correction in all paths along a processor, storage hierarchy, and network paths and at rest, there's no way to prove a system will ever remain reliable.
[+] cornholio|1 year ago|reply
I wonder if it's practical to do ECC on 64 bit words instead of bytes. A 13% price increase (or capacity drop for the same DRAM price) is substantial and might justify the penny pinching; 1.5% is negligible if it leads to a similar stability increase as standard ECC DRAM. If you are often getting more than one bit flip per 64 bit word, then that RAM is garbage anyway.
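
For reference, a rough sketch of how the check-bit cost scales with word size. It assumes a plain Hamming SEC-DED code (single-error correction, double-error detection); real memory controllers may use different codes, and `secded_check_bits` is a name made up for this example. Standard 72-bit ECC DIMMs already carry 8 check bits per 64 data bits, which is where the roughly 13% comes from.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming SEC-DED code over `data_bits` data bits."""
    r = 0
    # Single-error correction needs 2^r >= data_bits + r + 1 (Hamming bound).
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


for k in (8, 64, 512):
    c = secded_check_bits(k)
    print(f"{k:4d} data bits: {c:2d} check bits, {100 * c / k:.1f}% overhead")
```

Under this assumption, going much below 12.5% means protecting words wider than 64 bits, which in turn means reading more data per access just to check it.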
[+] ajross|1 year ago|reply
> Intel was responsible for most of this.

Only in the sense of "Intel is responsible for most of computation" ... No one uses ECC pervasively anywhere, that's sort of the point of the article.

[+] jan_Sate|1 year ago|reply
Huh? I thought that "bitrot" was like content saved into the disk and the disk left unpowered for an extended period of time causing data loss. And I thought that the content stored on the disk has ECC on its own?
[+] transpute|1 year ago|reply
PC Engines $150 APU2 (RIP) shipped with 4GB ECC RAM and AMD Embedded CPU. Since it was a headless device used mostly for 1GbE networking, the RAM was throttled and relatively impervious to Rowhammer.

QNAP has a $600 1U short-depth (11") 4x3.5 2xM.2 2x10GbE 2x2.5GbE 4-32GB DDR4 SODIMM Arm NAS [1] that would benefit from OSS community attention. Based on a Marvell/Armada CN9130 SoC which supports ECC, has mainline Linux support, and public-but-non-upstream code for uboot [2]. With local serial console and a bit of effort, the QNAP OS can be replaced by Arm Debian/Devuan with ZFS. Rare combo of low power, small size, fast network, ECC memory and upstream-friendly Linux. QNAP also sell a 10GbE router based on the same SoC.

Ryzen Pro (OEM) can support ECC [3].

[1] https://www.qnap.com/en-us/product/ts-435xeu

[2] https://solidrun.atlassian.net/wiki/spaces/developer/pages/3...

[3] https://www.tomshardware.com/pc-components/cpus/amd-confirms...

[+] TheAmazingRace|1 year ago|reply
So I have to say, ECC memory is definitely something we should not have gotten away from for consumer hardware. My current PC, which is rocking a Core i9 14900k (pray for me) and an ASUS W680M ACE SE motherboard, allowed me to install some 5600MHz speed DDR5 ECC memory, and it works flawlessly.

The only downside in my view is the cost. Unbuffered ECC and the cost of using a workstation class chipset really pushes this into luxury territory. Plus, I'm never too sure what Intel's future plans are for successor processors and chipsets, which is why I settled on W680. I don't really want to go full blown Xeon.

[+] Sweepi|1 year ago|reply
A more significant downside than cost is speed: your DDR5-5600 ECC most likely has a latency of about 16ns, while DDR5-7000 non-ECC at 12ns (10ns if you are only interested in Column Strobe) is available for your platform, offering 25% more bandwidth and 25% - 40% lower latency.

Don't be fooled (like me) by the DDR5-6000/6400/6800 ECC registered modules: all desktop motherboards only support unbuffered modules, and most don't even support DDR5-5600 ECC, only DDR5-4800/5200 ECC.

[+] pixelpoet|1 year ago|reply
14900k might be one of the least reliable CPUs in recent times, pairing it with ECC seems almost ironic!
[+] pseudalopex|1 year ago|reply
> Unbuffered ECC and the cost of using a workstation class chipset really pushes this into luxury territory. Plus, I'm never too sure what Intel's future plans are for successor processors and chipsets, which is why I settled on W680.

Some cheap AMD motherboards support ECC. But the future is unknown. Ryzen 8000 CPUs don't.

[+] summerlight|1 year ago|reply
https://discourse.codinghorror.com/t/to-ecc-or-not-to-ecc/37...

Interestingly, Jeff Atwood has changed his mind on ECC memory.

[+] pixelpoet|1 year ago|reply
Even more interestingly in the comment below, apparently you can just flick on ECC for RTX 4090 cards!

Extra weird is Nvidia singling out ray tracing as a use case which shouldn't use ECC... I suppose it's no biggie if a single ray goes the wrong way down the BVH, out of trillions.

[+] sufehmi|1 year ago|reply
As soon as I worked with high-load servers in the 90s, it was already clear that ECC should be the default.

Intel's marketing ploy on ECC is very destructive and has cost many parties a lot of wasted time, money & resources to handle problems caused by non-ECC memory; and Linus Torvalds is absolutely right in roasting them for this.

[+] magicalhippo|1 year ago|reply
Memory corruption can have very different impacts. A sample of decoded music getting corrupted leads to a small glitch, maybe even inaudible. An instruction in executable code getting corrupted can lead to all sorts of havoc.

Since ECC is seemingly not becoming mandatory, I've been wishing CPUs would support "soft-ECC". That is, the OS could mark certain pages as needing "soft-ECC", and the CPU would then store (at least) three copies of that page in RAM. When reading such pages back from RAM the CPU would read all physical copies and compare. If the majority agrees it can use that, otherwise raise an error.

This could then be used for executable pages and important configuration data which occupies relatively few pages, and where integrity matters a lot more than speed.
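
The read-side vote could look roughly like the sketch below. It is a software illustration only, not how any CPU actually does it: `read_with_vote` and `SoftEccError` are made-up names, and it votes per byte across the copies for simplicity.

```python
from collections import Counter


class SoftEccError(Exception):
    """Raised when no value has a strict majority at some offset."""


def read_with_vote(copies: list[bytes]) -> bytes:
    """Return the byte-wise majority of the copies, or raise if there is none."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("need at least one copy, all of the same length")
    out = bytearray()
    for offset, column in enumerate(zip(*copies)):
        value, votes = Counter(column).most_common(1)[0]
        if votes * 2 <= len(copies):  # most common value is not a strict majority
            raise SoftEccError(f"no majority at offset {offset}")
        out.append(value)
    return bytes(out)


page = bytes(range(16))
flipped = bytearray(page)
flipped[3] ^= 0x04  # simulate a single bit flip in one of the three copies
assert read_with_vote([page, bytes(flipped), page]) == page
```

With three copies this corrects any corruption confined to one copy, at the cost of tripling both the memory used and the reads needed for those pages.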

There's probably some good reasons why this is non-trivial to implement, I've forgotten most of what I learned about the virtual memory implementation in CPUs. But a man can dream...

[+] HideousKojima|1 year ago|reply
The triple reading/writing to memory along with the comparing would probably be a significant performance hit. You could just use a bit of extra memory for parity bits etc. instead
[+] grog454|1 year ago|reply
How does the OS know which pages to mark?
[+] Animats|1 year ago|reply
ECC memory should have a price premium of only 9/8 - 1, or 12.5%. It costs more than that, because it's "enterprise".
[+] thfuran|1 year ago|reply
Probably slightly more than just the increase in memory modules, since there's also the extra complexity of actually checking/reporting, but roughly.
[+] wmf|1 year ago|reply
It's now 25% for DDR5.
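
A quick sketch of where the two figures come from, assuming the common module layouts: a 72-bit DDR4 ECC DIMM (64 data + 8 check bits) and an 80-bit DDR5 ECC DIMM (two subchannels of 32 data + 8 check bits each). As I understand it, some DDR5 ECC modules use a narrower 72-bit layout instead, so treat the widths below as assumptions.

```python
def ecc_overhead(data_bits: int, total_bits: int) -> float:
    """Extra DRAM needed for check bits, as a fraction of the data width."""
    return (total_bits - data_bits) / data_bits


# Assumed per-DIMM widths: (data bits, total bits including check bits).
layouts = {
    "DDR4 ECC, 64 data + 8 check": (64, 72),
    "DDR5 ECC, 2 x (32 data + 8 check)": (64, 80),
}
for name, (data, total) in layouts.items():
    print(f"{name}: {ecc_overhead(data, total):.1%}")
```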
[+] ido|1 year ago|reply
True, as is explicitly mentioned in the article.
[+] snvzz|1 year ago|reply
ECC should be a requirement.

The FCC could just not allow computers to ship without it.

CPU makers like Intel and AMD could simply have their CPUs not work with non-ECC RAM.

Microsoft could e.g. require ECC RAM for Windows 12.

It is insanity that most computers shipping today do not use ECC and are thus unreliable.

With luck they'll crash, but most likely they will fail silently, while corrupting data.

[+] Nerada|1 year ago|reply
DDR5 comes with on-die ECC. My understanding is this only checks errors occurring within the RAM itself, not errors that occur during transmission to and from RAM.

My question is, how common are transmission errors compared to errors happening within RAM?

[+] pclmulqdq|1 year ago|reply
On-die ECC is so they can give you a memory array with a few faults. It's a yield enhancement not an introduction of ECC as you think of it.

Adding protocol-level ECC on top only helps, although it is somewhat inefficient.

[+] gjjydfhgd|1 year ago|reply
Another problem with on-die ECC is the lack of reporting.

You have no idea if you have tons of errors and how many were corrected.
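
For comparison, with conventional (side-band) ECC on Linux, the kernel's EDAC subsystem usually exposes corrected and uncorrected error counts in sysfs. A minimal sketch for reading them, assuming an EDAC driver for your memory controller is loaded; `read_edac_counts` is a made-up helper name, and the `root` parameter exists only so it can be pointed at a test directory.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict[str, dict[str, int]]:
    """Return {controller: {"ce": corrected, "ue": uncorrected}} from EDAC sysfs."""
    counts: dict[str, dict[str, int]] = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC is not available on this system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            counts[mc.name] = {
                "ce": int((mc / "ce_count").read_text()),
                "ue": int((mc / "ue_count").read_text()),
            }
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found")
    for name, c in found.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

An empty result does not prove ECC is off, only that nothing is being reported through EDAC.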

[+] geerlingguy|1 year ago|reply
LPDDR4/4X has also had on-die ECC for a while (at least the chips I'm used to, like in the Raspberry Pi); with such small lithography it's basically required to get the RAM to work reliably.
[+] Sweepi|1 year ago|reply
I would love to put ECC in my desktop computers, however it's more expensive (ok), is not officially supported on most desktop motherboards (and in reality does not work in "ECC-Mode" on the majority of them), and finally: ECC RAM available to purchase is painfully slow, in both bandwidth (:/) and latency (://)
[+] sph|1 year ago|reply
Please, I'd love someone to tell me how to find and buy computers that support ECC. I am looking to buy a NUC/mini-server, and they basically all sell with non-ECC RAM. Last time I asked on this forum, I was told that on Intel, only Xeon processors support ECC, while all modern (?) AMD CPUs support it. Elsewhere I read that what matters is that the mobo supports it. I have no idea how to go about it.

So, let me ask again. If I were to buy a NUC new or off eBay, how can I be 100% sure it works with ECC RAM without having to spend half an hour researching CPU, mobo and BIOS specs for every single product I come across?

If I had a budget in the thousands, I would go with a Xeon server that comes with ECC pre-installed. I don't and have modest needs. I only want to splurge on ECC RAM to replace the original sticks.

(No "you don't need ECC for a NUC" reply please. That is not the point of my question, yet it is a far too common response)

[+] adrian_b|1 year ago|reply
I have never seen any true NUC-like computer that supports ECC SODIMMs. Intel has also used the NUC brand for a much larger computer that supported laptop Xeon CPUs, but that line has been abandoned.

There have been some NUC-like computers from ASRock industrial, Supermicro and others, with either Tiger Lake or Elkhart Lake CPUs, where you could enable in BIOS the so-called in-band ECC.

All these models are obsolete. Moreover, in-band ECC is an ugly and inefficient workaround. It can be used with soldered LPDDR memories, which do not have ECC variants, but it has worse performance than standard ECC. It is not cheaper, because it diminishes the memory capacity in the same ratio as any ECC and it requires a greater die area inside the CPU for its implementation (including a dedicated cache memory for the ECC bits).

There are many mini-ITX motherboards that support ECC (but you must check carefully the specifications, even if they are for AMD CPUs). For a smaller size than mini-ITX, there are 2 choices, either expensive industrial single-board computers, which usually have the 3.5" form factor of the PCB, or one of the so-called mobile workstation laptops from Dell, HP or Lenovo, e.g. a Dell Precision mobile workstation, which are also much more expensive than an equivalent NUC-like computer.

So, if a low price is desired and an up-to-date fast CPU, you cannot have ECC in form factors smaller than mini-ITX. If paying double or triple is not a problem, there are solutions.

If you want a preassembled small computer with ECC and a mini-ITX motherboard, there are some at companies like ASRock Rack or Supermicro, but they are much more expensive than if you get the best components and you assemble them yourself.

[+] user_7832|1 year ago|reply
Recent AMD processors need to be AMD PRO series to support ECC. Motherboard support is also required. On Intel's side I think standard Xeon type commercial boards very often support it. Unfortunately you’ll likely need to ask around when buying to ensure the board supports it. If you can, getting a mini ATX known-good mobo in a small case may be easier.
[+] petronio|1 year ago|reply
I was on the NUC search a while ago and I'm not sure you can. Although the AMD motherboards may not support ECC, I haven't heard of any that actually don't. Best bet is probably to buy a recent, barebones AMD NUC, and buy the ECC RAM yourself. Sometimes they'll advertise ECC support as well.
[+] P_I_Staker|1 year ago|reply
keep reading. books will set you free
[+] eadmund|1 year ago|reply
What’s the best price/performance for a home lab server running Linux with ECC these days? Bonus points if it is rackable.

Sadly, my go-to Linux hardware manufacturers either don’t offer ECC RAM, or only offer it as an option on their absolute top-end machines. Yes, yes, the extra two thousand dollars for a machine with a six-year lifespan probably is worth it on a monthly basis, but man it still hurts.

[+] adql|1 year ago|reply
> What’s the best price/performance for a home lab server running Linux with ECC these days? Bonus points if it is rackable.

Old used enterprise server. None of them will be great at power/performance in typical (i.e. mostly idle) home use tho. Intel ones usually far better here

[+] NorwegianDude|1 year ago|reply
I recently(ish) built a new home server using a cheap AM5 motherboard from ASUS that supports ECC. Good performance, and power usage is around 45 W idle with a couple of SSDs and a couple of HDDs spinning.

Not the cheapest, but I wanted to keep power consumption low for noise and reduced heating while still having good performance if needed.

I also considered a motherboard with IPMI on AM5(Asrock rack), but that was much more expensive.

Worked out quite nicely.

[+] Palomides|1 year ago|reply
put something together with used supermicro parts, maybe with a H11SSL-i or H12SSL motherboard and epyc cpu

or whatever dell 730 or something fits your budget

[+] BlueTemplar|1 year ago|reply
> From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.
[+] oskarkk|1 year ago|reply
> For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for next generation DRAM and things continue to shrink.

Nice. That got me curious, how many electrons are in today's DRAM capacitor? I tried searching but haven't found any recent info.

[+] dvt|1 year ago|reply
I tried building an old rig (maybe ~7 years ago or so) using ECC RAM (since I was running two Xeons). It was such a pain in the butt to get it to boot and find sticks that were compatible with each other, don't really want to go down that path again.
[+] danparsonson|1 year ago|reply
I built my most recent desktop using ECC and it was a breeze, so maybe you were just unlucky?
[+] _factor|1 year ago|reply
The cheap sticks work if you don’t mind buying a bunch and sending back the ones that don’t work together in your config.

Binning is likely the problem.

[+] Fnoord|1 year ago|reply
For me it was a one shot with a Xeon, no issues whatsoever. Any decent price comparison is able to filter on ECC memory.
[+] bjoli|1 year ago|reply
I just chugged 128gb of ddr4 LRDIMMs into my old xeon server. It worked flawlessly.
[+] nextaccountic|1 year ago|reply
Why not ECC CPUs and GPUs? They can be hit by cosmic rays too.
[+] nottorp|1 year ago|reply
I believe the first part could make for the start of a great 'if Google does it it doesn't mean it's good for you' article...
[+] forty|1 year ago|reply
I read somewhere that DDR5 has some kind of internal ECC mechanism even for non-ECC sticks, is that right? Does it make ECC less relevant?