I wish there was a hard requirement for ECC. As a developer working on GPU drivers, I see a huge number of reported issues that just... don't make sense? One-offs with slightly different symptoms, memory dumps of nonsense, just nowhere to start rooting out the cause of an issue. Even on "widely reported" issues that make it to Reddit and similar.
Probably not surprising, there's a naturally antagonistic relationship between Performance and Reliability here, and it's clear which way many of those "enthusiast" forums lean.
I haven't got actual numbers, but I feel that most [0] of the issues I start looking at just can never be reproduced, or don't even make sense from the backtrace or similar. I can't say it's 100% hardware issues, as many games are a little... loose... with reliability if it works "well enough", and they interact heavily with code and data we work on, so they might also be a source of "impossible" issues. But even on straightforward code paths, with no weird OS interaction, no allocation, nothing async etc., "impossible" states happen pretty regularly.
I would love there to be enough ECC-using gamers out there to statistically see if it makes a difference.
[0] Most in terms of number of different issues, not total reports of the same issue. That's dominated by one or two things, normally around the latest game or update doing something dumb :P
Not having ECC is the biggest scam in computing. Ever hear of "bitrot"? That's memory errors that have been saved to disk. We have made millions of people lose their data so that servers can be artificially more expensive.
Intel was responsible for most of this. It is hard to be sad when seeing how they've lost the market lead.
I would wager 99.99% of bitrot is silent corruption that goes unnoticed until it affects something particularly important. Without integrity and error correction in all paths along a processor, storage hierarchy, and network paths and at rest, there's no way to prove a system will ever remain reliable.
I wonder if it's practical to do ECC on 64-bit words instead of bytes. A 13% price increase (or capacity drop for the same DRAM price) is substantial and might justify the penny-pinching; 1.5% is negligible if it leads to a similar stability increase as standard ECC DRAM. If you are often getting more than one bit flip per 64-bit word, then that RAM is garbage anyway.
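For reference, here is a rough sketch of the overhead arithmetic. It assumes a Hamming code extended with one overall parity bit (SECDED), which as far as I know is what the classic 72-bit-wide ECC DIMM applies to each 64-bit word. The function name is made up for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a single-error-correct, double-error-detect Hamming code."""
    r = 0
    # Single-error correction needs 2^r >= data_bits + r + 1.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


for k in (8, 64):
    c = secded_check_bits(k)
    print(f"{k} data bits: {c} check bits, {100 * c / k:.1f}% overhead")
```

Under that assumption the roughly 13% figure is already the per-64-bit-word cost (8 check bits per 64 data bits), and protecting individual bytes the same way would cost far more. A single parity bit per 64-bit word would be about 1.6%, but it can only detect an odd number of flips, not correct them.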
Huh? I thought that "bitrot" was like content saved into the disk and the disk left unpowered for an extended period of time causing data loss. And I thought that the content stored on the disk has ECC on its own?
PC Engines $150 APU2 (RIP) shipped with 4GB ECC RAM and AMD Embedded CPU. Since it was a headless device used mostly for 1GbE networking, the RAM was throttled and relatively impervious to Rowhammer.
QNAP has a $600 1U short-depth (11") 4x3.5 2xM.2 2x10GbE 2x2.5GbE 4-32GB DDR4 SODIMM Arm NAS that would benefit from OSS community attention. Based on a Marvell/Armada CN9130 SoC which supports ECC, has mainline Linux support, and public-but-non-upstream code for uboot [2]. With local serial console and a bit of effort, the QNAP OS can be replaced by Arm Debian/Devuan with ZFS. Rare combo of low power, small size, fast network, ECC memory and upstream-friendly Linux. QNAP also sell a 10GbE router based on the same SoC.
So I have to say, ECC memory is definitely something we should not have gotten away from for consumer hardware. My current PC, which is rocking a Core i9 14900k (pray for me) and an ASUS W680M ACE SE motherboard, allowed me to install some 5600MHz speed DDR5 ECC memory, and it works flawlessly.
The only downside in my view is the cost. Unbuffered ECC and the cost of using a workstation class chipset really pushes this into luxury territory. Plus, I'm never too sure what Intel's future plans are for successor processors and chipsets, which is why I settled on W680. I don't really want to go full blown Xeon.
A more significant downside than cost is speed. Your DDR5-5600 ECC most likely has a latency of about 16 ns, while non-ECC DDR5-7000 with 12 ns latency (10 ns if you are only interested in column strobe latency) is available for your platform. That gives 25% more bandwidth along with 25% - 40% lower latency.
Don't be fooled (like me) by the DDR5-6000/6400/6800 ECC registered modules: all desktop motherboards only support unbuffered modules, and most don't even support DDR5-5600 ECC, only DDR5-4800/5200 ECC.
> Unbuffered ECC and the cost of using a workstation class chipset really pushes this into luxury territory. Plus, I'm never too sure what Intel's future plans are for successor processors and chipsets, which is why I settled on W680.
Some cheap AMD motherboards support ECC. But the future is unknown. Ryzen 8000 CPUs don't.
Even more interestingly in the comment below, apparently you can just flick on ECC for RTX 4090 cards!
Extra weird is Nvidia singling out ray tracing as a use case which shouldn't use ECC... I suppose it's no biggie if a single ray goes the wrong way down the BVH, out of trillions.
As soon as I worked with high-load servers in the 90s, it was already clear that ECC should be the default.
Intel's marketing ploy on ECC is very destructive and has cost many parties a lot of wasted time, money and resources handling problems caused by non-ECC memory, and Linus Torvalds is absolutely right in roasting them for this.
Memory corruption can have very different impacts. A sample of decoded music getting corrupted leads to a small glitch, maybe even an inaudible one. An instruction in executable code getting corrupted can lead to all sorts of havoc.
Since ECC is seemingly not becoming mandatory, I've been wishing CPUs would support "soft-ECC". That is, the OS could mark certain pages as needing "soft-ECC", and the CPU would then store (at least) three copies of that page in RAM. When reading such pages back from RAM the CPU would read all physical copies and compare. If the majority agrees it can use that, otherwise raise an error.
This could then be used for executable pages and important configuration data which occupies relatively few pages, and where integrity matters a lot more than speed.
There's probably some good reasons why this is non-trivial to implement, I've forgotten most of what I learned about the virtual memory implementation in CPUs. But a man can dream...
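To make the voting step concrete, here is a minimal software sketch of the idea, not anything a real CPU or OS implements. It assumes each copy of the page is handed over as a `bytes` object; `read_soft_ecc` and `SoftEccError` are made-up names.

```python
from collections import Counter


class SoftEccError(Exception):
    """Raised when no strict majority of page copies agree."""


def read_soft_ecc(copies):
    """Return the page image that more than half of the copies agree on."""
    # bytes objects are hashable, so identical page images are counted together.
    content, votes = Counter(copies).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise SoftEccError("no majority among page copies")
    return content


page = bytes(range(16))
flipped = bytearray(page)
flipped[3] ^= 0x04  # simulate a single bit flip in one of the copies
assert read_soft_ecc([page, bytes(flipped), page]) == page
```

This votes on whole pages, as described above. Voting per word instead would also survive flips that hit different copies at different offsets, at the cost of more comparisons.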
The triple reading/writing to memory along with the comparing would probably be a significant performance hit. You could just use a bit of extra memory for parity bits etc. instead
DDR5 comes with on-die ECC. My understanding is this only checks errors occurring within the RAM itself, not errors that occur during transmission to and from RAM.
My question is, how common are transmission errors over errors happening within RAM?
LPDDR4/4X has also had on-die ECC for a while (at least the chips I'm used to, like in the Raspberry Pi); with such small lithography it's basically required to get the RAM to work reliably.
I would love to put ECC in my desktop computers. However, it's more expensive (ok), is not officially supported on most desktop motherboards (and in reality does not work in "ECC mode" on the majority of them), and finally: the ECC RAM available to purchase is painfully slow, in both bandwidth (:/) and latency (://)
Please, I'd love someone to tell me how to find and buy computers that support ECC. I am looking to buy a NUC/mini-server, and they basically all sell with non-ECC RAM. Last time I asked on this forum, I was told that on Intel, only Xeon processors support ECC, while all modern (?) AMD CPUs support it. Elsewhere I read that what matters is that the mobo supports it. I have no idea how to go about it.
So, let me ask again. If I were to buy a NUC new or off eBay, how can I be 100% sure it works with ECC RAM without having to spend half an hour researching CPU, mobo and BIOS specs for each single product I come across?
If I had a budget in the thousands, I would go with a Xeon server that comes with ECC pre-installed. I don't and have modest needs. I only want to splurge on ECC RAM to replace the original sticks.
(No "you don't need ECC for a NUC" reply please. That is not the point of my question, yet it is a far too common response)
I have never seen any true NUC-like computer that supports ECC SODIMMs. Intel has also used the NUC brand for a much larger computer that supported laptop Xeon CPUs, but that line has been abandoned.
There have been some NUC-like computers from ASRock industrial, Supermicro and others, with either Tiger Lake or Elkhart Lake CPUs, where you could enable in BIOS the so-called in-band ECC.
All these models are obsolete. Moreover, in-band ECC is an ugly and inefficient workaround. It can be used with soldered LPDDR memories, which do not have ECC variants, but it has worse performance than standard ECC. It is not cheaper, because it diminishes the memory capacity in the same ratio as any ECC and it requires a greater die area inside the CPU for its implementation (including a dedicated cache memory for the ECC bits).
There are many mini-ITX motherboards that support ECC (but you must check carefully the specifications, even if they are for AMD CPUs). For a smaller size than mini-ITX, there are 2 choices, either expensive industrial single-board computers, which usually have the 3.5" form factor of the PCB, or one of the so-called mobile workstation laptops from Dell, HP or Lenovo, e.g. a Dell Precision mobile workstation, which are also much more expensive than an equivalent NUC-like computer.
So, if you want a low price and an up-to-date fast CPU, you cannot have ECC in form factors smaller than mini-ITX. If paying double or triple is not a problem, there are solutions.
If you want a preassembled small computer with ECC and a mini-ITX motherboard, there are some at companies like ASRock Rack or Supermicro, but they are much more expensive than if you get the best components and you assemble them yourself.
Recent AMD processors need to be AMD Pro series to support ECC. Motherboard support is also required. On Intel's side, I think standard Xeon-type commercial boards very often support it. Unfortunately you’ll likely need to ask around when buying to ensure it supports it. If you can, getting a mini ATX known-good mobo in a small case may be easier.
I was on the NUC search a while ago and I'm not sure you can. Although the AMD motherboards may not officially support ECC, I haven't heard of any that actually don't. Best bet is probably to buy a recent, barebones AMD NUC, and buy the ECC RAM yourself. Sometimes they'll advertise ECC support as well.
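Once the machine is in hand, one way to check on Linux whether ECC is actually being reported is to look at the kernel's EDAC entries in sysfs. This is a minimal sketch, assuming an EDAC driver for the memory controller is loaded; `edac_counts` is a made-up helper name. An empty result only means no controller was registered, which is not proof either way.

```python
from pathlib import Path


def edac_counts(base="/sys/devices/system/edac/mc"):
    """Map each registered memory controller to (corrected, uncorrected) error counts."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = edac_counts()
    if not found:
        print("no EDAC memory controllers registered")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A controller that shows up here with counters is a good sign that the platform is running the DIMMs in ECC mode, though it is still worth cross-checking against the BIOS settings and the board's spec sheet.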
What’s the best price/performance for a home lab server running Linux with ECC these days? Bonus points if it is rackable.
Sadly, my go-to Linux hardware manufacturers either don’t offer ECC RAM, or only offer it as an option on their absolute top-end machines. Yes, yes, the extra two thousand dollars for a machine with a six-year lifespan probably is worth it on a monthly basis, but man it still hurts.
> What’s the best price/performance for a home lab server running Linux with ECC these days? Bonus points if it is rackable.
Old used enterprise server. None of them will be great at power/performance in typical (i.e. mostly idle) home use tho. Intel ones usually far better here
I recently(ish) built a new home server using a cheap AM5 motherboard from ASUS that supports ECC. Performance is good, and power usage is around 45 W idle with a couple of SSDs and a couple of HDDs spinning.
Not the cheapest, but I wanted to keep power consumption low for noise and reduced heating while still having good performance if needed.
I also considered a motherboard with IPMI on AM5(Asrock rack), but that was much more expensive.
> From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.
> For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for next generation DRAM and things continue to shrink.
Nice. That got me curious, how many electrons are in today's DRAM capacitor? I tried searching but haven't found any recent info.
I tried building an old rig (maybe ~7 years ago or so) using ECC RAM (since I was running two Xeons). It was such a pain in the butt to get it to boot and find sticks that were compatible with each other, don't really want to go down that path again.
[+] [-] kimixa|1 year ago|reply
[+] [-] jrockway|1 year ago|reply
[+] [-] hi-v-rocknroll|1 year ago|reply
[+] [-] cornholio|1 year ago|reply
[+] [-] ajross|1 year ago|reply
Only in the sense of "Intel is responsible for most of computation" ... No one uses ECC pervasively anywhere, that's sort of the point of the article.
[+] [-] jan_Sate|1 year ago|reply
[+] [-] transpute|1 year ago|reply
Ryzen Pro (OEM) can support ECC [3].
[1] https://www.qnap.com/en-us/product/ts-435xeu
[2] https://solidrun.atlassian.net/wiki/spaces/developer/pages/3...
[3] https://www.tomshardware.com/pc-components/cpus/amd-confirms...
[+] [-] TheAmazingRace|1 year ago|reply
[+] [-] Sweepi|1 year ago|reply
[+] [-] pixelpoet|1 year ago|reply
[+] [-] pseudalopex|1 year ago|reply
[+] [-] summerlight|1 year ago|reply
Interestingly, Jeff Atwood has changed his mind on ECC memory.
[+] [-] pixelpoet|1 year ago|reply
[+] [-] sufehmi|1 year ago|reply
[+] [-] magicalhippo|1 year ago|reply
[+] [-] HideousKojima|1 year ago|reply
[+] [-] grog454|1 year ago|reply
[+] [-] Animats|1 year ago|reply
[+] [-] thfuran|1 year ago|reply
[+] [-] wmf|1 year ago|reply
[+] [-] ido|1 year ago|reply
[+] [-] snvzz|1 year ago|reply
The FCC could just not allow computers to ship without it.
CPU makers like Intel and AMD could simply have their CPUs not work with non-ECC RAM.
Microsoft could e.g. require ECC RAM for Windows 12.
It is insanity that most computers shipping today do not use ECC and are thus unreliable.
With luck they'll crash, but most likely they will fail silently, while corrupting data.
[+] [-] dang|1 year ago|reply
Why Use ECC? (2015) - https://news.ycombinator.com/item?id=25167288 - Nov 2020 (98 comments)
Why Use ECC Memory? - https://news.ycombinator.com/item?id=23361577 - May 2020 (2 comments)
Should I buy ECC memory? (2015) - https://news.ycombinator.com/item?id=14206635 - April 2017 (224 comments)
Why use ECC? - https://news.ycombinator.com/item?id=10638324 - Nov 2015 (95 comments)
[+] [-] hi-v-rocknroll|1 year ago|reply
https://cr.yp.to/hardware/ecc.html (2001)
DEF CON 19 - Artem Dinaburg - Bit-squatting DNS Hijacking Without Exploitation (2011)
https://youtu.be/aT7mnSstKGs
https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20vid...
[+] [-] anonymousiam|1 year ago|reply
[+] [-] Nerada|1 year ago|reply
[+] [-] pclmulqdq|1 year ago|reply
Adding protocol-level ECC on top only helps, although it is somewhat inefficient.
[+] [-] gjjydfhgd|1 year ago|reply
You have no idea if you have tons of errors and how many were corrected.
[+] [-] hi-v-rocknroll|1 year ago|reply
https://en.wikipedia.org/wiki/Chipkill
DRAM Errors in the Wild: A Large-Scale Field Study (2009)
https://static.googleusercontent.com/media/research.google.c...
[+] [-] geerlingguy|1 year ago|reply
[+] [-] Sweepi|1 year ago|reply
[+] [-] sph|1 year ago|reply
[+] [-] adrian_b|1 year ago|reply
[+] [-] user_7832|1 year ago|reply
[+] [-] petronio|1 year ago|reply
[+] [-] P_I_Staker|1 year ago|reply
[+] [-] eadmund|1 year ago|reply
[+] [-] adql|1 year ago|reply
[+] [-] NorwegianDude|1 year ago|reply
Worked out quite nicely.
[+] [-] Palomides|1 year ago|reply
or whatever dell 730 or something fits your budget
[+] [-] BlueTemplar|1 year ago|reply
[+] [-] oskarkk|1 year ago|reply
[+] [-] dvt|1 year ago|reply
[+] [-] danparsonson|1 year ago|reply
[+] [-] _factor|1 year ago|reply
Binning is likely the problem.
[+] [-] Fnoord|1 year ago|reply
[+] [-] bjoli|1 year ago|reply
[+] [-] nextaccountic|1 year ago|reply
[+] [-] nottorp|1 year ago|reply
[+] [-] forty|1 year ago|reply