eloy | 5 years ago

He does explain it:

> We have decades of odd random kernel oopses that could never be explained and were likely due to bad memory. And if it causes a kernel oops, I can guarantee that there are several orders of magnitude more cases where it just caused a bit-flip that just never ended up being so critical.

It might be false, but I think it's a reasonable assumption.

IgorPartola|5 years ago

To someone on HN who isn’t familiar with what ECC does, that explains nothing about how ECC works, how it could have prevented these situations, or how often they really happen.

simias|5 years ago

The problem is that, if you don't have ECC to detect the errors, it's very hard to know what exactly caused a random, non-reproducible crash. Especially in kernel mode where there's little memory protection and basically any driver could be writing anywhere at any time.

I can understand Linus's frustration from that point of view: without ECC RAM, when you get some super weird crash report where some pointer got corrupted for no apparent reason, you can't be sure if it was just a random bitflip or if it's actually hiding a bigger problem.
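For anyone wondering what "detecting the error" looks like mechanically, here is a toy sketch using a Hamming(7,4) code. This is an illustration only: real ECC DIMMs typically use a wider SECDED code (72 bits stored per 64 data bits) implemented in the memory controller, and the function names below are made up for the example.

```python
# Toy Hamming(7,4): 4 data bits + 3 parity bits, corrects any single bit flip.
# Illustrative only; not the code actual ECC hardware uses.

def hamming74_encode(data):
    """data: list of 4 bits -> list of 7 bits (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """code: list of 7 bits -> (data bits, 1-based error position or 0)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4  # syndrome names the flipped position
    if error_pos:
        c[error_pos - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], error_pos

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a random bitflip at position 5
recovered, where = hamming74_decode(stored)
print(recovered == word, where)
```

The point for debugging is the second return value: with ECC the hardware can report that a flip happened (and where), so a corrupted pointer can be attributed to memory instead of being left as an unexplained oops.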

chalst|5 years ago

From https://en.m.wikipedia.org/wiki/ECC_memory -

> A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance ’09 conference.[6] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10⁻¹¹ error/bit·h) and 70,000 (7.0 × 10⁻¹¹ error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.
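The unit conversions in that quote can be checked with a few lines of arithmetic. This is just a sanity check of the quoted figures, assuming decimal units (1 megabit = 10^6 bits, 1 gigabyte = 8 × 10^9 bits); the function names are made up for the example.

```python
# Convert "errors per billion device hours per megabit" into the other
# units used in the quote. Assumes decimal megabits and gigabytes.

HOURS_PER_BILLION = 1e9
BITS_PER_MEGABIT = 1e6
BITS_PER_GIGABYTE = 8e9

def errors_per_bit_hour(errors_per_billion_hours_per_mbit):
    return errors_per_billion_hours_per_mbit / (HOURS_PER_BILLION * BITS_PER_MEGABIT)

def hours_per_error_per_gigabyte(errors_per_billion_hours_per_mbit):
    rate_per_gb_hour = errors_per_bit_hour(errors_per_billion_hours_per_mbit) * BITS_PER_GIGABYTE
    return 1.0 / rate_per_gb_hour

low = errors_per_bit_hour(25_000)    # about 2.5e-11 error/bit·h
high = errors_per_bit_hour(70_000)   # about 7.0e-11 error/bit·h
print(low, high, round(hours_per_error_per_gigabyte(70_000), 1))
```

At the upper figure that works out to roughly one bit error per gigabyte every 1.8 hours, matching the quote. With a binary gigabyte (2^30 bytes) the interval would come out a bit shorter.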

reader_mode|5 years ago

It takes 5 seconds to Google ECC memory if you're really interested, and if you're working on kernel-related stuff, it's 99.9999% certain you know what it is.