What does uncorrectable error mean in RAM? I’ve read somewhere that they use Hamming codes for ECC RAM. Isn’t it the case that too many errors (more than floor((d - 1) / 2)) simply result in an incorrect codeword? Or do they know the error locations a priori? (i.e. erasure coding)
derefr|3 years ago
First, let's define a correctable error — that means an error where you have enough additional information (in your Hamming-code stream, in a parity bit, whatever) to repair the error, e.g. a one-bit error when you're using ECC RAM.
An uncorrectable error, then, is one where you do have enough information to detect the corruption, but not enough information to figure out the correct repair for the corruption. (The amount of information required to detect corruption is always less than the amount of information required to correct it.)
With ECC RAM (typically a SECDED code: single-error-correcting, double-error-detecting), exactly two bit-flips in a word will produce a detected, but uncorrectable, error on read-back.
With non-ECC but "with a parity-bit per word" RAM (which exists, and is a bit cheaper than ECC RAM), you can't correct anything, only detect. (Which is sometimes all you need, if you're willing to do the calculation over again.)
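To make the parity-only case concrete, here's a toy sketch in Python (illustrative only, not any particular module's logic): a single parity bit per word detects any odd number of flips, but can't locate the flipped bit, and an even number of flips slips through entirely.

```python
def parity(word: int) -> int:
    """Return the even-parity bit for an integer word."""
    return bin(word).count("1") % 2

stored_word, stored_parity = 0b10110010, parity(0b10110010)

# A single bit flip in storage changes the parity...
corrupted = stored_word ^ 0b00000100
assert parity(corrupted) != stored_parity        # detected

# ...but parity alone can't tell you WHICH bit flipped, so no correction.
# Worse, two flips cancel out and go completely undetected:
doubly_corrupted = stored_word ^ 0b00000110
assert parity(doubly_corrupted) == stored_parity  # undetected!
```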
All that being said: completely separately from these hardware-level features, some operating systems (e.g. macOS) compress memory, and generate page-level checksums of memory pages as they're compressed. A checksum failure during memory-page decompression can also trigger the kernel to throw this kind of "uncorrectable error" itself.
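Conceptually, that checksum-on-compress scheme looks something like the sketch below (a hypothetical illustration using zlib; the function names are mine and this is not Apple's actual implementation):

```python
import zlib

def compress_page(page: bytes) -> tuple[bytes, int]:
    """Compress a memory page and record a checksum of the compressed blob."""
    blob = zlib.compress(page)
    return blob, zlib.crc32(blob)       # checksum taken at compression time

def decompress_page(blob: bytes, checksum: int) -> bytes:
    """Verify the stored checksum before decompressing."""
    if zlib.crc32(blob) != checksum:
        # The page was corrupted while sitting compressed in RAM; there's
        # no redundancy to repair it with, so all the kernel can do is
        # report an uncorrectable error (or kill the affected process).
        raise MemoryError("uncorrectable error: page checksum mismatch")
    return zlib.decompress(blob)
```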
There may also exist (i.e. it wouldn't be impossible for there to be) RAM modules that continuously calculate page checksums for each page on each write; and then check the contents of pages against these, perhaps asynchronously, in sort of the same way a ZFS scrub works. I've never heard of this being done, but it feels like the sort of thing you'd implement in hardware for an extremely "ruggedized" system like a Mars rover. If this approach were to be implemented, it would also emit "uncorrectable" errors.
justsomehnguy|3 years ago
>> HPE Fast Fault Tolerant (ADDDC)—Enables the system to correct memory errors and continue to operate in cases of multiple DRAM device failures on a DIMM. Provides protection against uncorrectable memory errors beyond what is available with Advanced ECC.
https://techlibrary.hpe.com/docs/iss/proliant-gen10-uefi/s_c...
https://www.hpe.com/us/en/collaterals/collateral.4aa4-3490.M...
hansvm|3 years ago
However, you can cheaply extend Hamming codes with, e.g., a parity bit, so that errors in a slightly larger radius are detectable, though for obvious reasons you couldn't correct such an error.
No comment on what sort of algorithm is used for ECC, though it might be worth mentioning that the above is a pretty general feature of error correction: it's often possible, cheaply or even for free, to detect errors in a larger radius than you're able to correct.
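Here's a toy extended-Hamming(8,4) "SECDED" code showing exactly that extension: a Hamming(7,4) codeword plus one overall parity bit corrects any single flip and detects, but cannot correct, any double flip. (This is my own minimal sketch; real DIMMs use wider codes over 64- or 128-bit words, but the principle is the same.)

```python
def encode(data):
    """[d1, d2, d3, d4] -> 8-bit codeword [p0, p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4            # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4            # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4            # covers positions 4,5,6,7
    word = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for b in word:               # overall parity over the 7 Hamming bits
        p0 ^= b
    return [p0] + word

def decode(cw):
    """Return ("ok", data) or ("uncorrectable", None)."""
    p0, rest = cw[0], list(cw[1:])
    s = 0                        # Hamming syndrome: XOR of positions of set bits
    for pos, bit in enumerate(rest, start=1):
        if bit:
            s ^= pos
    overall = p0
    for b in rest:
        overall ^= b             # 0 if total parity is consistent
    if s == 0 and overall == 0:
        pass                     # no error
    elif overall == 1:
        if s != 0:
            rest[s - 1] ^= 1     # single flip: syndrome names the position
        # s == 0 means the flip hit p0 itself; the data is fine either way
    else:
        # Syndrome set but total parity consistent: an even number of flips.
        # We know the word is bad, but not which bits to repair.
        return ("uncorrectable", None)
    return ("ok", [rest[2], rest[4], rest[5], rest[6]])
```

For example, `decode` repairs one flipped bit transparently, while two flips in the same word come back as `("uncorrectable", None)` — which is precisely the distinction the thread is about.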