top | item 33517148

Libsecded

109 points | andutu | 3 years ago | pqsrc.cr.yp.to

22 comments


viraptor|3 years ago

I wonder how well that paper holds up over a decade later. It reviewed DDR1/2 in 2009. I like to ask people running ECC to check their error counters. (on Linux `edac-util -rfull`) From my very non-scientific survey, memory errors seem to happen significantly less often than this paper would lead you to believe. Then again, running ECC in the first place indicates better hardware than non-ECC, so that's a likely bias.

ploxiln|3 years ago

It can be very hard to get memory error reporting these days.

Bryan Cantrill mentions in one of his talks that Joyent had a datacenter where uncorrectable errors were sporadically halting servers, but no correctable errors were ever counted. He eventually got the motherboard firmware vendor to admit that these were handled "firmware-first" meaning intentionally not reported.

I've looked into using some consumer AMD CPUs that theoretically work with ECC memory, and a couple motherboards from ASUS and ASRock theoretically support ECC, but I've heard that it's hard to figure out if it's really working.

Testing whether a motherboard firmware actually reports ECC errors ... probably doesn't really happen, because it seems to work fine if it doesn't report them, and the company wants to just finish QA and ship. And the rare motherboard that does report errors correctly is more likely to trigger bugs in higher layers that were never actually tested before. And there's pressure to disable or hide this feature to reduce pesky customer support costs. No one else reports any errors, why does your product report errors, I want a replacement, etc.

Consumer DDR5 is all ECC, out of desperate necessity, but it doesn't report anything, so you can't tell how close to the sun it's flying. Rowhammer just keeps coming back.

erik|3 years ago

Has anyone tried using software to measure bit-flip rates on non-ECC systems? It seems like a pretty easy task. Turn off swap. Fill a bunch of memory with a known pattern. Every few hours read all the memory and verify that no bits were flipped. If the 2009 result holds on modern systems and a gigabyte of DRAM flips a bit every few hours, then evidence should show up pretty quickly.

adrian_b|3 years ago

At least according to more recent publications and to my own experience, a single gigabyte of good DRAM does not flip a bit every few hours, but every few months.

On non-server computers, which seldom reach peak memory usage, a bit flip may happen in a location where it does no harm.

Nevertheless, after a few years of use, a memory module can start to have frequent bit flips, even many per hour. If you have ECC, you will be notified about this and you will be able to replace the bad module. On most computers without ECC, that can easily lead to undetected data corruption.

Also, with 64 GB of DRAM, the error frequency is multiplied by 64.

bugfix-66|3 years ago

https://pqsrc.cr.yp.to/libsecded-20220828/INTERNALS.html

libsecded encodes an n-byte array using an extended Hamming code on the bottom bit of each byte, in parallel an extended Hamming code on the next bit of each byte, etc.

https://en.m.wikipedia.org/wiki/Hamming_code

Extended Hamming codes achieve a Hamming distance of four, which allows the decoder to distinguish between when at most one one-bit error occurs and when any two-bit errors occur. In this sense, extended Hamming codes are single-error correcting and double-error detecting, abbreviated as SECDED.

The main idea is to choose the error-correcting bits such that the index-XOR (the XOR of all the bit positions containing a 1) is 0. We use positions 1, 10, 100, etc. (in binary) as the error-correcting bits, which guarantees it is possible to set the error-correcting bits so that the index-XOR of the whole message is 0. If the receiver receives a string with index-XOR 0, they can conclude there were no corruptions, and otherwise, the index-XOR indicates the index of the corrupted bit.

Hamming codes have a minimum distance of 3, which means that the decoder can detect and correct a single error, but it cannot distinguish a double bit error of some codeword from a single bit error of a different codeword. Thus, some double-bit errors will be incorrectly decoded as if they were single bit errors and therefore go undetected, unless no correction is attempted.

To remedy this shortcoming, Hamming codes can be extended by an extra parity bit. This way, it is possible to increase the minimum distance of the Hamming code to 4, which allows the decoder to distinguish between single bit errors and two-bit errors. Thus the decoder can detect and correct a single error and at the same time detect (but not correct) a double error.

rtpg|3 years ago

I do wonder how many bits in RAM really are "harmlessly flippable". If I took a snapshot of a running machine, how likely is a flip to land somewhere bad? Perhaps a lot of memory ends up effectively write-only, so it's fine?

tiagod|3 years ago

My intuition is that it would probably depend on how much memory is being used and how fast.

If you're just decoding a lot of 8K video in SW, maybe most flips will be in the decoded bitmap frame buffer and will not be noticed. On the other hand, if you're crunching a lot of data and storing big, complex data structures in the process, a flip could more easily break some pointer address or length field and crash your program (or system, if it's kernel stuff).

>If I took a snapshot of a running machine, how safe is that flip from landing somewhere bad

Would be cool to set up KVM to flip a bit in a VM's memory every X amount of time and see how long it will take for weird stuff to happen.

dale_glass|3 years ago

I'm not sure how useful this is, because memory interacts with pretty much everything.

I mean, great: you've validated that the important financial data you were going to write to the DB is correct. But you didn't validate that the OS itself is in full working order. A bit goes out of place, the kernel writes something weird to disk, filesystem becomes corrupted and things explode in a dramatic fashion.

That's exactly why I try to get ECC everywhere these days. I had an old box serving firewall duty until one day it died because it got bumped, a memory module got loose somehow and the resulting disk corruption rendered it unbootable. Applications verifying that their data is correct wouldn't have changed anything.

jedisct1|3 years ago

GitHub mirror, since there doesn't seem to be a proper tarball: https://github.com/jedisct1/libsecded

This also adds a cross-platform build script.

bugfix-66|3 years ago

"Donated this to Microsoft Copilot for you."

In this case djb won't mind (CC0 license).

But it's time to move away from GitHub.

segfaultbuserr|3 years ago

Similar software error-checking techniques are often used in embedded systems. External electromagnetic interference can cause program counter, register and memory corruptions, but hardening the hardware is often prohibitively expensive. When the reliability requirements are not too high, redundant software checks are often a solution - the goal is not to eliminate all failures, but to reduce their probability.

The now-deleted (due to lack of citations) Wikipedia article Immunity-aware programming [0] was a good overview of this topic. Relevant techniques include:

> Token passing: Every function is tagged with a unique function ID. When the function is called, the function ID is saved in a global variable. The function is only executed if the function ID in the global variable and the ID of the function match. If the IDs do not match, an instruction pointer error has occurred, and specific corrective actions can be taken. [...] This is essentially an "arm / fire" sequencing, for every function call. Requiring such a sequence is part of safe programming techniques, as it generates tolerance for single bit (or in this case, stray instruction pointer) faults.

> Data duplication: To cope with corruption of data, multiple copies of important registers and variables can be stored. Consistency checks between memory locations storing the same values, or voting techniques, can then be performed when accessing the data. [...] When the data is read out, the two sets of data are compared. A disturbance is detected if the two data sets are not equal. An error can be reported. If both sets of data are corrupted, a significant error can be reported and the system can react accordingly.

> [...] CRCs are calculated before and after transmission or duplication, and compared to confirm that they are equal. A CRC detects all one- or two-bit errors, all odd errors, all burst errors if the burst is smaller than the CRC, and most of the wide-burst errors. Parity checks can be applied to single characters (VRC—vertical redundancy check), resulting in an additional parity bit or to a block of data (LRC—longitudinal redundancy check), issuing a block check character. Both methods can be implemented rather easily by using an XOR operation. A trade-off is that less errors can be detected than with the CRC. Parity Checks only detect odd numbers of flipped bits. The even numbers of bit errors stay undetected. A possible improvement is the usage of both VRC and LRC, called Double Parity or Optimal Rectangular Code (ORC).

> Function parameter duplication: Parameters passed to procedures, as well as return values, are considered to be variables. Hence, every procedure parameter is duplicated, as well as the return values. A procedure is still called only once, but it returns two results, which must hold the same value. The source listing to the right shows a sample implementation of function parameter duplication.

> Test/branch duplication: To duplicate a [if-else] test at multiple locations in the program. [...] For every conditional test in the program, the condition and the resulting jump should be reevaluated, as shown in the figure. Only if the condition is met again, the jump is executed, else an error has occurred.

None of the mainstream compilers have these features, so programmers often do all of these tasks by hand (!) in C. If someone added these kinds of features to GCC or LLVM/Clang (similar to how buffer-overflow exploits are mitigated by automatic stack canaries or Control-Flow Integrity checks), it would be a major contribution to the entire world of embedded systems development.

[0] https://web.archive.org/web/20180519034600/https://en.wikipe...

rt12121212|3 years ago

Thanks for the link.

What if, instead of passing tokens, checksums were passed, and the function checked that its own code matched the checksum? This would give some protection against both corruption of the code and instruction pointer errors.

Another element from the article was having copies of the function and comparing the return values, but I suspect this breaks down when the function deals with external state. Possibly it could be done by intercepting the state-related calls and making them atomic/combining them. I feel like there's something here reminding me of STM [0].

I suspect that going for hardware with full ECC coverage of the execution architecture will always be a better investment of time, and will result in simpler, more scalable applications.

[0] https://www.infoq.com/news/2010/05/STM-Dropped/

ece|3 years ago

This would be interesting to see in a JIT, even if on a sampling basis. I also wonder if some instruction filtering/detection approach would work for rowhammer.

CalChris|3 years ago

Doesn't the LPDDR5 in the M2 support ECC? I believe it corrects errors but doesn't report them, no?

adrian_b|3 years ago

The ECC in DDR5/LPDDR5 corrects only internal errors and it has this extra facility only to counteract the degradation of reliability vs. DDR4/LPDDR4, due to smaller cells and faster operation.

It does not really increase reliability much over older generations; the mentions of internal ECC are mostly marketing BS.

The ECC that is implemented in the memory controller inside the CPU package protects not only against bit flips in the DRAM arrays, but also against bit errors that happen elsewhere on the long way between memory chips and CPU chips, due to electrical noise, bad seating of CPUs or memory modules in their sockets or cheap sockets whose contacts become oxidized in time.

Due to the increased memory throughput, the links between CPU and memory become more and more sensitive to electrical noise at every new generation.

On laptops or small computers where both the CPUs and the memory chips are soldered on the same PCB, or they are stacked in the same package, ECC is somewhat less important, but on any computer with socketed memory modules ECC should have been mandatory.

fulafel|3 years ago

A web search doesn't bring up any references to this feature, other than the bus layer error coding for signal integrity in transit that is standard in LPDDR5.

Non-LP (and thus non-Apple) DDR5 does have ECC.

An additional twist here is that apparently the ECC was added to DDR5 because process shrinks and memory size increases have caused an increase in bit flips, so this is needed to keep reliability at the previous non-ECC level. There's an additional "actually robust" level of ECC, which is still sold separately. [1]

I guess we might ask why LPDDR5 is missing the DDR5 equivalent "keep running to stay in the same place" ECC, and what this means to reliability...

[1] https://en.wikipedia.org/wiki/DDR5_SDRAM#DIMMs_versus_memory...

throwaway81523|3 years ago

For a large array maybe you are better off with e.g. a Reed-Solomon code instead of a Hamming code.