I've been there, and it was a pain. All my backups were corrupted due to a faulty RAM module. Initially, I blamed the hard drives because they seemed to be failing right before my eyes. I was copying a large file; sometimes it copied okay, but occasionally it would become corrupted. Since then, I've been paying a premium for ECC.
TheCondor|1 year ago
Memory is different from every other resource in the system. As engineers, we are conditioned to expect drives to fail more often than anything else, and when memory fails it is indistinguishable from a drive failure. Some system behaviors matter too: we tend to think that page allocation is random, and on heavily loaded systems it appears to be, but on specialized systems it can be rather consistent, so verification can fail in nearly the same place, repeatedly. Riddle me this: what is more likely? A memory failure, a drive failure, or a PostgreSQL bug that results in a corrupted row? Badblocks checks out on the server’s disks… If the data matters, going through that whole thing is extremely unpleasant. It’s crystal clear after the fact, but it’s a bloody nightmare in the heat of it all.
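For anyone who wants to see where a copy went wrong rather than just that it did, here is a minimal sketch in Python. It hashes the source and the copy in fixed-size chunks and reports the byte offsets of chunks that differ. The function names and the 1 MiB chunk size are my own choices for illustration, not anything from a particular tool. It only locates the damage; it can't tell you whether RAM or a disk caused it, and repeated reads may be served from the page cache rather than the drive.

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for illustration


def chunk_digests(path, chunk_size=CHUNK_SIZE):
    """Return a list of SHA-256 hex digests, one per fixed-size chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def mismatched_offsets(src, dst, chunk_size=CHUNK_SIZE):
    """Compare two files chunk by chunk; return byte offsets of differing chunks."""
    a = chunk_digests(src, chunk_size)
    b = chunk_digests(dst, chunk_size)
    offsets = [i * chunk_size for i, (x, y) in enumerate(zip(a, b)) if x != y]
    # A length difference counts as a mismatch starting at the first missing chunk.
    if len(a) != len(b):
        offsets.append(min(len(a), len(b)) * chunk_size)
    return offsets
```

Run it after each of several copies of the same file and compare the offset lists. Per the point above, offsets that repeat don't rule out memory, since page allocation can be fairly consistent on some systems.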