

adhoc32 | 1 year ago

I've been there, and it was a pain. All my backups were corrupted due to a faulty RAM module. Initially, I blamed the hard drives because they seemed to be failing right before my eyes. I was copying a large file; sometimes it copied okay, but occasionally it would become corrupted. Since then, I've been paying a premium for ECC.


TheCondor | 1 year ago

Same experience. We were doing all the right things: regular backups, rotating them, verifying them. Then a weekly verification test failed. We tested some older backups and they failed too! If the data matters, it’s hard to express the stress and dismay you feel in that moment.

Memory is different from all other resources in the system. As engineers we are conditioned to expect drives to fail more often than anything else, and when memory fails it is indistinguishable from a drive failure. Some system behaviors matter too: we tend to think that page allocation is random, and on heavily loaded systems it appears to be, but on specialized systems it can be rather consistent, so the verification can fail in nearly the same place, repeatedly. Riddle me this: what is more likely? A memory failure, a drive failure, or a PostgreSQL bug that results in a corrupted row? badblocks checks out on the server’s disks… If the data matters, it is extremely unpleasant going through that whole thing. It’s crystal clear after the fact, but it’s a bloody nightmare in the heat of it all.
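
One rough way to probe the "memory or drive?" question is to hash the same file several times and see whether the reads even agree with each other. The following is only a sketch under assumptions: the names `sha256_of` and `classify_reads` are made up for illustration, and the OS page cache may serve repeated reads from RAM, so a real check would probably need a file larger than memory or dropped caches between passes.

```python
import hashlib
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large backups do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def classify_reads(path, expected=None, passes=5):
    """Read the same file `passes` times and compare the digests.

    Hypothetical helper, not a standard tool. Returns one of:
      "inconsistent"    - reads disagree with each other
      "stable-mismatch" - reads agree, but differ from `expected`
      "stable"          - reads agree (and match `expected` if given)
    """
    digests = [sha256_of(path) for _ in range(passes)]
    if len(Counter(digests)) > 1:
        return "inconsistent"
    if expected is not None and digests[0] != expected:
        return "stable-mismatch"
    return "stable"
```

An "inconsistent" result hints at something in the read path (RAM, controller, cabling) rather than bad sectors, since the bytes on disk did not change between passes. A "stable-mismatch" only says the stored bytes differ from what was expected; as noted above, memory faults can repeat in nearly the same place, so a stable result does not clear the RAM by itself.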