I once had a small fleet of SSDs fail because they had some uptime counters that overflowed after 4.5 years, and that somehow persistently wrecked some internal data structures. It turned them into little, unrecoverable bricks.
It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.
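(As an aside, the wrap-around arithmetic is easy to sketch. The comment doesn't say what the counter actually was, so the widths and tick rates below are pure assumptions; a 32-bit counter ticking at 30 Hz happens to land right around the 4.5-year mark.)

```python
# Back-of-the-envelope: continuous uptime before an N-bit counter wraps.
# The widths and tick rates are assumptions, not the actual firmware's values.

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~31.6 million

def years_until_wrap(bits: int, ticks_per_second: float) -> float:
    """Years of continuous power-on time before a 'bits'-wide counter overflows."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_YEAR

print(years_until_wrap(15, 1 / 3600))  # signed 16-bit hours counter: ~3.7 years
print(years_until_wrap(32, 30))        # 32-bit counter at 30 Hz: ~4.5 years
```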
I had a similar issue, but it was a single RAID-5 array and wear or some other manufacturing defect. The drives were the same brand, model, and batch. When the first failed and the array went into recovery mode, I ordered 3 replacements and upped the backup frequency. It was good that I did, because the two remaining drives died shortly after.
The lesson stuck: the three replacements went to different arrays, and we never again let drives from the same batch be part of the same array.
There's a principle in aviation of staggering engine maintenance on multi-engine airplanes to avoid maintenance-induced errors leading to complete power loss.
See, e.g., Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999: https://flightsafety.org/amb/amb_sept_oct99.pdf
Yeah, just came here to say this. Multiple disk failures are pretty probable. I've had batches of both hard disks and SSDs with sequential serial numbers, subjected to the same workloads, all fail within the same ~24-hour window.
I guess proper redundancy, in some cases, also means having different brands of equipment.
This is why I try to mismatch manufacturers in RAID arrays. I'm told there is a small performance hit (things run toward the speed of the slowest drive, separately in terms of latency and throughput), but I doubt the difference is significant, and I like the reduction in potential failure-during-rebuild rates. Of course I have off-machine and off-site backups as well as RAID, but having to use them to restore a large array would be a greater inconvenience than just rebuilding the array (followed by checksum verifies over the whole lot, for paranoia's sake).
Eek - now I'm glad I wait a few months before buying each disk for my NAS.
Not doing it for this reason but rather for financial ones :) But as I have a totally mixed bunch of sizes, I have no RAID, and a disk loss would be horrible.
That's why serious SAN vendors take care to provide you with a mix of disks (e.g., on a brand-new NetApp you can see that the disks are of 2-3 different types, with quite different serial numbers).
Or even if the power supplies were purchased around the same time. I had a batch of servers that, as soon as they arrived, started chewing through hard drives. It took about 10 failed drives before I realized the problem was the power supplies.
Anyone familiar with car repair will tell you that if one headlight burns out you should just go ahead and replace both, because of this exact phenomenon. I suppose with LEDs we may not have to worry about it anymore.
Even if they're not from the same batch, they're written to at the same time and rate, meaning they accumulate the same wear over time, are subject to the same power/heat issues, etc.
Hopefully, regularly checking the disks' S.M.A.R.T. status will help you stay on top of issues caused by those factors.
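For anyone who wants to automate that, here's a minimal sketch that shells out to smartmontools' smartctl; the device list and the alerting are placeholders for your environment. In practice, the smartd daemon that ships with smartmontools can do this monitoring on a schedule for you.

```python
# Minimal periodic S.M.A.R.T. health check via smartmontools' smartctl.
# Assumes smartctl is installed and this runs with root privileges.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # placeholders; adjust for your machine

def smart_healthy(device: str) -> bool:
    """Return True if smartctl's overall health self-assessment passes."""
    # smartctl -H prints "PASSED" for ATA/NVMe drives and "OK" for SCSI
    # drives when the drive reports itself healthy.
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    return "PASSED" in result.stdout or "OK" in result.stdout

for dev in DEVICES:
    if not smart_healthy(dev):
        print(f"WARNING: {dev} failed its S.M.A.R.T. health check")
```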
Also, you shouldn't wait for disks to fail before replacing them. HN's disks were used for 4.5 years, which is longer than the typical disk lifetime in my experience. They should have replaced them sooner, one by one, in anticipation of failure. This would also have allowed them to stagger their disk purchases to avoid similar manufacturing dates.
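A staggered replacement schedule is easy to work out. A hypothetical sketch, where the install date, service life, and stagger interval are all made-up numbers: retire one drive every few months, all before the assumed end of service life, so no two drives in the array ever come from the same purchase.

```python
# Hypothetical staggered-replacement schedule; all dates and intervals
# below are assumptions for illustration, not recommendations.
from datetime import date, timedelta

INSTALLED = date(2018, 1, 15)            # when the array went live (assumed)
SERVICE_LIFE = timedelta(days=4 * 365)   # replace before ~4 years (assumed)
STAGGER = timedelta(days=90)             # spread replacements ~3 months apart
DRIVES = ["disk0", "disk1", "disk2", "disk3"]

for i, drive in enumerate(DRIVES):
    # The last drive is replaced right at the service-life mark; each earlier
    # drive goes one stagger interval sooner, so purchases never coincide.
    replace_by = INSTALLED + SERVICE_LIFE - STAGGER * (len(DRIVES) - 1 - i)
    print(f"{drive}: replace by {replace_by.isoformat()}")
```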