Honestly, this doesn't sound like a bad batch of drives or the like -- sounds like they weren't doing scrubbing on their RAID.
In case it's helpful, and for general knowledge dissemination:
What likely happened is that a drive "failed". This is usually when the RAID card decides that a drive has had enough command errors that it fails the drive. It may actually be fine, and just had a spate of bad responses. You might try to online the drive again and let it rebuild, but that's debatable.
At any rate, they replaced the drive. That's fine. But then, to rebuild the RAID back to an optimal state, it has to read all of the data off of the other drives. Here's where a bad scrubbing policy bites you -- because if those drives have any sectors that have gone bad, or other hardware problems, they might fail as soon as the rebuild runs.
Scrubbing should be done regularly (weekly?). What it does is, in essence, test every sector on all of the disks in the array to make sure that all of them are still fully functional so that -- if there is a failure -- you're pretty sure you can rebuild.
The downside of scrubbing is that, for better or worse, it does exercise your disks fairly heavily. Also, if you don't have a suitable trough period then you might even find it difficult to have the available I/O bandwidth to do it.
I'd recommend ZFS for working with these kinds of arrays; scrub is a single command that's easy to schedule regularly, and it will only use idle I/O bandwidth (it helps that the RAID functionality is integrated with the filesystem).
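for reference, the whole ZFS scrub policy fits in a couple of crontab lines; `tank` is a placeholder pool name, substitute your own:

```
# crontab entries; 'tank' is a placeholder pool name -- substitute your own.
# kick off a scrub every sunday at midnight; scrub only uses idle I/O bandwidth.
0 0 * * 0 /sbin/zpool scrub tank
# later the same day, report only unhealthy pools (silent when all is well).
0 8 * * 0 /sbin/zpool status -x
```

`zpool status -x` is quiet when everything is healthy, so cron only mails you when something needs attention.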
ZFS also fails more gracefully if you have an infrequent scrub policy and hit bad sectors on rebuild: it can detect checksum failures and mark specific files as corrupted, rather than declaring whole disks defective. Traditional Linux md RAID behaviour is particularly bad in this regard. Say you have a RAID 6 configuration, haven't been scrubbing, and then have a single disk failure. All your disks will have a few random, isolated bad sectors (i.e. sectors that will URE when you attempt to read them), but since you still have one disk's worth of parity, it's possible to recover all your data with no downtime; with ZFS raidz2, that is what would happen. With md RAID, though, as soon as you hit those bad sectors during the rebuild, it considers those drives as failing and kicks them out of the array. Since every drive has at least one bad sector, that makes the array impossible to recover.
everyone seems to be using this to push their favourite file system, which is fine and all, but if you're using software raid on linux this is the kind of thing you need:
#!/bin/bash
#
# This script checks all RAID devices on the system
# http://en.gentoo-wiki.com/wiki/Software_RAID_Install#Data_Scrubbing
for raid in /sys/block/md*/md/sync_action; do
    echo "check" > "${raid}"
done
(the link referred to seems to be down for me at the moment - i just took this from my main machine).
then add a crontab entry to run it once a week or so:
0 0 * * 0 /root/bin/scrub-raid.sh
you can check that it worked by running the script by hand and then looking at /proc/mdstat, where you'll see the check in progress.
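for illustration, the running check shows up as a progress line in /proc/mdstat; this sketch pulls the percentage out of a captured sample line rather than live output (the numbers are made up):

```shell
# the kernel reports scrub progress in /proc/mdstat while a check runs.
# grep a captured sample line rather than live output (numbers made up):
sample='[==>..................]  check = 12.3% (120052352/976762584) finish=89.1min speed=160071K/sec'
printf '%s\n' "$sample" | grep -o 'check = [0-9.]*%'
```

on a real machine you'd just `cat /proc/mdstat` and eyeball the same line.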
They didn't mention RAID, and they talk about using SSDs. The failure mode you describe (timeouts) is typical of spinning-rust drives, not of SSDs.
I don't think this applies here. They just moved over to their new RAID setup last Saturday, so they had a major failure within about 72 hours of that transition.
The failure characteristics and sizes of SSDs are not the same as those of conventional hard drives, though, so you can't really make much out of that article in an SSD context.
SSDs still seem to have a bunch of nasty failure cases. Right now, for production use I'm not sure I'd trust the things for reliable storage. As a fast cache for spinning rust? Definitely. As my only live copy, even duplicated in a RAID? Hmm.
Not even Intel SSDs are immune: one of the Debian developers has reported that the SSDs shipped in the latest Thinkpads die if you try to construct an encrypted filesystem on them. Somehow they corrupt themselves during the initial write of random data to the disk.
(Interesting that these SSDs died whilst under high write load too: is this a particular weak point for some reason?)
When your engines stop, you trim for best glide ratio, point yourself at an open field, and fly the plane all the way to the ground. Failure to do so is a great way to get yourself killed.
BTW I should add that I'm still a fan of TOR and have no plans to switch away. In fact, their upfront explanation of what went wrong and how has improved my opinion of the project.
Sympathies. We had a similar thing happen at a previous job in the days of the "deathstar" drives. Lost a drive, no biggie. Tell the DC guy to replace it. Lost a second drive, told him to start running towards our cage. Lost a third drive, uhh - how current are our backups?
Different job - had a developer accidentally run a where-clause-less delete in production. Same net result. RAID and SANs are definitely not backup solutions.
Very much so. For us, a RAID mostly buys time to move all the important data off of it. That might be drastic, but it's safe, and our data isn't too big.
If your service goes down, be sure to at least have some notion of what your service does on the homepage. Right now, it's only an error report and I had to skim through the blog for a while to figure out that it's a kind of Google Reader replacement.
When your engines stop and you're getting lots of eyeballs, don't assume they all know you.
So, a glide to an emergency landing means that the flight is not over?
Similarly, just because a rebuild fails doesn't mean that the data is lost. Just pop in the drives individually and fetch whatever you can from each of them; it is seldom the case that the exact same parts of them become irretrievable at the same time, or that the drives stop functioning completely. It's just more work than a regular rebuild.
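a minimal sketch of that salvage step with plain dd (GNU ddrescue is the better tool if it's available); the device name and output path are placeholders:

```shell
# image each surviving member individually, skipping unreadable sectors
# instead of aborting; conv=noerror,sync pads bad blocks with zeros so
# offsets stay aligned. /dev/sdb and the output path are placeholders.
dd if=/dev/sdb of=/mnt/rescue/sdb.img bs=64K conv=noerror,sync status=progress
```

you then loop-mount or carve the images at leisure, instead of stressing a dying drive twice.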
In my experience, all SSDs are a bad batch unless they're Intel 520 or Samsung's high end range.
A big mistake is to fill an SSD completely up with data. This is a huge no-no unless it's an enterprise drive, which usually uses a totally different design.
Either those drives are all from the same bad batch, or this isn't a drive problem. I would be thinking either a cooling problem or flaky controllers, and a drive swap in that case is not the solution.
I doubt those drives were from a bad batch. As these were SSDs, the most likely culprit is write wear. Two SSDs should never be in a RAID 1 pair -- or at least, replace one SSD in a RAID 1 on a predetermined schedule instead of replacing on failure only. They wear out at the same pace and will fail at about the same time.
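a back-of-envelope sketch of why mirrored SSDs tend to die together, with hypothetical numbers (a 300 TBW endurance rating and ~100 GB/day of writes):

```shell
# hypothetical numbers: a 300 TBW drive seeing ~100 GB of writes per day.
# RAID 1 mirrors every write to both members, so both accumulate wear at
# exactly this rate and reach end-of-life at about the same time.
awk 'BEGIN { tbw = 300; gb_per_day = 100; printf "%.1f years\n", tbw * 1024 / gb_per_day / 365 }'
```

whatever the real numbers are for your drives, the point is that both members of the mirror share them exactly, so the estimate is the same for both.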
xb95 | 12 years ago
That said, you should be scrubbing if you're not already.
andrewcooke | 12 years ago
finally, my notes on this - http://www.acooke.org/cute/ScrubbingR0.html
bigiain | 12 years ago
"But even today a 7 drive RAID 5 with 1 TB disks has a 50% chance of a rebuild failure. RAID 5 is reaching the end of its useful life."
http://www.zdnet.com/blog/storage/why-raid-5-stops-working-i...
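as a sanity check on that 50% figure, here's the URE arithmetic with the commonly quoted consumer-drive spec of one unrecoverable read error per 10^14 bits (these are assumptions, not measurements):

```shell
# a 7-drive RAID 5 rebuild must read the 6 surviving 1 TB drives in full:
# 6 * 10^12 bytes * 8 bits. with a URE rate of 1e-14 per bit, the chance of
# hitting at least one URE is 1 - (1 - 1e-14)^bits ~= 1 - exp(-bits * 1e-14).
awk 'BEGIN { bits = 6 * 1e12 * 8; printf "%.0f%%\n", (1 - exp(-bits * 1e-14)) * 100 }'
```

that comes out to roughly 38% with these numbers; the article's 50% presumably assumes a larger array or a worse error rate, but either way it's an uncomfortably large chance of a failed rebuild.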
ck2 | 12 years ago
They are relatively slow for home computer use, but for servers much more reliable.
That said, this chart concerns me:
http://www.ssdaddict.com/ss/Endurance_cr_20130122.png
25nm Vs 34nm http://google.com/search?q=cache%3Ahttp%3A%2F%2Fwww.xtremesy...
My first personal computer SSD is going to be the Samsung 830 -- two from different batches, in RAID 1.
Also, when someone else builds your servers, you should query the SMART info from the drives to make sure they aren't used SSDs.
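for example, smartmontools' `smartctl -A /dev/sda` prints the SMART attribute table; Power_On_Hours (and Total_LBAs_Written, where the drive reports it) will give away a used drive. this parses a captured sample line rather than live output, with a made-up value:

```shell
# parse a captured 'smartctl -A' sample line (raw value made up: 8760 hours).
# on a supposedly new drive you'd expect this to be near zero.
sample='  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       8760'
printf '%s\n' "$sample" | awk '{ printf "%.1f years powered on\n", $NF / 8760 }'
```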
coldtea | 12 years ago
You should not trust anything for "reliable storage". That's what backups and redundant drives are for.
ronilan | 12 years ago
"When all your engines stop, the flight is just starting"
achille | 12 years ago
It's still unclear how the storage failure was related to the migration. Was the new engine/fs disruptive to the SSD?
nknighthb | 12 years ago
Edit re migration connection: Anything that stresses hardware already close to the edge can trigger a failure.
coldtea | 12 years ago
Pedantic and off-topic as it is, this is incorrect, at least for airplanes.
When all your engines stop you continue to glide, and you can even manage to land successfully with a little skill and luck.