“When all your engines stop, the flight is over”

[+] xb95|12 years ago|reply

Honestly, this doesn't sound like a bad batch of drives or the like -- sounds like they weren't doing scrubbing on their RAID.

In case it's helpful, and for general knowledge dissemination:

What likely happened is that a drive "failed". This is usually when the RAID card decides that a drive has had enough command errors that it fails the drive. It may actually be fine, and just had a spate of bad responses. You might try to online the drive again and let it rebuild, but that's debatable.

At any rate, so they replaced the drive. That's fine. But then, to rebuild the RAID back to optimal state, it has to read all of the data off of the other drives. Here's where a bad scrubbing policy bites you -- because if those drives have any sectors that have gone bad or problems with the hardware, those drives might fail as soon as the rebuild runs.

Scrubbing should be done regularly (weekly?). What it does is, in essence, test every sector on all of the disks in the array to make sure that all of them are still fully functional so that -- if there is a failure -- you're pretty sure you can rebuild.

The downside of scrubbing is that, for better or worse, it does exercise your disks fairly heavily. Also, if you don't have a suitable trough period then you might even find it difficult to have the available I/O bandwidth to do it.

That said, you should do it if you're not.

[+] lmm|12 years ago|reply

I'd recommend ZFS for working with these kind of arrays; scrub is a command that's easy to regularly schedule, and will only use idle I/O bandwidth (helps that the RAID functionality is integrated with the whole filesystem).

If you have an infrequent scrub policy and hit bad sectors on rebuild, it can detect checksum failures and mark specific files as corrupted, rather than declaring all your disks defective. Traditional linux md raid behaviour is particularly bad in this regard: if you have a raid6 configuration and haven't been scrubbing, and then have a single disk failure, all your disks will have a few random isolated bad sectors (i.e. sectors that will URE when you attempt to read them) on, but since you still have one disk's worth of parity it's possible to recover all your data with no downtime (and with ZFS raidz2 this is what would happen). But with md raid as soon as you hit those bad sectors during the rebuild it will consider those drives as failing and kick them out of the array, and since all your drives have at least one bad sector on that means it's impossible to recover the array.

[+] andrewcooke|12 years ago|reply

everyone seems to be using this to push their favourite file system, which is fine and all, but if you're using software raid on linux this is the kind of thing you need:

    #!/bin/bash
    #
    # This script checks all RAID devices on the system
    # http://en.gentoo-wiki.com/wiki/Software_RAID_Install#Data_Scrubbing
    for raid in /sys/block/md*/md/sync_action; do
        echo "check" >> ${raid}
    done

(the link referred to seems to be down for me at the moment - i just took this from my main machine).

then add a crontab entry to run it once a week or so:

    0 0 * * 0 /root/bin/scrub-raid.sh &

you can check the script by running it by hand and then doing

    cat /proc/mdstat

where you'll see something like:

    Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] [linear]
    md1 : active raid1 sdb1[0] sdc1[1]
          976760640 blocks super 1.0 [2/2] [UU]
          [=>...................]  check =  6.0% (59419520/976760640) finish=191.3min speed=79902K/sec
          bitmap: 0/8 pages [0KB], 65536KB chunk

finally, my notes on this - http://www.acooke.org/cute/ScrubbingR0.html

[+] fulafel|12 years ago|reply

They didn't mention RAID, and they talk about using SSDs. The failure mode you describe (timeouts) is one typical for spinning rust drives, and not for SSDs.

[+] anigbrowl|12 years ago|reply

Scrubbing should be done regularly (weekly?)

I don't think this applies here. They just moved over to their new RAID setup last Saturday, so they had a major failure within about 72 hours of that transition.

[+] achille2|12 years ago|reply

Note that they did periodic backups, which would have scrubbed the same sectors as a full resync, so that's not the case.

[+] bigiain|12 years ago|reply

Reminds me of this (from many years ago):

"But even today a 7 drive RAID 5 with 1 TB disks has a 50% chance of a rebuild failure. RAID 5 is reaching the end of its useful life. "

http://www.zdnet.com/blog/storage/why-raid-5-stops-working-i...

[+] tjoff|12 years ago|reply

Although the fail characteristics and sizes of SSDs are not the same as conventional harddrives so you can't really make much out of that article in an SSD context.

[+] ck2|12 years ago|reply

If you aren't using Intel SSD in a server environment, you are going to have a bad time.

They are too relatively slow for home computer use but for servers, much more reliable.

That said, this chart concerns me:

http://www.ssdaddict.com/ss/Endurance_cr_20130122.png

25nm Vs 34nm http://google.com/search?q=cache%3Ahttp%3A%2F%2Fwww.xtremesy...

My first personal computer SSD is going to be the Samsung 830 from different batches in Raid1

Also, when someone else builds your servers, you should query the smart info from the drive to make sure they aren't used SSD.

[+] pja|12 years ago|reply

SSDs still seem to have a bunch of nasty failure cases. Right now, for production use I'm not sure I'd trust the things for reliable storage. As a fast cache for spinning rust? Definitely. As my only live copy, even duplicated in a RAID? Hmm.

Not even Intel SSDs are immune: One of the Debian developers has reported that the SSDs shipped in the latest Thinkpads die if you try and construct an encrypted filesystem on them. Somehow they corrupt themselves during the initial write of random data to the disk.

(Interesting that these SSDs died whilst under high write load too: is this a particular weak point for some reason?)

[+] coldtea|12 years ago|reply

>SSDs still seem to have a bunch of nasty failure cases. Right now, for production use I'm not sure I'd trust the things for reliable storage.

You should not trust anything for "reliable storage". That's what backups and redundant drives are for.

[+] ronilan|12 years ago|reply

Losing data s but the analogy is flawed.

“When all your engines stop, the flight is just starting"

[+] bdonlan|12 years ago|reply

When your engines stop, you trim for best glide ratio, point yourself at an open field, and fly the plane all the way to the ground. Failure to do so is a great way to get yourself killed.

[+] lloeki|12 years ago|reply

Not in a helicopter (although you can try autorotation, which is akin to trying to make a brick fall like a maple seed)

[+] anigbrowl|12 years ago|reply

BTW I should add that I'm still a fan of TOR and have no plans to switch away. In fact, their upfront explanation of what went wrong and how has improved my opinion of the project.

[+] chiph|12 years ago|reply

Sympathies. We had a similar thing happen at a previous job in the days of the "deathstar" drives. Lost a drive, no biggie. Tell the DC guy to replace it. Lost a second drive, told him to start running towards our cage. Lost a third drive, uhh - how current are our backups?

Different job - had a developer accidentally run a where-clause-less delete in production. Same net result. RAID and SANs are definitely not backup solutions.

[+] tetha|12 years ago|reply

Very much so. For us, raids are mostly time we can use to move all the important data off of that raid. It might be drastic, but it's safe and our data isn't too big.

[+] skore|12 years ago|reply

If your service goes down, be sure to at least have some notion of what your service does on the homepage. Right now, it's only an error report and I had to skim through the blog for a while to figure out that it's a kind of Google Reader replacement.

When your engines stop and you're getting lots of eyeballs, don't assume they all know you.

[+] achille|12 years ago|reply

Storage is cheap compared to engineering labour. That's a massive migration (from one DB to another), definitely not worth saving that 300gb.

It's still unclear how the storage failure was related to the migration. Was the new engine/fs disruptive to the SSD?

[+] Narkov|12 years ago|reply

They mentioned pretty explosive growth so 300GB today may be 3TB next month and 100TB in a few years.

[+] nknighthb|12 years ago|reply

The Old Reader is not a commercial venture. Time and money are not fungible when they're donations.

Edit re migration connection: Anything that stresses hardware already close to the edge can trigger a failure.

[+] unknown|12 years ago|reply

[deleted]

[+] coldtea|12 years ago|reply

>When all your engines stop, the flight is over

Pedantic and off-topic as it is, this is incorrect. At least for airplanes.

When all your engines stop you continue to glide, and you can even manage to land succesfully with a little skill and luck.

[+] praptak|12 years ago|reply

A notable case thereof, involving a Boeing 767: http://wikipedia.org/wiki/Gimli_Glider

[+] tjoff|12 years ago|reply

So, a glide to an emergency landing means that the flight is not over?

Similarly just because a rebuild fails doesn't mean that the data is lost. Just pop in the drives individually and fetch whatever you can from all of them, it is seldom the case that the exact same parts of them become irretrievable at the same time or that the drives stop function completely. It's just some more work than a regular rebuild.

[+] rxp|12 years ago|reply

Ouch, sounds like somebody got hit by a really bad batch of drives. Always a risk when you buy a bunch at the same time. :/

[+] nbevans|12 years ago|reply

In my experience, all SSDs are a bad batch unless they're Intel 520 or Samsung's high end range.

A big mistake to make is to fill an SSD drive completely up with data. This is a huge no-no unless it's an enterprise drive which usually use a totally different design.

[+] nknighthb|12 years ago|reply

Either those drives are all from the same bad batch, or this isn't a drive problem. I would be thinking either a cooling problem or flaky controllers, and a drive swap in that case is not the solution.

[+] akg_67|12 years ago|reply

Doubt those drives were from bad batch. As these were SSDs, most likely write wear. Two SSDs should never be in RAID 1 pair or replace at least one SSD in RAID 1 at predetermined schedule instead of replacing on failure only. They are going to wear off at same pace and will fail about the same time.

47 comments