The most common catastrophic failure you’ll see in SSDs: the entire drive simply drops off the bus as though it were no longer there.

Happened to me last week.

I just put it in a plastic bag in the freezer for 15 minutes, and it worked.

I copied the data to my laptop and then installed a new server.

But it doesn't always work like a charm.

Please always have a backup for documents, and a recent snapshot for critical systems.

serf|3 months ago

to be perfectly fair though, this isn't a new failure mode that arrived on the scene with SSDs.

drive controllers on HDDs just suddenly go to shit and drop off buses, too.

I guess the difference being that people expect the HDD to fail suddenly whereas with a solid state device most people seem to be convinced that the failure will be graceful.

toast0|3 months ago

> I guess the difference being that people expect the HDD to fail suddenly whereas with a solid state device most people seem to be convinced that the failure will be graceful.

This is exactly the opposite of my lived experience. Spinners fail more often than SSDs, but I don't remember any sudden failures with spinners. As far as I can recall, they all had pre-failure indicators: terrible noises (doesn't help for remote disks), SMART indicators, failed reads/writes on a couple of sectors here and there, etc. If you don't have backups, but you notice in a reasonable amount of time, you can salvage most of your data. Certainly, sometimes the drives just won't spin up because of a bearing/motor issue; but sometimes you can rotate the drive manually to get it started and capture some data.

The vast majority of my SSD failures have been the drive disappearing from the bus; lots of people say they should fail read-only, but I've not seen it. If you don't have backups, your data is all gone.

Perhaps I missed the pre-failure indicators from SMART, but it's easier when drives fail but remain available for inspection: look at a healthy drive, look at a failed drive, see what's different, look at all your drives, predict which one fails next. For drives that disappear, you've got to read and collect the stats regularly and then go back and see if there was anything... I couldn't find anything particularly predictive. I feel disappearing from the bus is more in the firmware error category than a physical storage problem, so there may not be real indications, unless it's a power-on-time-based failure...
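
For anyone who wants to try the "read and collect the stats regularly" approach, here is a minimal sketch of what that could look like. It is not what I actually ran. It assumes smartmontools 7.0 or newer (for `smartctl --json`), root privileges, and the JSON key names noted in the comments; the function names and the JSON Lines log format are made up for illustration.

```python
import json
import subprocess
import time


def read_smart(device):
    """Run `smartctl --json -a <device>` and return the parsed report.

    Assumes smartmontools 7.0+ and enough privileges to query the drive.
    smartctl's exit status is a bitmask, so non-zero is not treated as fatal.
    """
    proc = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout)


def flatten(report, timestamp=None):
    """Reduce one smartctl JSON report to a flat {name: value} dict.

    The key names used here (model_name, power_on_time, ata_smart_attributes,
    nvme_smart_health_information_log) are assumed from smartctl's JSON output.
    """
    row = {
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "model": report.get("model_name", ""),
        "serial": report.get("serial_number", ""),
        "power_on_hours": report.get("power_on_time", {}).get("hours"),
    }
    # ATA/SATA drives: one table entry per SMART attribute.
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        row[f"{attr['id']}_{attr['name']}"] = attr.get("raw", {}).get("value")
    # NVMe drives: a flat health log; keep only the numeric fields.
    for key, value in report.get("nvme_smart_health_information_log", {}).items():
        if isinstance(value, (int, float)):
            row[f"nvme_{key}"] = value
    return row


def append_snapshot(log_path, row):
    """Append one snapshot as a JSON line so history survives the drive."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")
```

Run something like `append_snapshot(log_path, flatten(read_smart("/dev/sda")))` from cron or a timer, and keep the log on a different machine or disk than the one being watched, otherwise the history disappears along with the drive.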

jandrese|3 months ago

I don't know how true this is, but it seems to me that SSD firmware has to be more complex than HDD firmware, and I've seen far more SSDs die due to firmware failure than HDDs. I've seen HDDs with corrupt firmware (junk strings and nonsense values in the SMART data, for example), but usually the drive still reads and writes data. In contrast, I've had multiple SSDs, often with relatively low power-on hours, just suddenly die with no warning. Some of them even show up as a completely different (and totally useless) device on the bus. Drives with Sandforce controllers used to do this all of the time, which was a problem because Sandforce hardware was apparently quite affordable and many third-party drives used their chips.

I have had a few drives go completely read only on me, which is always a surprise to the underlying OS when it happens. What is interesting is you can't predict when a drive might go read-only on you. I've had a system drive that was only a couple of years old and running on a lightly loaded system claim to have exhausted the write endurance and go read only, although to be fair that drive was a throwaway Inland brand one I got almost for free at Microcenter.

If you really want to see this happen try setting up a Raspberry Pi or similar SBC off of a micro-SD card and leave it running for a couple of years. There is a reason people who are actually serious about those kinds of setups go to great lengths to put the logging on a ramdisk and shut off as much stuff as possible that might touch the disk.

PunchyHamster|3 months ago

We have a fleet of a few hundred HDDs that is basically being replaced "on next failure" with SSDs, and sudden death is BY FAR rarer on HDDs; maybe one out of 100 "just dies".

Usually a drive either starts returning media errors or slows down (and if it is not replaced in time, a drive that is slowing down usually turns into one with media errors).

SSDs (at least a big fleet of Samsung ones we had) are much worse: just off, not even turning read-only. Of course we have redundancy so it's not really a problem, but if the same happened on someone's desktop they'd be screwed if they didn't have backups.

dale_glass|3 months ago

> I just put it in a plastic bag in the freezer for 15 minutes, and it worked.

What's that supposed to do for an SSD?

It was a trick for hard disks because on ancient drives the heads could get stuck to the platter, and that might help sometimes. But even for HDDs that's dubiously useful these days.

ssl-3|3 months ago

> It was a trick for hard disks because on ancient drives the heads could get stuck to the platter, and that might help sometimes.

Stuck heads were/are part of the freezing trick.

Another part of that trick has to do with printed circuit boards and their myriad connections -- you know, the stuff that both HDDs and SSDs have in common.

Freezing them makes things on the PCB contract, sometimes at different rates, and sometimes that change makes things better-enough, long-enough to retrieve the data.

I've recovered data from a few (non-ancient) hard drives that weren't stuck at all by freezing them. Before being frozen, they'd spin up fine at room temperature and sometimes would even work well enough to get some data off of them (while logging a ton of errors). After being frozen, they became much more compliant.

A couple of them would die again after warming back up, and only really behaved while they were continuously frozen. But that was easy enough, too: Just run the USB cable from the adapter through the door seal on the freezer and plug it into a laptop.

This would work about the same for an SSD, in that: If it helps, then it is helpful.

ahartmetz|3 months ago

Semiconductors generally work better the colder they are. Extreme overclockers don't use liquid nitrogen primarily to keep chips at room temperature at extreme power consumption, but to actually run them at temperatures far below freezing.

rcxdude|3 months ago

It could be due to a dodgy connection - changing temperature might make the two halves of a broken conductor touch again.

lvl155|3 months ago

Always make backups to HDD and cloud (and possibly tape if you are a data nut).

zamadatix|3 months ago

I don't think one should worry as much about which media they are backing up to as about whether they can answer the question "does my data resiliency match my retention needs".

And regularly test that restores actually work; there's nothing worse than thinking you had backups and then finding they don't restore right.
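
One way to make "test restores" concrete is to restore into a scratch directory and compare checksums against the live data. This is only a sketch under the assumption that the backup is a plain file tree you can restore somewhere; the function names are made up, and it only checks file contents, not permissions, ownership, or timestamps.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_dir, restored_dir):
    """Return a list of (relative_path, problem) for files that did not restore.

    problem is "missing" if the file is absent from the restore and
    "mismatch" if it is present but its contents differ.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif sha256_of(src) != sha256_of(restored):
            problems.append((rel.as_posix(), "mismatch"))
    return problems
```

An empty list means every file in the source tree came back byte-for-byte. Files that changed after the backup was taken will show up as mismatches, so this works best against a snapshot or data that is not being written to.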