Toshiba and WD NAND Production Hit by Power Outage: 6 Exabytes Lost

162 points | deafcalculus | 6 years ago | anandtech.com | reply

101 comments

[+] pbhjpbhj|6 years ago|reply
This doesn't make sense:

>Toshiba Memory and Western Digital on Friday disclosed that an unexpected power outage in the Yokkaichi province in Japan on June 15 affected the manufacturing facilities that are jointly operated. //

Surely that's not the reason, it would have to be "and local [backup] power failed, and the failovers for that failed too"??

Toshiba manufacture generators too, it's not like they'd need to go far to get backup power designed for them.

There must be more to this? (Which explains why people are assuming it's suspicious, I guess; and this site is making 35% of global NAND output).

FWIW, I hadn't realised that it takes ~2 months to process a wafer into a chip.

[+] 55873445216111|6 years ago|reply
Semiconductor fabs require MASSIVE amounts of electric power. In fact, to a first approximation, 5-10% of the cost of a silicon wafer is purely the cost of the electricity consumption. Source: worked in finance dept managing fab spend.
[+] tmaly|6 years ago|reply
There had to be a problem with backup systems.

In the US in certain industries, you have to do quarterly disaster recovery testing.

I am surprised they don’t do something like that here. The losses would certainly warrant it.

[+] wyxuan|6 years ago|reply
Yeah, I was surprised. Don't they have a UPS for this kind of thing?
[+] maheart|6 years ago|reply
Here's an article from one month ago discussing the over-supply of NAND and DRAM (and the effect it has on pricing): https://www.forbes.com/sites/tomcoughlin/2019/05/25/nand-dra...

I can't help but feel very skeptical about the timing of this event, given the history of price-fixing in the industry.

[+] GordonS|6 years ago|reply
These kinds of issues do seem to hit with a suspicious degree of regularity; it seems every 1-2 years there is a shortage due to some calamity or other...
[+] BubRoss|6 years ago|reply
A power outage no less. Seems like the sort of thing that could be solved.
[+] ksec|6 years ago|reply
Important to quote from the comment section

>Five fabs and an R&D center, outage was after the batteries also ran out.

For perspective, the batteries at GF's leading fab can run 1/3 of the systems for only a few minutes. That's the scale we're dealing with.

I think before we start on conspiracy theories, we need to look into why there was an outage in Yokkaichi.

[+] jsjohnst|6 years ago|reply
Batteries (and giant multi-ton spinning wheels, which serve the same purpose) are not a long term power supply. They are intended to only bridge the couple of minutes until generators can come online and provide stable power. So yes, it’s expected that they drained. The question is: why didn’t the generators come online?
[+] taspeotis|6 years ago|reply
I would be most grateful if someone could please explain what sort of tools are likely to be used here, and why a power loss to those tools would ruin days/weeks/months worth of output relative to the time they were offline?
[+] furi|6 years ago|reply
Semiconductor manufacturing involves a lot of precisely controlled processes. You put the wafers into furnaces and pass a gas over them for X time and Y flow rate at Z temperature to impregnate the wafer with various chemicals. You put them in low pressure plasma environments to etch them, again for X time at Y flow rate. There are half a dozen more of these as well, like applying metal and implanting ions.

These values are tuned experimentally to hit the desired effect as accurately as possible and to increase the number of working chips that leave the factory. If the power cuts out, you don't know what conditions the wafer experienced while the system was winding down completely uncontrolled, and your processes haven't been designed for a wafer going through the ramp-up twice.

The reason so much output was lost is that modern semiconductor processes have hundreds of steps and (I believe) a lead time in the months, so the amount of material that's in flight at any one instant has to be huge to get any reasonable throughput.
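
The in-flight argument above is essentially Little's law (average work in progress = throughput × lead time). A minimal sketch, where the wafer start rate and lead time are made-up numbers for illustration and do not come from the article:

```python
# Little's law: average work in progress = throughput * lead time.
# Both inputs used below are hypothetical, chosen only for illustration.

def wafers_in_flight(wafer_starts_per_day, lead_time_days):
    """Average number of wafers somewhere inside the process at any instant."""
    return wafer_starts_per_day * lead_time_days

if __name__ == "__main__":
    starts_per_day = 1000   # hypothetical fab throughput
    lead_time = 60          # hypothetical ~2 month lead time, in days
    print(wafers_in_flight(starts_per_day, lead_time))  # 60000
```

With a lead time measured in months, everything started over that whole period is sitting in the line at once, which is why a short event can touch weeks of output.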

[+] AFascistWorld|6 years ago|reply
Acids. The manufacturing is basically a matter of controlling the etching process to get rid of unwanted parts and keep the designed metal circuits.

When you lose power, you can't be sure whether the chips stayed in the acids for too long or too short a time, or were coated with the wrong amount of material. The uncertainty kills the yield rate, which can already be low since memory chips nowadays require repeated stacking.

Similar to https://i.stack.imgur.com/yTQqw.jpg

[+] kop316|6 years ago|reply
I imagine that the process is very highly pipelined and optimized, and I would imagine that they had some sort of backup (generator) that failed.

One analogy is to think about a batch script that you were working on that touches a lot of files (1000s). Now imagine power was cut and the batch script was interrupted because the computer turned off, but that computer hasn't been turned off in a long time (say it was a server).

First, you have to turn the server back on after a power outage. Were there corrupted files on it? You now have to get that server into a known working state, and if you have kept it on for years... then you may be in a world of hurt.

Now you've got your server up and running. You have the option of going through each of the 1000s of files your script was working on... but that will take time. Does it make sense to start from scratch? You will have to throw out all the files you were working on, but at least you can start that script again. You could attempt to salvage every file, but that will take time too.

[+] baybal2|6 years ago|reply
I recall the story of Micron's first Chinese fab: a 1 millisecond out-of-phase brownout and they loose a few megabucks instantly, and it's like that during every electrical event.

Giant UPSes are not an option in the industry because fabs eat oodles of electricity, and it is cheaper to loose a megabuck once a year than to build a stabilisation/UPS plant.

[+] vpribish|6 years ago|reply
my friend - it's lose, not loose.
[+] gruez|6 years ago|reply
So... does that mean they'll be hiking NAND prices, just like with HDD prices after the Thai floods?
[+] thesimp|6 years ago|reply
Looking at the numbers it should not move that much. According to this article, https://www.businesswire.com/news/home/20190307005812/en/TRE..., in 2018 912 exabytes of HD and SSD storage were sold: 800 exabytes for HD and 112 exabytes for SSD. And the SSD market grew 45% in 2018. If manufacturers project to grow at the same rate then 2019 SSD shipments will be around 162 exabytes. This puts the 6 exabyte loss at around 3.5%.

But we all know that markets are driven by emotion: losing 3.5% of your raw materials in a market that is projected to grow 45% will cause big fluctuations. But that is just my opinion.
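
The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch using the figures quoted in the comment, and it assumes, as the comment does, that 2018's 45% growth simply repeats in 2019; the function names are illustrative:

```python
# Figures as quoted in the comment above.
ssd_2018_eb = 112   # exabytes of SSD storage sold in 2018
growth = 0.45       # 2018 SSD growth rate, assumed to repeat in 2019
lost_eb = 6         # exabytes reported lost in the outage

def projected_shipments(base_eb, growth_rate):
    """Shipments after one more year of growth at the given rate."""
    return base_eb * (1 + growth_rate)

def loss_share(lost, total):
    """Fraction of the total that the lost amount represents."""
    return lost / total

ssd_2019_eb = projected_shipments(ssd_2018_eb, growth)
print(f"Projected 2019 SSD shipments: {ssd_2019_eb:.1f} EB")
print(f"Loss share: {loss_share(lost_eb, ssd_2019_eb):.1%}")
```

Run as written, this gives about 162.4 EB and a share of roughly 3.7%, in the same ballpark as the figure quoted above.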

[+] jacquesm|6 years ago|reply
No, it means prices will go up because there is a shortage, like with everything else. Sugar, gasoline and so on are good examples. HDD prices are just one more item that follows the supply/demand curve.

Sure there will be some clever parties that will make some money anticipating this. But that's the same reason why the price of the gas at the pump that was already in the tank jumps up because of a shortage somewhere else. The whole stock is instantly valued at a different price.

[+] otakucode|6 years ago|reply
I think that is entirely up to Toshiba/WD. The profit margin on NAND is so astronomical and the price charged for it is so completely decoupled from the cost of production (which is as close to nothing as anything gets) that they could afford to just absorb the 'loss', but it might mess with their projected schedules of how much they had expected to make, so I could see them jacking up the price to compensate. The market and society in general seem to be content with permitting the NAND manufacturers' price-fixing even when it's become absurd (do a tally of the raw materials and processes involved in producing 1TB of modern mechanical hard drive storage compared to a dumb regular parallel array of 1TB of NAND gates... it's ludicrous) so they've got whatever flexibility they feel like using.
[+] agumonkey|6 years ago|reply
First time I have to really think about Exa<unit>.

     Giga / Tera / Peta / Exa
6 million terabytes of solid state memory... quite a mass.
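
A quick sketch of the prefix ladder above, assuming decimal (SI) prefixes where each step is a factor of 1000 (binary prefixes would use 1024 instead); the `convert` helper is just an illustrative name:

```python
# Decimal (SI) prefixes in ascending order: each step is a factor of 1000.
PREFIXES = ["giga", "tera", "peta", "exa"]

def convert(value, from_prefix, to_prefix):
    """Convert a quantity between two prefixes in the list above."""
    steps = PREFIXES.index(from_prefix) - PREFIXES.index(to_prefix)
    return value * 1000 ** steps

print(convert(6, "exa", "tera"))  # 6000000
```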
[+] otakucode|6 years ago|reply
I know that NAND involves no exotic raw materials, so does that enable them to recycle any of the damaged/lost wafers? I don't know very much about the physical processing/preparation of the raw silicon and such that goes into making a wafer, could you simply grind up or perhaps chemically dissolve everything back to base components and re-create a fresh wafer?
[+] baybal2|6 years ago|reply
At least some scrap is now being bought by the solar cell industry, but that material is forever lost for IC making because it's already contaminated with dopants and metals.
[+] icefo|6 years ago|reply
I wonder what failed in their redundant power supply because they surely have something.

I hope the postmortem will be public !

[+] perlgeek|6 years ago|reply
> I hope the postmortem will be public !

So do I.

It turns out that backup power fails more often than one would hope.

Generators failing to come online, batteries not performing as expected despite recent maintenance, switching gear failing, or the switching gear's safety mechanisms preventing a successful switch, etc.

Source: I work for a smallish ISP, and have heard lots of stories from the ISP community, and am always eager to read about outages when there's a public postmortem.

[+] tzs|6 years ago|reply
> I wonder what failed in their redundant power supply because they surely have something

Based on comments on the site, it appears that even a very short power disruption can mess up semiconductor manufacturing.

If that is true, then a backup power system that involved detecting an outage and starting up generators might be too slow.

If based on generators, they'd either need to have the generators always running, or have a second redundant system based on batteries that can immediately take over during the time it takes to start the first redundant system.

Or they could run their stuff off batteries all the time, with the batteries charged from the grid. They will still need something that can very quickly switch to the grid in the case of their own battery powered inverters failing.

All of these are going to add complexity and cost that may drive up the effective cost of electricity enough that it may be cheaper in the long run to simply go with the grid, if they are in a place with a reliable enough grid.

Anyone know how reliable the grid is at their location?

[+] igravious|6 years ago|reply
If my reading comprehension has not let me down then a 13 minute power disruption can cause them to lose 1/2 of their output for a quarter.

Given the massive consequences of quite a short disruption maybe they need to figure out how to weather disruptions more robustly?

[+] YayamiOmate|6 years ago|reply
It seems weird that a 13-minute outage can kill a month and a half of production.

I wonder if this is standard for hi-tech factory process reliability.

[+] jotm|6 years ago|reply
I guess it makes more sense to destroy everything affected by a power loss (even if some of it could be perfectly fine, or salvageable) than risk shipping products that will fail at a higher rate. That would cost way more in lost trust and lost sales.
[+] social_quotient|6 years ago|reply
Conspiracy: this is how the NSA buys their disk space.
[+] glbrew|6 years ago|reply
That is an interesting idea but it would be so much easier for them to constantly buy, say 5%, of the output.
[+] jamiek88|6 years ago|reply
At least 12 exabytes upgrade.

Wow.

Maybe they switched to using Electron internally?!

[+] ggm|6 years ago|reply
The fragility of the supply chain.
[+] Rickvst|6 years ago|reply
The stock of these companies really took a hit. Sarcasm.
[+] loudouncodes|6 years ago|reply
Every time I see Godzilla he’s ensnared in and tearing down power lines. This was bound to happen sooner or later.