top | item 31017003


robotsteve2 | 3 years ago

Any sort of hardware or software error seems much more likely. Computers are incredibly complex and approximations are used everywhere (in the design of the hardware, in the theory of operation). I don't think inference-based experiments or analysis on cosmic ray bit flips are appropriate.

You really need some kind of dedicated cosmic ray detector nearby as a control. If the flux of cosmic rays into the detector is orders of magnitude lower than the rate of bit errors you ascribe to cosmic rays, it's probably some hardware/software issue and not the cosmic rays.
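As a back-of-envelope version of that control, one can compare the particle flux through a detector against the observed bit-error rate. All numbers below are illustrative placeholders, not measurements from the thread:

```python
# Rough sanity check: if bit errors vastly outnumber plausible particle
# hits, suspect a hardware/software issue rather than cosmic rays.
# All values are hypothetical for illustration.

SEA_LEVEL_MUON_FLUX = 1.0         # muons / cm^2 / minute (textbook ballpark)
DETECTOR_AREA_CM2 = 100.0         # hypothetical detector area
observed_bit_errors_per_day = 50  # hypothetical logged ECC events

muons_per_day = SEA_LEVEL_MUON_FLUX * DETECTOR_AREA_CM2 * 60 * 24
errors_per_muon = observed_bit_errors_per_day / muons_per_day

print(f"muons/day through detector: {muons_per_day:.0f}")
print(f"bit errors per muon:        {errors_per_muon:.2e}")
```

This ignores geometry and interaction cross-sections, so it only catches order-of-magnitude mismatches, which is exactly the comparison the comment proposes.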

discuss

order

jldugger | 3 years ago

Indeed, there was a study in an IEEE publication pointing out the absurdity of cosmic rays as a cause -- one point cited was that the vast majority of bit flips happen at specific points in the address space, essentially at page boundaries between chips.
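The clustering argument can be checked directly from an error log: a cosmic-ray-like process should spread flips roughly uniformly over addresses, while a hardware fault tends to pile up at particular offsets. A minimal sketch, with hypothetical fault addresses:

```python
from collections import Counter

# Hypothetical logged fault addresses. If one page offset dominates,
# the flips are not address-uniform, which points away from cosmic rays.
PAGE_SIZE = 4096
fault_addresses = [0x1000, 0x2000, 0x2000, 0x3000, 0x3000, 0x3000, 0x5A17]

offsets = Counter(addr % PAGE_SIZE for addr in fault_addresses)
most_common_offset, hits = offsets.most_common(1)[0]
fraction = hits / len(fault_addresses)
print(f"offset 0x{most_common_offset:x} accounts for {fraction:.0%} of flips")
```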

bqmjjx0kac | 3 years ago

I'm curious why that is evidence against the cosmic ray explanation.

Couldn't it have something to do with the physical layout of memory? Perhaps those page-boundary-adjacent addresses present a larger physical target, perhaps on the bus.

Of course I am wildly speculating right now. I'd love to see the article if you have a link!

cozzyd | 3 years ago

I'd be very interested in reading that article if you have a link (or title, or doi...)

AshamedCaptain | 3 years ago

I believe people use "cosmic rays" as a catch-all phrase for all of these very-low-probability error causes (just because of the coolness of cosmic rays), but in practice _any_ other cause is much more common than actual cosmic rays.

Even at the processor level, every single transistor has a rated mean time between failures (MTBF). Sure, it may be astronomical, but you have a lot of transistors, so in practice a random bit flip is not such a rare event. Designers actually explore MTBF-vs-power trade-offs here, and there is even a fascinating research area of fault-resilient computing.
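The "astronomical per-device rate times a lot of devices" arithmetic is easy to make concrete. A sketch using entirely illustrative numbers (the FIT convention counts failures per 10^9 device-hours):

```python
# Back-of-envelope: a tiny per-cell failure rate multiplied by tens of
# billions of cells gives a non-negligible system-level upset rate.
# Both the per-cell FIT and the cell count are illustrative, not measured.

HOURS_PER_YEAR = 8766
per_cell_fit = 1e-4     # hypothetical failures per 1e9 device-hours, per cell
num_cells = 8e9 * 8     # e.g. 8 GB of DRAM, one cell per bit

total_fit = per_cell_fit * num_cells                 # failures per 1e9 hours
failures_per_year = total_fit * HOURS_PER_YEAR / 1e9
print(f"expected upsets/year: {failures_per_year:.1f}")
```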

Every single clock domain crossing has its own MTBF (google "metastability"). Again, these are very high (billions of years if done properly), but a chip has plenty of such crossings (and the number keeps growing with modern, more asynchronous designs).
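The standard first-order model for synchronizer MTBF is exp(t_res / tau) / (T_w * f_clk * f_data), which shows why each extra flop of settling time buys exponentially more reliability. A sketch with illustrative parameter values (none of these come from the thread):

```python
import math

# Standard synchronizer metastability model:
#   MTBF = exp(t_res / tau) / (T_w * f_clk * f_data)
# All parameter values are illustrative, not from any particular process.

tau = 20e-12      # metastability resolution time constant (s)
T_w = 100e-12     # metastability window (s)
f_clk = 1e9       # sampling clock frequency (Hz)
f_data = 100e6    # asynchronous data transition rate (Hz)
t_res = 900e-12   # settling time before the next flop samples (s)

mtbf_seconds = math.exp(t_res / tau) / (T_w * f_clk * f_data)
print(f"MTBF: {mtbf_seconds / 3.15e7:.1e} years")
```

Because t_res appears in the exponent, adding one more synchronizer stage (more settling time) multiplies the MTBF enormously, which is why "done properly" gets you to billions of years.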

Processors are quite unreliable things.

throw10920 | 3 years ago

Ironically, even though more modern "asynchronous" CPU designs (really, just asynchronous communication between fully synchronous clock domains) create more opportunities for metastability, a fully asynchronous, self-timed design wouldn't have any likelihood of metastability at all!

gnufx | 3 years ago

Yes, but what you'd want to do is look for coincidences between a detector for a cosmic ray shower around (above?) the electronics you're monitoring with whatever it is these days that instruments ECC events. The time resolution would be pathetic for a nuclear physics experiment, but probably good enough.
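The coincidence search described here is just window-matching on two timestamp streams. A minimal sketch, with hypothetical timestamps and a coarse window (coarse is fine, per the comment):

```python
# Pair ECC events with detector hits that fall inside a time window.
# Timestamps (seconds) and the window width are hypothetical.

def coincidences(ecc_times, detector_times, window_s=0.5):
    """Return the ECC event times with a detector hit within window_s."""
    hits = sorted(detector_times)
    return [t for t in ecc_times
            if any(abs(t - h) <= window_s for h in hits)]

ecc = [10.0, 42.3, 99.9]      # hypothetical ECC event timestamps
shower = [10.2, 57.0, 100.1]  # hypothetical shower-detector timestamps
print(coincidences(ecc, shower))  # prints [10.0, 99.9]
```

With real data one would also estimate the accidental-coincidence rate (from the two singles rates and the window width) to judge whether the matches are meaningful.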

If you look at the ambient gamma-ray spectrum with a semiconductor detector (which would be germanium rather than silicon), the main background you see is typically from concrete; I'm ashamed to say I've forgotten the energy of the K-40 line, but it's in the region of 1500 keV. (Ironically, the large concrete blocks used for shielding would be regarded as a significant radiation hazard if all the activity in them were concentrated.)