
> IBM estimated in 1996 that one error per month per 256 MiB of RAM was expected for a desktop computer.

From the Wikipedia article on "Soft error", if anyone wants to extrapolate.





That makes it vanishingly unlikely. On a 16GB RAM computer with that rate, you can expect 64 random bit flips per month.

So, roughly, you could expect this to happen once every two hundred million years.

Assuming there are about 2 billion Windows computers in use, that’s about 10 computers a year that experience this bit flip.
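For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes "this bit flip" means one specific bit out of all the bits in 16 GiB, that flips land uniformly at random, and that the quoted IBM rate applies as-is. The variable names are just for illustration.

```python
MIB = 2**20
GIB = 2**30

# IBM's 1996 estimate as quoted above: 1 error per month per 256 MiB.
ERRORS_PER_MONTH_PER_256_MIB = 1

ram_bytes = 16 * GIB

# Expected flips anywhere in RAM.
flips_per_month = ERRORS_PER_MONTH_PER_256_MIB * ram_bytes / (256 * MIB)
flips_per_year = flips_per_month * 12

# Assumption: each flip hits a uniformly random bit, so the expected wait
# for one *specific* bit is (number of bits) / (flips per year).
total_bits = ram_bytes * 8
years_per_specific_bit_flip = total_bits / flips_per_year

# Assumption from the comment: about 2 billion Windows machines in use.
windows_machines = 2_000_000_000
machines_hit_per_year = windows_machines / years_per_specific_bit_flip

print(f"flips per month per machine: {flips_per_month:.0f}")
print(f"years per specific-bit flip: {years_per_specific_bit_flip:.3g}")
print(f"machines hit per year:       {machines_hit_per_year:.1f}")
```

Under those assumptions the numbers land close to the ones in the comment: 64 flips a month, a bit under two hundred million years for a given bit, and on the order of 10 machines a year across the fleet.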


> 10 computers a year experience this bit flip

That's wildly more than I would have naively expected to experience a specific bit-flip. Wow!


Scale makes the uncommon common. Remember kids, if she's one in a million that means there are 11 of her in Ohio alone.

~800 bit flips per year per computer. 2 billion computers with 800 bit flips each is 1,600,000,000,000 (one point six trillion) bit flips.

Big numbers are crazy.


I personally saw a computer with 'system33' and 'system34' folders. Also, you would never actually know it happened because... it's not ECC. And with ECC memory we replace a RAM stick every two to three months, explicitly because its ECC error count is too high.

Got any old microwaves with doors that don't quite shut all the way nearby? Or radiation sources?

Nah, office building. And memtest confirmed that it was a faulty RAM stick.

But it was quite amusing to see with my own eyes: the computer mostly worked fine but occasionally would complain, "Can't load library at C:\WINDOWS\system33\somecorewindowslibrary.dll".

I didn't even notice at first; I just thought it was a virus or the consequences of a virus infection, until I caught that '33' thing. I went to check and there were system32, system33, system34...

So when the computer booted up cold in the morning everything was fine, but at some time and temperature the unstable cell in the RAM module started to fluctuate and mutate the original value of several bits. And it looks like it was at quite a low address, which is why it was often and repeatedly used by the system for the same purpose: either the storage of the system directory path for GetSystemDirectory or the filesystem MFT.

But again, it's the only time I had factual confirmation of a memory cell failure, and only because it happened in the right (or not so right, in the eyes of the user of that machine) place. How many times all these errors just silently go unnoticed, cause some bit rot, or just don't affect anything of value (your computer just froze, restarted, or you restarted it yourself because it started to behave erratically) is literally unknown, because that is not ECC memory.


Rounding that to 1 error per 30 days per 256M, for 16G of RAM that would translate to 1 error roughly every half a day. I do not believe that at all, having done memory testing runs for much longer on much larger amounts of RAM. I've seen the error counters on servers with ECC RAM, which remain at 0 for many months; and when they start increasing, it's because something is failing and needs replaced. In my experience RAM failures are much rarer than for HDDs and SSDs.
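To put a number on that skepticism, here is a quick sketch. It assumes errors arrive as a Poisson process at the rounded rate (1 error per 30 days per 256 MiB), which is my simplification and not something the IBM figure claims. The 180-day window and the function name are made up for illustration.

```python
import math

# Rounded rate from the comment: 1 error per 30 days per 256 MiB.
ERRORS_PER_30_DAYS_PER_256_MIB = 1.0

ram_mib = 16 * 1024  # 16 GiB expressed in MiB

errors_per_day = ERRORS_PER_30_DAYS_PER_256_MIB * (ram_mib / 256) / 30
hours_between_errors = 24 / errors_per_day


def prob_zero_errors(days: float) -> float:
    """Chance of seeing no errors at all in `days`, assuming Poisson arrivals."""
    return math.exp(-errors_per_day * days)


# Hypothetical observation window: ECC counters staying at 0 for ~6 months.
p_clean_half_year = prob_zero_errors(180)

print(f"mean time between errors: {hours_between_errors:.2f} hours")
print(f"P(no errors in 180 days): {p_clean_half_year:.3g}")
```

If the quoted rate held for modern RAM, an ECC counter sitting at 0 for months would be astronomically unlikely under this model, which fits the observation that real-world counters stay at zero until a module starts failing.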


