
Had something similar last year because of a core router fabric issue. A few years ago, there was a batch of new servers with buggy motherboards corrupting/dropping packets. You can't begin to imagine how hard it was to diagnose.

That's in our own datacenters, not cloud.



> can't begin to imagine how hard it was to diagnose.

Yeah, when it happened to me, it completely threw me for a loop. We had reports of corruption in video files, which started the debug cycle. It was shocking when we isolated the box causing the issue.

But I guess your bigger comment has to be right: About the only way to have this sort of error is at the hardware level, because basic CRC checking should otherwise raise some sort of alarm.
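To make the CRC point concrete, here is a minimal sketch of how a checksum flags a single flipped bit. It uses CRC-32 from Python's zlib purely for illustration; the helper names (crc32_of, flip_bit, is_intact) and the sample payload are made up for this example, and real NICs and kernels do this in their own code paths.

```python
import zlib


def crc32_of(payload: bytes) -> int:
    """Return the CRC-32 of the payload as an unsigned 32-bit integer."""
    return zlib.crc32(payload) & 0xFFFFFFFF


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Return a copy of the payload with one bit inverted (simulated corruption)."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def is_intact(payload: bytes, expected_crc: int) -> bool:
    """Check the payload against the CRC computed by the sender."""
    return crc32_of(payload) == expected_crc


if __name__ == "__main__":
    original = b"hypothetical video frame data"
    sent_crc = crc32_of(original)

    damaged = flip_bit(original, 3)
    print("original intact:", is_intact(original, sent_crc))
    print("damaged intact: ", is_intact(damaged, sent_crc))
```

A check like this only helps if the corruption happens between the point where the checksum is computed and the point where it is verified. Damage introduced before the checksum is computed, or after it is verified, passes unnoticed.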


Keep in mind that hardware runs firmware. What is called a hardware issue can actually be a software issue.

It wasn't just one box for us. Basically, the part number was defective (motherboard NIC), every single one that was manufactured. This affected a variety of things, since servers are bought in batches and shipped to multiple datacenters, which made it damn near impossible to root cause.

CRC can be computed by the OS (kernel driver) or offloaded to the NIC. I think it's unlikely for buggy CRC code to ship in a finished product; someone would notice that nothing works.



