Source: The Register
Article note: Written for a friend who asked and got more than they probably wanted:
It's the latest instance of an observation that keeps getting made.
DRAM ECC has been getting more common because the statistics of bit-flip errors get unfavorable as memories get bigger; DDR5 will have some ECC in all parts. Deliberately induced bit flips are basically how all the rowhammer-type attacks work.
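To make "ECC" concrete, here's a toy single-error-correcting Hamming(7,4) code. It's only a sketch of the general idea: real DRAM ECC uses much wider code words, and the function names here are made up for illustration.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit code word (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code_word):
    """Return (data_bits, syndrome); syndrome is the 1-based position
    of the flipped bit, or 0 if the word checked out clean."""
    c = list(code_word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single bit flip in "memory"
    print(hamming74_decode(word))
```

One flipped bit gets fixed silently. Two flips in the same word fool this particular code, which is why real designs add at least a detection bit on top.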
We're two generations into filesystems that do error correction and integrity checking (first the journaling FSes, then the zfs/btrfs stuff that actually does at-rest integrity verification).
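At-rest integrity verification boils down to "store a checksum with every block, check it on every read." Here's a minimal in-memory sketch of that idea. The `ChecksummedStore` class and its `corrupt` helper are hypothetical, and this is not how zfs or btrfs actually lay things out on disk.

```python
import hashlib


class ChecksummedStore:
    """Toy block store that keeps a SHA-256 digest next to each block."""

    def __init__(self):
        self._blocks = {}  # block id -> (data, digest)

    def write(self, block_id, data):
        data = bytes(data)
        self._blocks[block_id] = (data, hashlib.sha256(data).digest())

    def read(self, block_id):
        data, digest = self._blocks[block_id]
        # Verify on every read so silent corruption turns into a loud error.
        if hashlib.sha256(data).digest() != digest:
            raise IOError(f"checksum mismatch in block {block_id!r}")
        return data

    def corrupt(self, block_id, bit):
        """Test helper: flip one bit of the stored data, leave the digest."""
        data, digest = self._blocks[block_id]
        flipped = bytearray(data)
        flipped[bit // 8] ^= 1 << (bit % 8)
        self._blocks[block_id] = (bytes(flipped), digest)


if __name__ == "__main__":
    store = ChecksummedStore()
    store.write("a", b"hello world")
    store.corrupt("a", 3)
    try:
        store.read("a")
    except IOError as err:
        print("caught:", err)
```

This only detects the damage. Repairing it needs a second good copy (a mirror or parity) to read from instead.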
And CPUs being incomprehensibly complicated is biting us in the ass everywhere: simple stuff like the Pentium FDIV bug in the 90s, and now constant microcode updates to cover up problems (see sandsifter, Spectre, etc.).
People were already worried about this in the late 70s. Several mainframe designs supported processor consensus, and Intel's iAPX 432 parts (what was supposed to take over instead of x86) were expressly designed to run in lockstep groups and eject outliers; they have a special pin to set that behavior. (They also failed _hard_ for other reasons.)
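The consensus idea is simple enough to sketch in software: run the same computation on several units, take the majority answer, and flag whoever disagreed. This is just an illustration of voting in general, not a model of how the iAPX 432 or any mainframe did it in hardware; the `vote` function is made up for the example.

```python
from collections import Counter


def vote(results):
    """Return (majority_value, outlier_indices) for a list of results
    from redundant units; raise if no strict majority exists."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among redundant units")
    outliers = [i for i, r in enumerate(results) if r != value]
    return value, outliers


if __name__ == "__main__":
    # Three units compute the same thing; unit 2 has gone "mercurial".
    print(vote([42, 42, 41]))
```

With three units you can outvote one bad result. With only two you can notice a disagreement but can't tell which side is wrong.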
Rarely seen miscalculations now crop up frequently at cloud hyperscale
Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner.…