News & Analysis

What If Your Chip, Plane, Data Center Silently Failed?

How big a deal are silent errors that reportedly occur in chips at a rate of one in 5,000? This “glitch” could devastate hyperscalers such as Meta, Google and Microsoft, whose businesses must balance scale against sustainability.


By Junko Yoshida

What’s at stake:
Picture chips in data centers silently failing, leaving no trace in system logs. Such undetectable errors could spread steadily, like a contagion, across several services. Consider the same scenario in a two-engine airplane. Suppose one of the engines silently dies, unnoticed. After landing, the pilot takes off again for a new mission, assuming he has two functioning engines. One could say that this is impossible because the pilot can see, and hear, the busted engine. Unlike on the plane, a busted component in a hyperscaler’s data center can’t be seen or heard, and won’t kill anyone. But the silent crash of a critical component could trigger system-wide failures.

Designers, manufacturers and users of chips have long dreaded “soft errors,” in which chips struck by particles from cosmic rays suffer unexpected bit flips.

Meanwhile, hyperscalers have lately grown alarmed about “hard errors” in chips with a physical defect that either slipped through manufacturing test or developed gradually over a long deployment.

Both types of error are devastating to computing systems, especially when their “silence” affects critical missions. When a chip miscalculates or otherwise malfunctions without giving any indication that something has gone wrong, the phenomenon is called “silent data corruption” (SDC).

