The Bug That Isn’t a Bug: When the Hardware Is Lying to Your Firmware

Key Takeaways

  • A class of CPUs and accelerators called mercurial cores can return wrong results with no crash, no error flag, and no ECC alert. The fault is in the silicon, and standard firmware debugging will not find it.
  • Meta confirmed hundreds of CPUs with reproducible silent data corruption across hundreds of thousands of servers over 18 months. Google documented the same independently. Both converge on roughly one faulty device per thousand.
  • Embedded and edge systems lack the fleet-scale redundancy hyperscalers use for detection. The burden falls on embedded testing at bring-up, runtime self-checks, and firmware architecture decisions made early in the program.
  • As process nodes shrink and AI accelerators move into safety-critical products, validation workflows built on the assumption of correct silicon are leaving programs exposed.

A sensor fusion stack on an embedded controller starts producing occasional position drift. Not consistently. Not reproducibly under a debugger. It surfaces in field logs, disappears on the bench, and shows no correlation with software changes.

The firmware development team audits the floating-point handling, checks interrupt latency, and reviews the Kalman filter. The code is correct. The investigation runs for weeks before anyone seriously questions the processor, because passing board bring-up and formal qualification closes that question. Once silicon qualifies, most programs stop treating the hardware as a variable.

We have seen variants of this in industrial automation, medical equipment, construction tools, and safety devices.

What a mercurial core is

Google’s 2021 paper “Cores That Don’t Count” named the phenomenon “mercurial core”: an unpredictably changeable CPU core that intermittently returns a wrong result with no crash, no machine check, and no exception. The processor executes the instruction, writes back an incorrect value, and moves on. Nothing downstream knows the computation was wrong.

Meta published “Silent Data Corruptions at Scale” around the same time, documenting the same phenomenon from a fleet perspective. Their examples show exactly how this behaves in practice. One faulty CPU computed Int(1.1^53) as 0 when the correct result is 156. Another returned 32809 for Int[(1.1)^107] when the correct result is 26854.
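
On healthy silicon these are ordinary known-answer computations. A minimal sketch (assuming a standard double-precision libm `pow`; the function name `known_answer_ok` is illustrative, not from the paper) shows the reference results a defective core fails to reproduce:

```c
#include <math.h>

/* Known-answer check for the expressions cited above. On a healthy
   core both truncations land well clear of an integer boundary, so
   any conforming double-precision pow() gives the same results. */
static int known_answer_ok(void) {
    int r53  = (int)pow(1.1, 53);   /* healthy core: 156   */
    int r107 = (int)pow(1.1, 107);  /* healthy core: 26854 */
    return r53 == 156 && r107 == 26854;
}
```

The point of the example is that nothing about the inputs looks special: the fault is data-dependent, so only a check against a stored reference value exposes it.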

These are not random bit flips caused by cosmic-ray events. They are deterministic, data-dependent miscomputations caused by silicon manufacturing defects that are subtle enough to survive factory testing. The fault is dormant under most operating conditions and only activates under a specific combination of data pattern, instruction sequence, temperature, or voltage margin. A device can pass every qualification test and still carry one.

The scale that made it visible

Meta ran targeted stress tests across their server fleet for over 18 months and identified hundreds of CPUs with reproducible silent data corruption. Google’s independent findings are consistent. Both converge on approximately one faulty device per thousand in large-scale deployments.

That figure deserves context. A 2024 survey aggregating multiple pre-production test campaigns across several CPU families found that 3.61% of tested CPUs showed some form of silent data corruption, roughly three or four in every hundred, with most appearing only under testing more aggressive than standard qualification.

IEEE Computer and IEEE Micro, two of the field’s primary technical publications, both dedicated coverage to silent data corruption in 2024 and 2025. Semiconductor Engineering has published strategy pieces on detection across CPU, GPU, and accelerator architectures. This has moved from hyperscaler-specific research into the mainstream reliability conversation for silicon design and embedded systems validation.

Why embedded teams cannot use the same detection method

Meta and Google found mercurial cores by running identical workloads across hundreds of thousands of machines and comparing outputs. When one machine consistently disagrees with every other running the same computation, it becomes a suspect. The method works and requires a fleet to run it.
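
The method reduces to majority voting over identical results. A sketch of the comparison step, assuming at most one faulty machine per comparison set (the `find_outlier` helper is illustrative):

```c
#include <stddef.h>

/* Fleet-style comparison sketch: given the same workload checksum
   computed by n machines, return the index of the machine that
   disagrees with the majority, or -1 if all agree. Assumes at most
   one faulty machine in the set. */
static int find_outlier(const unsigned long long *sums, size_t n) {
    if (n < 3) return -1;   /* need at least three results to vote */
    /* with at most one fault, the majority value must appear at
       least twice among the first three entries */
    unsigned long long ref =
        (sums[0] == sums[1] || sums[0] == sums[2]) ? sums[0] : sums[1];
    for (size_t i = 0; i < n; i++)
        if (sums[i] != ref) return (int)i;
    return -1;
}
```

Three results is the minimum for a vote; hyperscalers get the same effect statistically from thousands of replicas running the same computation.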

A patient monitor deployed across a hospital network, an industrial safety controller on a factory floor, a gas analyzer installed across a dozen production sites: none of these run as a synchronized reference fleet. There is no comparison set to run against, and no statistical signal to catch an outlier.

The detection burden shifts entirely to embedded testing before deployment, and to how safety-critical firmware handles results that fall outside expected bounds during operation.

Where the exposure sits in embedded and edge systems

The consequences of a silent miscomputation depend on where it lands in the system.

In a data logging or device management pipeline (firmware telemetry, OTA update coordination, diagnostic reporting), a wrong intermediate value may average out, hit a downstream check, or get caught during routine data review. There is natural tolerance built into systems that aggregate and review outputs continuously.

In a safety-critical control path (a ventilator pressure regulation loop, a gas analyzer threshold trigger, an emergency shutoff actuator decision), the margin is narrower. ISO 26262 and IEC 62304 both assume deterministic hardware behavior. Safety cases are built on that assumption. Residual risk arguments cover software defects and defined hardware failure modes. Silicon that miscomputes and reports no fault sits outside that threat model, which means it also sits outside most existing mitigation strategies.

Silicon also changes over time. Electromigration, thermal cycling, and voltage stress narrow timing margins across years of field operation. A device that cleared qualification at bring-up can develop marginal arithmetic behavior in year three or four, with no component failure, no firmware change, and no observable transition point. Long-lifecycle programs in medical, industrial, and infrastructure applications carry this exposure by default.

Edge AI accelerators extend the problem further. NPUs and DSPs on current-generation nodes are moving into driver assistance, robotics, and industrial safety applications. Their validation stacks are still maturing relative to their architectural complexity, and the computational density that makes them effective also makes them more sensitive to manufacturing defects and aging effects.

Validation and firmware: where to focus first

The mitigations below are ordered by when they have the most leverage. Earlier decisions are harder to retrofit; later ones can be added incrementally but cost more per unit of coverage.

Stress compute units at bring-up, not just interfaces. Standard board bring-up confirms the device boots and passes functional checks at nominal conditions, but that is not sufficient for catching marginal silicon. Running arithmetic and vector operations at temperature and voltage corners with targeted edge-case data patterns is what surfaces mercurial behavior before qualification. An infusion pump controller that passes nominal tests may still miscompute a drug delivery calculation under the thermal load of continuous clinical use.
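
One way to structure such a test is to route the same operation through two independent execution paths and compare the results. The pattern set and the multiply decomposition below are illustrative choices, not a standard suite; a real campaign would loop this for hours at thermal and voltage corners:

```c
#include <stdint.h>
#include <stddef.h>

/* Bring-up compute stress sketch: run the same multiply over
   adversarial bit patterns through two independent paths (the
   hardware multiplier vs. shift-and-add) and count disagreements.
   On healthy silicon the two paths always agree. */
static const uint64_t patterns[] = {
    0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL,
    0xAAAAAAAAAAAAAAAAULL, 0x5555555555555555ULL,
    0x8000000000000000ULL, 0x0000000000000001ULL,
    0x00FF00FF00FF00FFULL, 0xDEADBEEFCAFEF00DULL,
};

static unsigned stress_pass(void) {
    unsigned faults = 0;
    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        uint64_t p = patterns[i];
        uint64_t via_mul   = p * 3u;        /* multiplier unit  */
        uint64_t via_shift = (p << 1) + p;  /* shifter + adder  */
        if (via_mul != via_shift) faults++;
    }
    return faults;
}
```

A nonzero fault count at any corner is grounds to reject the unit before it ever reaches qualification.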

Build output bounds into firmware from day one. For computations with physically constrained results (a flow rate, a motor torque estimate, a sensor fusion output), add range and consistency checks rather than passing the Arithmetic Logic Unit (ALU) result directly downstream. Low overhead at design time, expensive to retrofit into an existing firmware architecture later.
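
A sketch of what such a guard can look like for a hypothetical flow-rate loop; the limits and names (`FLOW_MIN_ML_H`, `MAX_STEP_ML_H`, `check_flow`) are illustrative assumptions, not values from any standard:

```c
/* Range and consistency check sketch for a physically constrained
   output. The ALU result is validated before it reaches anything
   downstream; limits here are made up for illustration. */
#define FLOW_MIN_ML_H   0.0
#define FLOW_MAX_ML_H 999.0
#define MAX_STEP_ML_H  50.0   /* max plausible change per sample */

typedef enum {
    FLOW_OK,
    FLOW_OUT_OF_RANGE,      /* physically impossible value        */
    FLOW_IMPLAUSIBLE_STEP   /* in range, but inconsistent history */
} flow_status_t;

static flow_status_t check_flow(double ml_per_h, double prev_ml_per_h) {
    if (ml_per_h < FLOW_MIN_ML_H || ml_per_h > FLOW_MAX_ML_H)
        return FLOW_OUT_OF_RANGE;       /* hold last safe value    */
    double step = ml_per_h - prev_ml_per_h;
    if (step > MAX_STEP_ML_H || step < -MAX_STEP_ML_H)
        return FLOW_IMPLAUSIBLE_STEP;   /* recompute or cross-check */
    return FLOW_OK;
}
```

The consistency check against the previous sample is what catches a miscomputation that happens to land inside the valid range.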

Structure telemetry for hardware diagnosis, not just software debugging. Out-of-range values, self-test failures, and unexpected state transitions should be logged with timestamp, input state, and device identifier. Teams we have worked with regularly find that field anomalies were present in logs well before a failure became reproducible, but the data was not structured to support that kind of retrospective hardware analysis.
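
A record with that structure might look like the following sketch; the field names are illustrative:

```c
#include <stdio.h>
#include <stdint.h>

/* Telemetry record sketch: capture enough context to attribute a
   field anomaly to hardware later, not just to reproduce a software
   bug. All field names are illustrative. */
typedef struct {
    uint64_t timestamp_ms;  /* monotonic time of the anomaly      */
    uint32_t device_id;     /* unit serial / silicon lot if known */
    uint32_t check_id;      /* which bound or self-test fired     */
    double   observed;      /* the out-of-range value             */
    double   input_state;   /* dominant input at time of failure  */
} anomaly_record_t;

/* One machine-parsable line per anomaly, so a retrospective query
   can group events by device rather than by symptom. */
static int format_anomaly(char *buf, size_t len, const anomaly_record_t *r) {
    return snprintf(buf, len, "ts=%llu dev=%lu chk=%lu obs=%.6g in=%.6g",
                    (unsigned long long)r->timestamp_ms,
                    (unsigned long)r->device_id,
                    (unsigned long)r->check_id,
                    r->observed, r->input_state);
}
```

Grouping by device identifier is the key design choice: a software bug spreads across the fleet, while a mercurial core concentrates on one serial number.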

Add periodic compute self-tests for long-lifecycle deployments. Re-executing a fixed set of arithmetic operations against stored reference values at startup or during idle windows catches degradation that develops after deployment. Define the system response to a self-test failure before that situation arises in the field.
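
A minimal sketch of such a self-test, using integer add and multiply against stored golden values; the vectors and the `compute_self_test` name are illustrative, and a real table would target the FPU, MAC, and vector units the product actually exercises:

```c
#include <stdint.h>
#include <stddef.h>

/* Periodic compute self-test sketch: re-execute fixed operations and
   compare against stored reference values at startup or in idle
   windows. Vectors chosen to exercise carry chains and full-width
   multiplies. */
typedef struct { uint64_t a, b, expected_sum, expected_prod; } golden_t;

static const golden_t golden[] = {
    { 3ULL, 4ULL, 7ULL, 12ULL },
    { 0xFFFFFFFFULL, 0xFFFFFFFFULL,
      0x1FFFFFFFEULL, 0xFFFFFFFE00000001ULL },
    { 0xAAAAAAAAAAAAAAAAULL, 1ULL,
      0xAAAAAAAAAAAAAAABULL, 0xAAAAAAAAAAAAAAAAULL },
};

static int compute_self_test(void) {
    for (size_t i = 0; i < sizeof golden / sizeof golden[0]; i++) {
        const golden_t *g = &golden[i];
        if (g->a + g->b != g->expected_sum)  return 0;
        if (g->a * g->b != g->expected_prod) return 0;
    }
    return 1;   /* all reference results reproduced */
}
```

The response to a failure (degrade, alert, halt) belongs in the safety case, decided before the first unit ships.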

Document explicitly where the program assumes correct silicon, and for each assumption, identify which test cases and telemetry signals would need to change if that assumption were relaxed. This turns a passive design choice into an auditable risk position.

Working with a V&V team that has traced hardware-attributed failures in regulated environments shortens the path considerably. PerformaCode engineers have run this investigation in medical device and industrial safety programs, including cases where the fault was initially logged as firmware and only attributed to silicon after systematic elimination across the stack. A team doing it for the first time under a field escalation is working on a different problem than one that has the reference patterns already.

Questions worth asking before the next program review

Most product and engineering leaders do not discover these gaps during planning. They surface during an escalation. A few questions that tend to find them earlier:

  • Did bring-up testing stress compute units at voltage and temperature corners, or did it confirm the device boots correctly at nominal conditions?
  • Do the computations feeding control decisions, safety interlocks, or diagnostic outputs have range or consistency checks, or does the firmware trust every hardware result unconditionally?
  • If a deployed unit started returning occasional wrong values, would field telemetry identify that as a hardware pattern, or would it be logged as an intermittent software issue and closed without root cause?
  • When was the validation approach last reviewed against the silicon generations and process nodes currently in production?

Programs that can answer these with specifics are in a different position than those where the honest answer is “we assumed the silicon was correct.” Closing that gap during development costs a fraction of what it costs once a field escalation is underway.
