AI Productivity Metrics Don’t Tell You Whether the Embedded System Still Works

Key takeaways

  • AI speeds up implementation in embedded development, but HIL testing, validation cycles, and certification evidence requirements stay the same regardless of how the code was generated.
  • AI-generated embedded code can pass unit tests and survive code review while still violating timing contracts, hardware state assumptions, and protocol behavior that only surface on real devices under real conditions.
  • Every generated change adds to the validation surface. More code produced faster means more paths to test, more traceability to maintain, and more certification evidence to produce.
  • AI productivity metrics capture throughput at the implementation layer. They do not capture whether generated changes respect the timing budgets, integration contracts, and hardware constraints that determine whether the system actually works.
  • Vibe coding is particularly risky in constrained environments because AI-generated embedded code uses the right patterns and idioms, making it hard to reject on sight, while still missing device-specific timing, memory, and hardware behavior requirements.
  • The right review question for AI-generated code in embedded systems is not whether the function is correct in isolation. It is whether the change can survive the timing budgets, hardware states, and integration contracts of the system it is entering.

AI makes the code-writing part faster. That gain is visible immediately: a task closes sooner, a PR appears faster, and a developer spends less time drafting the first implementation.

The harder question is what happens after that code reaches the actual system. If the generated change fails on one hardware revision, breaks timing under load, or behaves differently on the real device than it did in tests, the saved hours can come back as days of integration debugging.

That is where AI productivity measurement gets difficult in system-level software. The easy part is counting how fast the code was produced. The harder part is knowing whether the time saved during implementation survives validation, integration, and field behavior.

What AI speeds up in embedded and system-level development

Drafting a peripheral driver skeleton, generating protocol message parsers, scaffolding hardware abstraction layers, producing boilerplate for device initialization sequences: these tasks take less time with capable tooling, and the generated output is often a reasonable starting point.

The things that get faster are also the easiest to count: commit volume, PR frequency, and task completion rate. Organizations building AI maturity frameworks can track this layer with reasonable confidence, and many of the better frameworks are doing exactly that.

But what those frameworks track well is not what governs whether a system-level product holds together under real conditions.

Why correct AI-generated code still breaks embedded systems

Production failures in embedded software, industrial control systems, robotics, medtech devices, and edge computing platforms rarely trace back to incorrect implementations of isolated functions. They trace back to constraints and interactions that were never visible at the unit level.

A generated implementation of a sensor polling loop runs correctly in test. What the model didn’t have context for is that the original loop was tuned to match the sensor’s wake-up timing on a specific hardware revision. The 250ms poll interval was not arbitrary; it reflected the device’s actual readiness window. The generated version cleans up the loop, adjusts the interval to something more regular, and introduces occasional missed readings under production load. Nothing in the unit tests catches this. The failure mode shows up in field data six weeks later.
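
One way to keep a constraint like this from living only in someone's head is to write it down next to the interval it governs. The sketch below is a minimal illustration, not the original firmware: the readiness bounds (240 to 260 ms), the revision name, and every identifier are hypothetical, since the article only gives the 250 ms figure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical readiness window for one hardware revision.
 * Only the 250 ms interval comes from the text; the bounds are assumed. */
typedef struct {
    uint32_t min_interval_ms; /* polling sooner finds the sensor still asleep */
    uint32_t max_interval_ms; /* polling later risks losing a sample */
} sensor_window_t;

const sensor_window_t SENSOR_WINDOW_REV_B = { 240u, 260u };

/* Returns true when a poll interval falls inside the readiness window. */
bool poll_interval_is_safe(const sensor_window_t *window, uint32_t interval_ms)
{
    assert(window != NULL);
    return interval_ms >= window->min_interval_ms &&
           interval_ms <= window->max_interval_ms;
}
```

With these assumed bounds, the tuned 250 ms interval passes and a tidier-looking 200 ms interval fails. If the interval and bounds are compile-time constants, the same comparison could be expressed as a C11 `_Static_assert`, so a regularized interval breaks the build instead of the field data.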

Or: AI rewrites a communications handler to be more idiomatic. The original used a non-blocking read with a timeout because the device on the other end could hold the bus during a calibration cycle. The generated version blocks. Under normal operation, this doesn’t matter. When calibration runs during a specific operational state, the main loop stalls for 400ms, which trips a watchdog, which causes an unplanned reset.
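
The difference between the two handlers can be sketched in a few lines. This is an assumption-laden illustration rather than the original code: the hook types, the fake tick, and the simulated bus are all hypothetical stand-ins so the example runs off-target. The point is only that a read bounded by a deadline returns control to the main loop even when the other device holds the bus.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { READ_OK, READ_TIMEOUT } read_status_t;

/* Hypothetical platform hooks, passed in so the sketch can run off-target. */
typedef bool (*try_read_fn)(uint8_t *out); /* true if a byte was available */
typedef uint32_t (*now_ms_fn)(void);       /* monotonic millisecond tick */

/* Polls without blocking until a byte arrives or timeout_ms elapses. */
read_status_t read_with_timeout(try_read_fn try_read, now_ms_fn now_ms,
                                uint32_t timeout_ms, uint8_t *out)
{
    assert(try_read != NULL && now_ms != NULL && out != NULL);
    const uint32_t start = now_ms();
    do {
        if (try_read(out)) {
            return READ_OK;
        }
        /* Unsigned subtraction stays correct if the tick counter wraps. */
    } while ((uint32_t)(now_ms() - start) < timeout_ms);
    return READ_TIMEOUT;
}

/* Simulation only: a fake tick and a bus that is held until a given time. */
uint32_t fake_tick_ms;
uint32_t bus_free_at_ms;

uint32_t fake_now_ms(void) { return fake_tick_ms++; } /* 1 ms per call */

bool fake_try_read(uint8_t *out)
{
    if (fake_tick_ms >= bus_free_at_ms) {
        *out = 0x5Au;
        return true;
    }
    return false;
}

/* Runs one read against a bus that stays held until bus_free_at. */
read_status_t simulate_read(uint32_t bus_free_at, uint32_t timeout_ms)
{
    uint8_t byte = 0u;
    fake_tick_ms = 0u;
    bus_free_at_ms = bus_free_at;
    return read_with_timeout(fake_try_read, fake_now_ms, timeout_ms, &byte);
}
```

In the simulation, a bus held for 400 ms with a 20 ms timeout yields `READ_TIMEOUT` and the caller moves on; a blocking read in the same situation would sit there for the full 400 ms. The 20 ms value is illustrative; a real timeout would be derived from the watchdog period and the rest of the loop's worst-case time.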

Or: generated retry logic for a CAN bus message retries immediately when no acknowledgment arrives. The original implementation backed off because the receiving node has a fixed message processing window. Immediate retries cause bus congestion during high-load periods. The system degrades under exactly the conditions where it needs to be most reliable.
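
A backoff policy of this kind might look like the sketch below. The article does not say how the original backed off, so the exponential schedule, the 10 ms processing window, the 160 ms cap, and the function name are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values: the receiver's processing window and a cap on delay. */
#define RX_PROCESSING_WINDOW_MS 10u
#define MAX_BACKOFF_MS          160u

/* Delay before retry number `attempt` (1-based).
 * The first retry waits one full processing window, and each later retry
 * doubles the wait until the cap is reached. */
uint32_t can_retry_delay_ms(uint32_t attempt)
{
    assert(attempt >= 1u);
    uint32_t delay = RX_PROCESSING_WINDOW_MS;
    for (uint32_t i = 1u; i < attempt && delay < MAX_BACKOFF_MS; ++i) {
        delay *= 2u;
    }
    return delay < MAX_BACKOFF_MS ? delay : MAX_BACKOFF_MS;
}
```

Under these assumptions the delays run 10, 20, 40, 80, and 160 ms, then stay at 160 ms. The property that matters is that no retry lands inside the receiver's processing window; a fixed delay of one window would satisfy that too, and which schedule is right depends on the bus load the system actually sees.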

These are not exotic failures. They are the normal failure surface of embedded and system-level software: constraints that live in hardware behavior, protocol timing, device state machines, and platform-specific assumptions that don’t appear in any function signature. AI tooling generates code from what it can see. What it can see is rarely the full constraint space.

The problem with AI “vibe coding” in embedded systems

There’s a working term for AI-assisted implementation done without enough system context: vibe coding. The output feels right. It matches the pattern. It uses familiar abstractions.

In application software, there’s a correction mechanism: production feedback is fast, deployment is reversible, and failure modes are usually operational rather than physical. A bad retry policy in a web service causes elevated latency and some downstream errors. You observe it, roll back, and fix it.

In embedded and system-level work, vibe coding is more dangerous, not because it’s informal, but because the plausibility of the output is high relative to how well it matches the underlying constraints. The model has been trained on large amounts of embedded code. It knows what a driver looks like. It knows interrupt service routine patterns, DMA buffer management, and I2C transaction structure. The generated code looks like embedded code, and in many senses it is. What it often doesn’t reflect is this particular device’s timing budget, this protocol’s error frame behavior, this platform’s memory map, this product’s regulatory validation envelope.

That surface validity makes the code harder to reject, which makes it more likely to enter the codebase, which in turn makes the constraint violations more likely to survive into hardware-in-the-loop testing or into the field.

Why faster code generation creates more validation work

In SaaS and application development, the rough model is: write code, review it, run tests, deploy, and observe. AI tooling accelerates the write step, and if the other steps are reasonably instrumented, you can see the effect.

Embedded and system-level development doesn’t work this way. Between “code exists” and “code is validated,” there is hardware-in-the-loop testing on actual device configurations, regression across hardware revisions and firmware variants, timing and load testing against real peripherals, protocol conformance testing against the physical layer, certification evidence generation, traceability documentation, safety analysis, and field-scenario simulation for failure and recovery paths.

Each of these takes time and lab resources that don’t scale on demand. You can’t provision more HIL rigs the way you provision compute. Validation matrices don’t compress because the upstream team started using Copilot.

AI-generated code is not free output. It is a new validation burden. Every generated change that enters a regulated or safety-adjacent codebase needs the same evidence trail as a manually written change. More code generated means more paths to test, more traceability to maintain, and more certification evidence to produce. If the generated code introduces corner cases the author didn’t anticipate, which it frequently does because the model was working with a partial constraint context, the validation burden increases nonlinearly.
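
A rough back-of-the-envelope model shows why the burden grows faster than the change count. Everything here is hypothetical: the function names, the idea that each change needs a fixed number of scenarios on every hardware revision and firmware variant, and the rig capacity are simplifying assumptions, not figures from any real lab.

```c
#include <assert.h>
#include <stdint.h>

/* HIL runs needed if every change is exercised in every scenario on every
 * hardware revision and firmware variant (a deliberately simple model). */
uint32_t hil_runs_required(uint32_t changes, uint32_t scenarios_per_change,
                           uint32_t hw_revisions, uint32_t fw_variants)
{
    return changes * scenarios_per_change * hw_revisions * fw_variants;
}

/* Lab days needed with a fixed number of rigs, rounded up. */
uint32_t lab_days_required(uint32_t runs, uint32_t rigs,
                           uint32_t runs_per_rig_per_day)
{
    assert(rigs > 0u && runs_per_rig_per_day > 0u);
    const uint32_t capacity = rigs * runs_per_rig_per_day;
    return (runs + capacity - 1u) / capacity;
}
```

With 3 hardware revisions, 2 firmware variants, and 2 rigs doing 30 runs a day each, 10 changes at 4 scenarios apiece need 240 runs, or 4 lab days. Double the changes to 20 and let unanticipated corner cases push scenarios per change from 4 to 6, and the total becomes 720 runs, or 12 lab days: twice the code, three times the lab time, on the same rigs.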

The throughput metric goes up. The validation pipeline becomes the bottleneck. Those two things are connected.

What AI productivity metrics miss in system-level software

The better enterprise AI measurement frameworks are not naive. The more serious ones track defect density by commit type, try to separate real delivery gains from AI activity that doesn’t translate to completed work, and in some cases attempt to correlate adoption patterns with downstream incident rates.

None of this is irrational. The difficulty is structural rather than methodological. Many of the expensive failure modes in system-level software don’t emit discrete, timestamped events that observability pipelines can consume. Architectural entropy is a property of a codebase that changes slowly and whose effects are distributed across future engineering efforts. A timing assumption violated in a generated driver doesn’t show up in the cycle time data. It shows up eventually in a HIL regression, a field return, or a support escalation that traces back through several layers to a change that looked fine when it merged.

Organizations can instrument prompt volume, acceptance rates, and PR throughput long before they can instrument whether generated changes are consistent with the timing contracts of the devices they run on. That’s not a criticism of the framework builders. It reflects something real about the observability of different engineering phenomena.

Throughput is legible. Constraint coherence is not.

Why faster AI-assisted development overwhelms HIL testing

Teams working in medtech imaging workflows, industrial control platforms, robotics, and long-lifecycle embedded products are already experiencing this, even if they’re not framing it this way.

The symptom is usually: we’re generating more code, our review queue is longer, and our HIL lab is at capacity. Acceleration at the generation layer does not compress the verification layers below it. Those layers exist because the system interacts with physics, hardware behavior, and operational conditions that have to be tested rather than inferred.

Firmware modernization projects are a particular case. AI-generated rewrites of legacy code can be structurally cleaner than what they replace and still break behavior the original code preserved through what looked like unnecessary complexity. The complexity wasn’t unnecessary. It was load-bearing in ways that weren’t written down. Catching that requires someone who understands the system well enough to ask whether the simplification is actually safe, which is a different skill from reviewing whether the code is correct.

Reviewing AI-generated code in embedded and constrained systems

When an engineering team reviews AI-generated code, the common question is some version of: does this code do what it’s supposed to do?

For system-level work, the more useful question is: can this change survive the system it is entering?

That requires understanding the timing budget of the execution path this code runs in, the hardware states possible when this function executes, the protocol behavior of the device it communicates with, including error frames and edge conditions, and what assumptions the rest of the codebase makes about this component. None of that is visible in the diff.
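
Part of that question can be made mechanical if the team keeps worst-case timings for each step of the execution path. The sketch below assumes such a table exists; the step names, the per-step timings other than the 400 ms stall from the earlier example, and the 100 ms watchdog with a 20 ms margin are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One step of the main loop with its worst-case execution time. */
typedef struct {
    const char *name;
    uint32_t worst_case_ms;
} loop_step_t;

/* Hypothetical budgets before and after a change to the comms handler. */
const loop_step_t LOOP_BEFORE[] = {
    { "read_sensors",   5u },
    { "comms_handler",  20u },  /* non-blocking read, bounded by timeout */
    { "control_update", 10u },
};
const loop_step_t LOOP_AFTER[] = {
    { "read_sensors",   5u },
    { "comms_handler",  400u }, /* blocking read during calibration */
    { "control_update", 10u },
};

/* Sums the worst-case time of every step in one loop iteration. */
uint32_t loop_worst_case_ms(const loop_step_t *steps, size_t count)
{
    assert(steps != NULL);
    uint32_t total = 0u;
    for (size_t i = 0u; i < count; ++i) {
        total += steps[i].worst_case_ms;
    }
    return total;
}

/* True when one iteration plus a safety margin fits the watchdog period. */
bool fits_watchdog(const loop_step_t *steps, size_t count,
                   uint32_t watchdog_timeout_ms, uint32_t margin_ms)
{
    return loop_worst_case_ms(steps, count) + margin_ms <= watchdog_timeout_ms;
}
```

Against the assumed 100 ms watchdog, the original loop (35 ms worst case) fits and the modified one (415 ms) does not. A check like this only covers the timing budget; hardware states, protocol edge conditions, and cross-component assumptions still need a reviewer who knows the system.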

Throughput metrics improve immediately. The constraint violations are invoiced later, in a different part of the organization’s tracking, and are attributed to something other than the implementation decision that started them. That accounting gap is the actual measurement problem, and in embedded and system-level work, the distance between those two layers is larger than in most other domains.

Whether teams close that gap deliberately or wait for the failures to make it visible is, at the moment, genuinely open.
