groundy

Can Spec-Driven Development Keep AI Coding Agents From Drifting?

A June 2026 preprint makes spec-code divergence a blocking merge condition, treating traceability as a CI gate that catches silent drift before it compounds.


The Two Failure Modes: Context Explosion and Silent Spec-Code Drift

The Spec Growth Engine opens by naming two structural failure modes that it argues existing spec-driven agent work does not fully solve, and both are recognisable to anyone who has let a coding agent run past a single sitting.

The first is context explosion: the agent is forced to reason over an entire repository at once, and output quality degrades as the context window fills. The second is what Grabowski calls silent spec-code drift: code evolves, the specification does not, and the divergence becomes invisible until it is costly to repair.

The drift failure is the more interesting of the two because it is structural rather than capacity-bound. A full context window can, in principle, be relieved with a larger model or better retrieval. A spec that has quietly stopped describing the code has no such remedy. The ground truth of intent has already rotted, and the agent is now being graded against a memory of what the system was supposed to do, not against what anyone actually asked for last week. The paper’s contribution is to treat that rot as a first-class failure mode with a mechanism to prevent it, rather than as a hygiene problem teams are trusted to manage.

How the Spec Growth Engine Works: Graph, Spine, and Drift Gate

The framework anchors the agent to a machine-readable spec graph and makes divergence from it a merge blocker, not a style note.

The spec graph is the durable artifact. Its nodes carry an explicit separation between contract and design, so the externally observable behaviour of a component and the internal decision behind it are tracked as distinct concerns. A Spine context assembler scopes the agent’s working context to an ownership path through that graph rather than the full repository, which is the mechanism aimed at context explosion: the agent reasons over the slice it owns, not the whole tree. A vertical-slice growth protocol enforces a hardest-first ordering on how the system is built out, the abstract states.
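The scoping idea can be sketched in a few lines. This is a minimal illustration, assuming a tree-shaped graph in which each node records its parent; the names SpecNode, ownership_path, and assemble_context, the field layout, and the example nodes are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpecNode:
    """One node of a hypothetical spec graph."""
    node_id: str
    contract: str                  # externally observable behaviour
    design: str                    # internal decision behind it
    parent: Optional[str] = None   # owning node, None for the root
    paths: list = field(default_factory=list)  # source files the node owns


def ownership_path(graph, node_id):
    """Walk from a node up to the root and return the path root-first."""
    path = []
    current = node_id
    while current is not None:
        node = graph[current]
        path.append(node)
        current = node.parent
    return list(reversed(path))


def assemble_context(graph, node_id):
    """Build the agent's working context from the ownership path only."""
    lines = [f"[{n.node_id}] contract: {n.contract}"
             for n in ownership_path(graph, node_id)]
    # Only the target node contributes its design notes and files.
    target = graph[node_id]
    lines.append(f"[{target.node_id}] design: {target.design}")
    lines.extend(f"file: {p}" for p in target.paths)
    return "\n".join(lines)


graph = {
    "shop": SpecNode("shop", "Serves the storefront API", "Modular monolith"),
    "checkout": SpecNode("checkout", "Charges a cart exactly once",
                         "Idempotency keys", parent="shop",
                         paths=["checkout/charge.py"]),
    "search": SpecNode("search", "Returns ranked products",
                       "Inverted index", parent="shop",
                       paths=["search/index.py"]),
}
context = assemble_context(graph, "checkout")
```

In this sketch the sibling search node never enters the context for a checkout change, which is the property the Spine is described as providing. Ancestors contribute only their contracts, an assumption made here to keep design detail local to the node being edited.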

The drift gate is the load-bearing piece. Spec-code divergence is made a blocking merge condition, so an implementation that has drifted from its spec cannot land until the spec is reconciled. Traceability becomes a CI gate rather than a convention a team may or may not honour. That is the architectural bet in one sentence: move the spec from a document someone might read into a check that fails the build.
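One way such a gate could be wired is sketched below. It assumes, purely for illustration, that each spec node stores a fingerprint of the code it owns, taken when the spec was last reconciled; the article does not say how the paper detects divergence, so the fingerprint scheme, the field names, and the example data are hypothetical.

```python
import hashlib


def fingerprint(sources):
    """Hash file paths and contents in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(sources):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(sources[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def drift_gate(spec_nodes, sources):
    """Return ids of spec nodes whose owned code changed since reconciliation."""
    drifted = []
    for node_id, node in spec_nodes.items():
        owned = {p: sources[p] for p in node["paths"] if p in sources}
        if fingerprint(owned) != node["reconciled_fingerprint"]:
            drifted.append(node_id)
    return drifted


code = {"checkout/charge.py": "def charge(cart): ..."}
spec = {
    "checkout": {
        "paths": ["checkout/charge.py"],
        "reconciled_fingerprint": fingerprint(code),
    }
}
clean = drift_gate(spec, code)      # nothing has changed yet

code["checkout/charge.py"] = "def charge(cart, retry=True): ..."
blocked = drift_gate(spec, code)    # the checkout node now needs reconciling
# A CI job would exit non-zero whenever `blocked` is non-empty.
```

A content hash is the bluntest possible detector: it also fires on refactors that leave behaviour unchanged, so a real gate would need a cheap way to re-stamp a node after a human confirms the contract still holds.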

The synthesis is deliberately familiar. The paper folds in Parnas information hiding, C4 architecture notation, Architecture Decision Records, the Walking Skeleton pattern, Reflexion Models, and Fitness Functions, and positions the result as lean and code-coupled without the overhead of RUP or MDA. Whether “lightweight” survives contact with a repo that changes daily is a separate question.

The Wild Evidence: Agent Configs Are Already Treated as Static Artefacts

The failure mode the Spec Growth Engine targets is not hypothetical. It is already measurable in the field, in a study that landed the same week.

A concurrent preprint, A Deterministic Control Plane for LLM Coding Agents, examined 10,008 public GitHub repositories covering 6,145 agent configuration files. The findings read like a checklist of the conditions that let drift compound. 10.1% of tracked agent config paths are exact SHA-256 duplicates across independent repositories. 58% of agent configs have a single-commit history. Fewer than 1% declare permission boundaries, against 33% of CI/CD Actions workflows that do. And 75.5% of the duplicate config pairs cross organisational boundaries, which means the copying is not an internal convention but a pattern diffused across the ecosystem.
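The exact-duplicate figure is the kind of number a short script can reproduce on any sample of repositories. The sketch below is an assumption about how such a count could be made, not the study's pipeline: it hashes each config file's bytes and counts files whose digest also appears in a different repository. The repository names and file contents are invented.

```python
import hashlib
from collections import defaultdict


def duplicate_share(configs):
    """configs holds (repository, path, content) tuples.

    Returns the share of config files whose SHA-256 digest also
    appears in at least one other repository.
    """
    repos_by_digest = defaultdict(set)
    digests = []
    for repo, _path, content in configs:
        digest = hashlib.sha256(content).hexdigest()
        digests.append(digest)
        repos_by_digest[digest].add(repo)
    duplicated = sum(1 for d in digests if len(repos_by_digest[d]) > 1)
    return duplicated / len(configs)


sample = [
    ("acme/api", "AGENTS.md", b"Always run the tests before committing.\n"),
    ("globex/web", "AGENTS.md", b"Always run the tests before committing.\n"),
    ("initech/cli", "AGENTS.md", b"Prefer small, reviewable diffs.\n"),
    ("umbrella/etl", "AGENTS.md", b"Never touch the migrations folder.\n"),
]
share = duplicate_share(sample)
```

Two of the four sample files are byte-identical across repositories, so the share here is one half. Exact hashing undercounts copying, since a single changed character produces a different digest.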

The revision-rate numbers are the sharpest signal. Agent configs are revised at 0.4 commits per month, against 0.6 commits per month for CI/CD workflows, the same study reports, which is two-thirds the rate. Lower churn on a younger artifact type is the fingerprint of something treated as set-and-forget. If the harness configuration is static and unmanaged, the spec sitting above it loses what little hold it had: nothing downstream is enforcing that the code still matches what anyone wrote down.

The same paper proposes its own remedy, the Rel(AI)Build control plane, which treats agent definitions as a managed supply chain using SHA-256 content addressing, HMAC-stamped lockfiles, and hash-chained audit logs, and detects prompt drift via Jaccard similarity. This is complementary to the Spec Growth Engine, not a competitor. Rel(AI)Build governs the harness layer, the agent definitions themselves; the Spec Growth Engine governs the spec-to-code link. A team worried about silent drift would eventually want both, but they are solving different halves of the same problem.
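The three supply-chain primitives named above are standard constructions, and a compact sketch shows how they fit together. The lockfile layout, the function names, and the demo key below are assumptions for illustration only; they do not describe Rel(AI)Build's actual file format or API.

```python
import hashlib
import hmac
import json


def content_address(definition):
    """Identify an agent definition by the SHA-256 of its text."""
    return hashlib.sha256(definition.encode("utf-8")).hexdigest()


def stamp_lockfile(entries, key):
    """entries maps agent name to content address; the HMAC seals the mapping."""
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    mac = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"agents": entries, "hmac": mac}


def verify_lockfile(lockfile, key):
    payload = json.dumps(lockfile["agents"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, lockfile["hmac"])


def append_audit(log, event):
    """Each record commits to the hash of the record before it."""
    prev = log[-1]["hash"] if log else "0" * 64
    record_hash = hashlib.sha256((prev + event).encode("utf-8")).hexdigest()
    log.append({"event": event, "prev": prev, "hash": record_hash})


def audit_intact(log):
    prev = "0" * 64
    for record in log:
        expected = hashlib.sha256(
            (prev + record["event"]).encode("utf-8")).hexdigest()
        if record["prev"] != prev or record["hash"] != expected:
            return False
        prev = record["hash"]
    return True


key = b"demo-key-not-for-production"
lock = stamp_lockfile({"reviewer": content_address("You review diffs.")}, key)
lock_ok = verify_lockfile(lock, key)

log = []
append_audit(log, "reviewer pinned")
append_audit(log, "reviewer updated")
log_ok = audit_intact(log)
```

Editing a pinned address without the key invalidates the HMAC, and rewriting an earlier audit record breaks every hash after it. Neither check says anything about whether the code still matches its spec, which is why the two frameworks do not overlap.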

What Production Agents Actually Do: AgentX as a Contrast Case

A system that already lets agents rewrite production code in an industrial setting has no machine-linked spec layer at all, which is precisely the gap the Spec Growth Engine is built to fill.

AgentX, described as production-deployed at an industrial recommender platform, runs a closed loop across four stages. A Brainstorm Agent synthesises evidence from historical experiments, system architecture, data analysis, and external research into ranked, executable proposals. A Developing Agent translates each proposal into production-ready code through repository-grounded generation and multi-dimensional reliability verification. An Evaluation Agent runs safe online rollout with guardrail-vetoed A/B judgment, turning both successes and failures into structured knowledge assets. A Harness Evolution layer, labelled SGPO, distils execution trajectories into semantic-gradient updates that feed back into the agents themselves.

The loop is impressive as an engineering artefact. It is also a clean demonstration of the missing layer. AgentX verifies code for reliability and judges it against guardrails, but no specification artifact appears in the loop. The proposal, the code, and the online result are all tracked; the durable statement of intent is not. An AgentX-style system accumulating months of autonomous edits is exactly the environment in which silent spec-code drift would be hardest to detect after the fact, because every individual edit passed its local check. “The agent verified its own output” and “the output still matches the original intent” are not the same claim, and current production designs optimise hard for the first and assume the second.

The Honest Limits: A Design Proposal Without Outcome Data

The Spec Growth Engine is a design proposal, and the paper is explicit that it has no outcome data behind it.

There are no controlled experiments comparing spec-anchored development against prompt-driven development at real team scale. Developer outcome measures, defect rates, drift incidents averted, merge-gate friction, and maintenance burden are listed as future work, not reported. The concurrent config-management study confirms the problem the framework addresses is genuine and widespread, but confirming a problem exists does not validate a particular solution to it.

The “lightweight” framing deserves its own scrutiny. The paper sets the Spec Growth Engine against heavyweight predecessors like RUP and MDA and declares itself lean by comparison. That is almost certainly true relative to those baselines. It is less obviously true relative to the actual alternative most teams run, which is no spec discipline at all. Maintaining a machine-readable spec graph with contract/design separation, ownership paths, and a CI-enforced drift gate is a real, ongoing cost. Whether that cost is lower than the cost of debugging a codebase whose original intent exists only in an expired chat history is the empirical question the paper does not answer, and it is the only question that determines whether the framework gets adopted.

What Teams Should Actually Do Today

The defensible takeaway is narrower than the paper’s full frame, and it does not require adopting the whole architecture.

The single durable lesson is that a machine-checkable link between specification and code is the one mechanism on offer that can catch silent divergence before it compounds, even though the paper has not yet measured how well it does so. A team can plausibly capture much of the value without a spec graph by keeping at least one automated check in either direction: the spec generates assertions the code must satisfy, or the code generates a spec artifact a human can diff against intent. The form matters less than the fact that the check exists and runs on every merge.
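The spec-to-code direction can be as small as a table of examples that the build replays. The sketch below assumes a hypothetical spec entry that states a contract as input and output pairs; the SPEC layout, the node name, and the apply_discount function are invented for illustration.

```python
def apply_discount(total_cents, percent):
    """Implementation under test (illustrative)."""
    return total_cents - (total_cents * percent) // 100


# Hypothetical spec entry: the contract is written as concrete examples.
SPEC = {
    "node": "pricing.discount",
    "examples": [
        {"args": (1000, 10), "expect": 900},
        {"args": (999, 0), "expect": 999},
        {"args": (500, 100), "expect": 0},
    ],
}


def check_contract(func, spec):
    """Replay every spec example and collect the ones the code violates."""
    failures = []
    for example in spec["examples"]:
        actual = func(*example["args"])
        if actual != example["expect"]:
            failures.append(
                f"{spec['node']}: {example['args']} gave {actual}, "
                f"spec says {example['expect']}"
            )
    return failures


failures = check_contract(apply_discount, SPEC)
# A merge check would fail the build whenever `failures` is non-empty.
```

The examples live with the spec rather than with the tests, so changing the behaviour forces an edit to the stated intent, not just to the test file.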

The paper is explicit that its spec graph is not a heavyweight architecture document, and teams should resist the urge to treat it as one. ADRs and C4 notation work precisely because they are small, versioned, and written down as decisions rather than maintained as a synchronised model. The Spec Growth Engine wants the graph to behave the same way. The risk is that “machine-readable spec graph in sync with actively evolving code” quietly grows back into the RUP-style overhead the paper positions itself against.

The simpler reality the framework has to beat is this: most shops today let the diff be the source of truth, and the prompt that produced it is ephemeral. That works for a single feature and fails over a quarter, because nothing in the loop remembers what the system was supposed to do. The Spec Growth Engine names that failure mode clearly and proposes a structural fix. Whether the fix is worth its maintenance cost is a question only adoption and outcome data will settle, and as of June 2026, neither exists.

Frequently Asked Questions

How does Rel(AI)Build’s drift detection differ from the Spec Growth Engine’s drift gate?

Rel(AI)Build flags drift by computing Jaccard similarity between successive agent definition versions: the number of tokens the two versions share divided by the number of distinct tokens across both, a score between 0 and 1 that falls as prompt edits accumulate. The Spec Growth Engine’s drift gate is binary: it blocks a merge outright if the code no longer traces to a spec node. They also sit on different layers, agent definitions versus the spec-to-code link, so a team can run both without overlap.
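Jaccard similarity itself is a one-line set computation. The sketch below assumes whitespace tokenisation and a hypothetical alert threshold of 0.8; neither is stated in the source, and the two prompt versions are invented.

```python
def jaccard(a, b):
    """Shared tokens divided by distinct tokens across both texts."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


DRIFT_THRESHOLD = 0.8  # hypothetical cut-off, not taken from the paper

old_prompt = "review the diff and flag risky changes"
new_prompt = "review the diff and approve small changes"

score = jaccard(old_prompt, new_prompt)   # 5 shared of 9 distinct tokens
drift_flagged = score < DRIFT_THRESHOLD
```

Here two word swaps that reverse the prompt's intent leave more than half the tokens shared, which shows both why a numeric score is useful for trend tracking and why it cannot stand in for a semantic check.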

What concrete maintenance work does a drift gate add to a team’s existing CI?

Beyond a normal build pipeline, the team must keep a spec graph whose nodes separate contract from design, tag each code change with the spec node it implements, and reconcile the spec before any divergent merge can land. The recurring cost is editing that graph on every behavioural change, which is precisely the discipline most teams currently skip, and is the line item that decides whether the framework stays lean or collapses back into RUP-style overhead.
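The tagging step is the easiest part to automate. The sketch below assumes a hypothetical commit-message trailer of the form "Spec-Node: <id>"; the paper is not quoted as prescribing any tagging convention, so the trailer name, the commit ids, and the node ids are invented.

```python
import re

# Hypothetical convention: each commit names the spec node it implements.
TRAILER = re.compile(r"^Spec-Node:\s*(\S+)\s*$", re.MULTILINE)


def untraced_commits(commits, known_nodes):
    """Return commit ids with no trailer or with a trailer naming an unknown node."""
    missing = []
    for sha, message in commits.items():
        nodes = TRAILER.findall(message)
        if not nodes or any(node not in known_nodes for node in nodes):
            missing.append(sha)
    return missing


commits = {
    "a1b2c3": "Add retry to charge\n\nSpec-Node: checkout",
    "d4e5f6": "Tweak logging",
    "0718ab": "Rework ranking\n\nSpec-Node: serach",
}
untraced = untraced_commits(commits, known_nodes={"checkout", "search"})
```

The second commit carries no trailer and the third misspells its node, so both are reported. A check like this enforces that a link exists, not that the linked node still describes the change.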

Where does the Spine context assembler’s ownership-path scoping break down?

The ownership-path slice assumes a change lives inside one branch of the spec graph. Cross-cutting work, such as a logging or auth change that touches several ownership paths at once, forces a choice: expand context back toward the full repository and reintroduce the context explosion the Spine was built to avoid, or split the change across paths and lose the single coherent reasoning thread it is meant to preserve.
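Whether a change is cross-cutting can at least be detected before the agent starts. The sketch below assumes ownership is declared as path prefixes mapped to spec nodes, a layout invented here for illustration.

```python
def owners_touched(changed_files, ownership):
    """ownership maps a path prefix to the spec node that owns it."""
    touched = set()
    for path in changed_files:
        matches = [prefix for prefix in ownership if path.startswith(prefix)]
        if matches:
            # Longest prefix wins, so nested ownership resolves to the deepest node.
            touched.add(ownership[max(matches, key=len)])
    return touched


ownership = {
    "checkout/": "checkout",
    "checkout/tax/": "checkout.tax",
    "search/": "search",
}

local_change = owners_touched(["checkout/tax/vat.py"], ownership)
logging_change = owners_touched(
    ["checkout/charge.py", "search/index.py"], ownership)
is_cross_cutting = len(logging_change) > 1
```

A count above one does not resolve the dilemma, but it makes it explicit: the tooling can flag the change and leave a human to decide whether to widen the context or split the work.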

What outcome data would settle whether the Spec Growth Engine is worth adopting?

The decisive study is the one nobody has run: spec-anchored teams against prompt-driven teams on the same codebase, tracked over months for drift incidents caught, defect rates, and merge-gate friction. The paper’s claim to be lean also rests on a weak baseline, because RUP and MDA are heavyweight strawmen; the harder comparison is against no spec discipline at all, which is what most shops actually run today.

Does the drift gate help a small team shipping one feature in a single sprint?

Less than the paper implies. Silent drift needs enough accumulated edits for the spec and code to diverge beyond what one person still holds in memory, and a single sprint rarely produces that volume. The gate earns its maintenance cost once long-running or autonomous agentic work is generating diffs whose original intent no one on the team can still recall, which is the regime where the failure mode compounds invisibly.

sources · 3 cited

  1. A Deterministic Control Plane for LLM Coding Agents. arxiv.org (primary source), accessed 2026-06-27.