Govern the Repo, Not the Agent: A New Risk Metric for AI-Native Code

AI-native software should be governed at the repository level, not one agent decision at a time. That is the central claim of arXiv:2606.28235, a preprint by Daniel Russo and colleagues released on June 26 that analyzes more than 930,000 agent-authored pull requests. Their finding reframes machine-generated code from a prompt-engineering problem into a supply-chain governance problem, one that accumulates in the repo rather than in any single agent.

Where does AI-native code risk actually live?

The paper argues that ecosystem-level risk is a property of the repository, not the agent. Russo et al. model integration friction across over 930,000 agent-authored pull requests and find that roughly half of the variation in friction remains associated with the repository after controlling for the contribution, the author, the change size, and the agent itself. That is a strong signal. If the agent were the dominant source of trouble, controlling for it would drain most of the repository-level variance. It does not.

The comparison to human contributions sharpens the point. Integration friction clusters by repository roughly twice as strongly for agent-authored changes as for human ones: the intraclass correlation (ICC), which is the share of friction variance attributable to the repository, is 0.30 for agent contributions versus 0.16 for human contributions. The gap persists after accounting for codebase size, age, task shape, process maturity, and merge path. The repository is not merely a passive container for whatever the agent produces. It actively modulates how much friction each contribution generates.
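For readers who have not worked with intraclass correlation, the sketch below shows the simplest textbook version of the statistic: a one-way random-effects ANOVA estimate that asks how much of the variance in a friction score sits between repositories rather than within them. It is an illustration only. The paper's actual model controls for contribution, author, change size, and agent, which this sketch does not, and the function name and the (repo_id, friction_score) input shape are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def icc_oneway(observations):
    """Estimate ICC(1) from (repo_id, friction_score) pairs.

    One-way random-effects ANOVA estimator. This is NOT the paper's
    model; it is the simplest version of the same statistic.
    """
    groups = defaultdict(list)
    for repo_id, score in observations:
        groups[repo_id].append(score)

    k = len(groups)                                   # number of repositories
    n_total = sum(len(v) for v in groups.values())    # number of pull requests
    if k < 2 or n_total <= k:
        raise ValueError("need >= 2 repositories and repeated observations")

    grand_mean = mean(s for v in groups.values() for s in v)

    # Sum of squares between repositories and within repositories.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    ss_within = sum((s - mean(v)) ** 2 for v in groups.values() for s in v)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective group size; equals the common size when groups are balanced.
    n0 = (n_total - sum(len(v) ** 2 for v in groups.values()) / n_total) / (k - 1)

    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)
```

Calling icc_oneway([("a", 0), ("a", 2), ("b", 4), ("b", 6)]) returns 14/18, about 0.78, because most of the spread in those four scores lies between the two repositories. Running the same function separately on agent-authored and human-authored pull requests gives the kind of side-by-side comparison the 0.30 versus 0.16 figures describe, minus the paper's controls.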

This matters because the current policy debate is mostly about agents. Vendors publish SWE-bench scores and safety red-teaming reports for individual models; enterprises buy guardrails that inspect prompts and outputs; audit teams ask whether Claude, Gemini, or Qwen is safe enough to deploy. The Russo paper suggests the wrong unit is being measured. A safe agent dropped into a brittle repository can still amplify risk, and a riskier agent in a well-governed repository may do less cumulative damage. The ecosystem is the level at which the risk aggregates, so the ecosystem is the level at which it should be governed.

What do 930,000 agent-authored pull requests show?

The PR data show that integration friction is sticky to repositories. Even when the authors account for who wrote the change, how large it is, what tool generated it, and how it reached the codebase, repository-level factors retain about half of the explanatory power. That means the same agent, working on the same kind of task, produces measurably different downstream outcomes depending on which repository receives the change.

The mechanism is not fully specified in the paper, but the result points toward several plausible culprits. Dependencies decay at different rates in different repositories. Test coverage is uneven. Build systems accumulate implicit assumptions that agents do not read. Code review norms and merge queues filter some classes of mistakes and not others. These are not agent properties. They are repository properties that shape how machine-generated code lands and propagates.

The finding also implies that counting successful agent runs is a poor proxy for safety. An agent can pass a unit test in isolation and still degrade the repository’s integration surface. The PR may merge, but it can leave behind new failure modes in CI, dependency conflicts, or undocumented coupling that the next agent will inherit. Risk compounds across commits, not within a single session.

Why do per-agent audits miss the accumulation?

Per-agent audits measure the model at a point in time, while the risk they are supposed to contain accumulates across the repository history. A benchmark score tells you what the agent can produce on a controlled task; it does not tell you what happens when its tenth, hundredth, or thousandth change interacts with every previous change in a specific dependency graph.

Agent-authored pull requests do more than insert logic. They update lockfiles, bump transitive dependencies, regenerate bindings, and reformat configuration. Each action is locally small but globally consequential. A single dependency mutation can change the vulnerability surface of the entire project. A regenerated binding can silently alter API semantics. Per-agent benchmarks rarely include these side effects because they test the agent on a fixed repository snapshot, not on the evolving ecosystem the agent is about to modify.

This gap is becoming visible in adjacent research. The A.S.E benchmark evaluates security in AI-generated code at the repository level using real-world CVE-grounded repositories. It finds Claude-3.7-Sonnet leads overall and Qwen3-235B-A22B-Instruct scores highest on security, but it also reports that concise fast-thinking decoding outperforms slow-thinking reasoning for security patching. That result is counter-intuitive if you evaluate agents as isolated reasoners; it makes more sense once you treat security as a repository-level property shaped by how patches interact with existing code and tests.

Agent routing research points in the same direction. Agent-as-a-Router formalizes model routing as a Context-Action-Feedback loop and reports that augmenting a vanilla LLM router with task-dimension performance statistics yields a 15.3% relative gain. Its CodeRouterBench contains roughly 10,000 task instances scored across eight frontier LLMs. The unit of improvement is not a better model; it is a better mapping between task properties and model capabilities across a repository’s actual task distribution.
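To make the routing idea concrete, here is a minimal sketch of a router that keeps per-task-type success statistics and uses them to pick a model. It is not the method from Agent-as-a-Router; the class name, the task-type strings, and the min_attempts threshold are all hypothetical, chosen only to show the context (task type), action (route), feedback (record) shape in a few lines.

```python
from collections import defaultdict


class StatsRouter:
    """Route tasks to the model with the best observed success rate."""

    def __init__(self, default_model):
        self.default_model = default_model
        # (task_type, model) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, task_type, model, success):
        """Feedback step: store the outcome of one routed task."""
        entry = self._stats[(task_type, model)]
        entry[0] += int(success)
        entry[1] += 1

    def route(self, task_type, min_attempts=3):
        """Action step: pick a model for this task type.

        Falls back to the default model until some model has at least
        `min_attempts` recorded outcomes for the task type.
        """
        best_model, best_rate = self.default_model, -1.0
        for (seen_type, model), (wins, tries) in self._stats.items():
            if seen_type != task_type or tries < min_attempts:
                continue
            rate = wins / tries
            if rate > best_rate:
                best_model, best_rate = model, rate
        return best_model
```

The statistics are keyed by task type rather than by model alone, which is the point of the paragraph above: the router improves by learning the repository's task distribution, not by swapping in a stronger model.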

When can repo-level evaluation work?

Repository-level evaluation is still mostly a research proposal, but this week has produced several working prototypes. The HORIZON framework treats hardware design as repository-level code evolution and reports 100% benchmark completion across ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories. The authors are careful to note that these are controlled proxies and the problem is not solved. The 100% figure is a completeness claim on the benchmark suite, not a declaration that AI-designed hardware is now reliable.

A.S.E provides the security analogue: real CVE-grounded repositories rather than synthetic snippets. Multi-agent evaluation adds another lens. Contagion Networks studies preference propagation in multi-agent LLM evaluator systems and finds that preferences consistently propagate between agents with gamma values between 0.157 and 0.352. Shared architectural priors dominate explicit prompts as the driver of contagion, but increasing committee size from k=1 to k=3 cuts effective contagion by 68.9%. Again, the intervention is at the system level, not the single-agent level.

What should platform and CI/CD teams do now?

Platform teams should treat machine-generated changes as a distinct input class in the delivery pipeline. That means labels, audit gates, and traceability, not just a better model card. If the risk lives in the repository, the controls must live there too.

The first practical step is to tag agent-authored commits at submission time, not as a stamp of shame but as a routing signal. CI should know whether a change came from a human, from an agent acting under human supervision, or from an autonomous agent so that it can apply the right test coverage and review thresholds. The Russo paper gives this practice a quantitative justification: because friction clusters by repository more strongly for agent contributions (an intraclass correlation of 0.30, versus 0.16 for humans), repositories need stronger integration gates for the agent input class.
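One way to carry the label is a trailer in the commit message that CI reads before choosing its gates. The sketch below assumes a hypothetical trailer named Change-Origin with three allowed values; neither the trailer name nor the policy numbers come from the paper, and the thresholds are placeholders a team would tune.

```python
from dataclasses import dataclass

# Hypothetical label values matching the three input classes in the text.
ORIGINS = ("human", "supervised-agent", "autonomous-agent")


@dataclass(frozen=True)
class GatePolicy:
    required_approvals: int
    run_full_integration_suite: bool


# Placeholder thresholds, not recommendations.
POLICIES = {
    "human": GatePolicy(required_approvals=1, run_full_integration_suite=False),
    "supervised-agent": GatePolicy(required_approvals=1, run_full_integration_suite=True),
    "autonomous-agent": GatePolicy(required_approvals=2, run_full_integration_suite=True),
}


def parse_origin(commit_message):
    """Return the Change-Origin value from the last paragraph, or None."""
    last_paragraph = commit_message.strip().split("\n\n")[-1]
    for line in last_paragraph.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "change-origin":
            value = value.strip().lower()
            return value if value in ORIGINS else None
    return None


def gate_for(commit_message):
    """Pick the gate policy; unlabeled changes get the strictest gate."""
    origin = parse_origin(commit_message)
    return POLICIES[origin or "autonomous-agent"]
```

Defaulting unlabeled changes to the strictest policy is a deliberate choice in this sketch: it makes a missing label fail safe rather than letting an untagged agent commit pass through the human gate.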

The second step is to measure integration friction as a repository health metric. Time-to-merge, CI failure rate, revert rate, and downstream breakages are not productivity metrics alone; they are risk signals. When an agent’s PRs consistently trigger more friction in one repository than another, the problem is likely the repository state, not the agent.
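A minimal sketch of that measurement follows, assuming pull request records have already been exported from CI with an origin label attached. The PullRequest fields and the friction_by_repo helper are hypothetical names for illustration; the metrics are the ones named above, apart from downstream breakages, which need data this sketch does not model.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class PullRequest:
    repo: str
    origin: str            # e.g. "human" or "agent" (assumed label)
    hours_to_merge: float
    ci_failures: int
    reverted: bool


def friction_by_repo(prs, origin):
    """Summarize friction per repository for one origin class."""
    buckets = defaultdict(list)
    for pr in prs:
        if pr.origin == origin:
            buckets[pr.repo].append(pr)

    report = {}
    for repo, items in buckets.items():
        report[repo] = {
            "prs": len(items),
            "median_hours_to_merge": median(p.hours_to_merge for p in items),
            # Share of PRs with at least one failed CI run.
            "ci_failure_rate": sum(p.ci_failures > 0 for p in items) / len(items),
            "revert_rate": sum(p.reverted for p in items) / len(items),
        }
    return report
```

Comparing the output for the same origin class across repositories is what separates a repository problem from an agent problem: if one agent's numbers diverge sharply between two repositories, the repository state is the first place to look.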

The third step is to extend SBOM-style traceability to generated code. A generated commit should carry metadata about the agent version, the prompt or task context, the reasoning depth used, and the approval path. Teams should also version the repository states that agents operate on. If an agent is allowed to propose changes against main, the repository should have a known-good baseline that the agent cannot overwrite without human review. Rollback paths should be automatic for agent-authored merges, at least during a calibration period. The goal is not to slow agents down unnecessarily; it is to prevent a chain of low-friction PRs from establishing dependencies that no one notices until the next agent inherits them.
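As a rough illustration of what such a record could look like, the sketch below serializes the fields named above into one JSON line per generated commit. The field names, the example values, and the JSON-lines layout are assumptions for this example; they do not follow any existing SBOM schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GeneratedChangeRecord:
    commit_sha: str
    agent_name: str
    agent_version: str
    task_context: str      # prompt or task reference
    reasoning_depth: str   # e.g. "fast" or "extended" (hypothetical values)
    approved_by: tuple     # reviewers on the approval path
    baseline_ref: str      # known-good repository state the agent worked from


def to_manifest_line(record):
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_manifest_line(line):
    """Rebuild a record from a JSON line written by to_manifest_line."""
    data = json.loads(line)
    data["approved_by"] = tuple(data["approved_by"])
    return GeneratedChangeRecord(**data)
```

Recording baseline_ref alongside the commit is what makes automatic rollback tractable: the pipeline knows which repository state to return to without reconstructing it from history.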

CI/CD vendors are the natural place for these controls to land. They already own the repository boundary. The open question is whether they will treat AI-generated changes as a first-class pipeline input or continue to route them through the same gates as human commits. The repository governance model Russo proposes only works if the pipeline can distinguish machine-generated change from human change without forensic guesswork.

What is still unclear?

The Russo paper proposes a metric, not a product or standard. Translating its repository-level risk model into production tooling will take time and will probably produce false positives along the way.

One open question is whether the 0.30 versus 0.16 intraclass correlation gap holds across private codebases with different tool chains and review cultures. Publicly visible pull requests are a large and observable sample, but they are not necessarily representative of enterprise repositories. Another question is whether the gap will shrink as agents improve or grow as more of the codebase becomes agent-authored. If agent contributions come to dominate the commit history, the repository itself becomes a record of accumulated agent decisions, and the distinction between agent-level and repo-level risk may blur.

There is also the human side. The AI Risk Repository analyzed 74 major AI-risk frameworks containing 1,725 distinct risks and found that human decisions cause nearly as many AI risks (38%) as the AI systems themselves (42%). Better repository governance cannot fix bad incentives, shortcut-taking reviewers, or organizational pressure to ship. It can only make the consequences more visible.

The immediate takeaway is clear enough: stop asking only which agent is safe enough to use, and start asking which repositories can safely absorb machine-generated change. That question is harder to answer, but it is the one that actually governs risk.

Frequently Asked Questions

Is the 0.30 intraclass correlation a ceiling or a floor on repository-level agent risk?

Most likely a floor. The Russo metric scores friction that surfaces at merge time: failed CI runs, reverts, merge latency. Latent defects that clear review and appear weeks later, such as a regenerated binding that quietly shifts API semantics, do not enter the calculation. The true repository-level share of agent risk therefore probably sits above 0.30.

What minimum CI instrumentation reproduces the input-class comparison inside a private codebase?

Four fields per pull request: an origin label (human, supervised agent, autonomous agent), merge outcome, CI failure count, and revert flag. The origin label is the gating prerequisite. CI logs already carry the other three, so without the label the 0.30 versus 0.16 split cannot be computed for any closed-source repository.

What does the Contagion Networks finding imply for agent code review committees?

The 68.9% contagion cut from one reviewer to three probably depends on the three coming from different model families. Shared architectural priors, not prompts, drive preference propagation, so three instances of one model would be expected to inherit the same contagion channel as a single reviewer. A single-family panel should behave more like one reviewer than like three.

Does the repository-level risk framing hold for artifacts beyond application code?

The quantitative evidence comes from application pull requests, but the mechanism plausibly generalizes. HORIZON applies the repository-as-evolution lens to hardware design, and agent PRs routinely mutate lockfiles, generated bindings, and infrastructure-as-code. The governance unit is any artifact class where small local edits create global coupling.

How does the agent versus repository split map onto existing software supply-chain governance?

Per-agent audits correspond to upstream component vetting, asking whether the model is safe to deploy. The repository-level intraclass correlation corresponds to downstream composition analysis, asking whether the assembled system holds. The Russo finding is the agent-code analogue of moving from CVE scanning of individual libraries to tracking how those libraries interact inside a specific build.