Can AI Agents Repair Broken Network Configs? A New Benchmark Tests It

Network misconfigurations remain one of the most persistent causes of critical Internet outages, and the promise of using LLMs to fix them automatically has been circulating in NetOps vendor materials for the better part of two years. A June 2026 paper by Rufat Asadli and colleagues (arXiv:2606.06212) finally puts a structured benchmark behind that promise. The short version: LLM agents augmented with formal verification tools repair 12% more misconfigurations than base LLMs and are 17% less likely to introduce new errors in the process. The longer version is less encouraging.

What the benchmark tested

The paper evaluates both open- and closed-source LLMs on the task of repairing misconfigured computer networks. The key architectural decision is the agentic loop: rather than presenting a raw LLM with a broken config and asking for a corrected one, the system wraps the model in two tool layers. Formal network verification tools check whether a proposed repair actually satisfies the network’s routing and policy constraints. Context retrieval tools let the agent pull in relevant topology information dynamically, rather than hoping everything fits in the prompt window.

This is not a trivial wrapper. According to the authors’ summary of the architecture, the performance gains come from the agent’s ability to iteratively validate proposed repairs against formal network standards before finalizing them. The LLM proposes a fix, the verification layer rejects or accepts it, and the loop continues. In theory, this guards against the most dangerous failure mode in network automation: a change that looks correct in prose but violates an implicit policy constraint.
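The propose-verify-retry cycle described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `repair_loop`, `toy_propose`, and `toy_verify` are hypothetical names, the "config" is plain text, and the verifier is a stand-in that returns a list of violated constraints.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces: a proposer maps (config, violations) to a new config,
# and a verifier returns the violated constraints (empty list means it passes).
Proposer = Callable[[str, List[str]], str]
Verifier = Callable[[str], List[str]]


def repair_loop(broken_config: str, propose: Proposer, verify: Verifier,
                max_rounds: int = 5) -> Optional[str]:
    """Return the first candidate that passes verification, or None."""
    candidate = broken_config
    feedback = verify(candidate)
    for _ in range(max_rounds):
        if not feedback:
            return candidate
        # Feed the verifier's complaints back into the next proposal.
        candidate = propose(candidate, feedback)
        feedback = verify(candidate)
    return candidate if not feedback else None


# Toy stand-ins so the sketch runs end to end.
def toy_verify(config: str) -> List[str]:
    if "permit 10.0.0.0/8" not in config:
        return ["missing permit for 10.0.0.0/8"]
    return []


def toy_propose(config: str, violations: List[str]) -> str:
    # A real agent would call a model here; this stub appends the missing line.
    return config + "\npermit 10.0.0.0/8"


fixed = repair_loop("deny any", toy_propose, toy_verify)
```

The design point is the bounded retry: if no candidate passes within `max_rounds`, the loop returns nothing rather than an unverified config.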

The headline numbers, in context

The two headline figures from arXiv:2606.06212 are a 12% improvement in repair efficacy and a 17% improvement in safety, both measured relative to non-agentic base LLMs. These are not trivial gains. A 17% reduction in the rate at which an automated system introduces new errors is the difference between a tool operators might consider deploying and one they would not.

But the baseline matters. The comparison is against unadorned LLMs, not against human network engineers. The paper does not claim agents outperform people. What it shows is that adding verification and context retrieval to an LLM yields a measurable improvement over asking an LLM to fix a config cold. Whether that improved system is good enough for production use is a separate question, one the paper addresses only indirectly.

Where agents still break

The paper is explicit about its limitations. Even state-of-the-art models fail to resolve misconfigurations in large-scale, complex scenarios and often introduce new errors, according to arXiv:2606.06212. This is not a footnote. It is a central finding.

The scale problem is quantified in the earlier Cornetto benchmark (arXiv:2604.22513), published in April 2026 by the same lead author. Cornetto synthesized 231 misconfiguration problems across network topologies spanning 20 to 754 nodes, covering diverse protocols, and evaluated 9 LLMs. The finding: LLM performance degrades as topology size increases, and models frequently introduce regressions. A model that handles a 20-node network correctly can silently break a 200-node network with the same class of misconfiguration, because the fix that works locally creates a policy violation elsewhere in the topology.

This is the specific failure mode that should concern operators evaluating self-healing infrastructure claims. It is not that the agent produces garbage. It is that the agent produces a plausible-looking fix that resolves the reported symptom while creating a new one the operator did not ask about.
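One way to make that failure mode visible is to compare every policy check before and after the repair, not just the one tied to the reported symptom. The sketch below assumes a hypothetical representation in which each policy maps to a pass/fail boolean; the policy names are invented for illustration.

```python
from typing import Dict


def classify_repair(before: Dict[str, bool], after: Dict[str, bool],
                    target: str) -> str:
    """Classify a repair by comparing policy results before and after it.

    `before` and `after` map policy name -> True if the policy is satisfied.
    `target` is the policy whose violation was originally reported.
    """
    broken_before = {name for name, ok in before.items() if not ok}
    broken_after = {name for name, ok in after.items() if not ok}
    if broken_after - broken_before:
        return "regression"   # something that used to pass now fails
    if target in broken_after:
        return "unresolved"   # the reported symptom is still present
    return "clean"


# Hypothetical example: the reported reachability problem is fixed,
# but an isolation policy that previously held is now violated.
before = {"reach(A,D)": False, "isolate(B,E)": True}
after = {"reach(A,D)": True, "isolate(B,E)": False}
verdict = classify_repair(before, after, target="reach(A,D)")
```

Here `verdict` is "regression" even though the operator's original complaint was resolved.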

Why the diff still needs human eyes

Both papers converge on the same architectural conclusion: reliable LLM-powered network automation requires integrating models into iterative workflows guided by formal verification, as arXiv:2604.22513 and arXiv:2606.06212 each argue. For operators, the practical corollary is that a human still has to review the diff. That is a constraint, not a suggestion.

The formal verification layer catches violations of stated network policy. It does not catch everything. Real networks carry implicit constraints: legacy exceptions, capacity planning assumptions, traffic engineering expectations that exist in an operator’s head but not in any formal model. An agent operating within a verified loop can produce a config that passes every formal check and still breaks something an experienced engineer would have caught on sight.

The agentic architecture described in the paper’s technical summary shifts the system from pure text prediction to active information management and validation. That is a genuine architectural improvement. It is also an acknowledgment that the model alone cannot be trusted with infrastructure changes.

What operators should take away

The practical question is not whether LLM agents can repair network configs in the general case. The benchmark shows they cannot do so reliably across large, complex topologies. The question is which specific classes of misconfiguration are safe to hand to an agent loop with formal verification, and which still require a human.

Based on the two papers, the boundary appears to run roughly along topology complexity. Small, well-constrained networks where the formal model captures most of the operational reality are reasonable candidates for automated repair. Large, heterogeneous topologies with undocumented constraints are not. The 12% repair improvement and 17% safety improvement from arXiv:2606.06212 are averages across all test cases. On the smaller topologies in the benchmark, the numbers are likely better. On larger ones, such as those at the upper end of Cornetto’s 20-to-754-node range, they are likely worse.

The honest framing for NetOps teams evaluating these tools is incremental. LLM agents with formal verification can catch and correct some classes of misconfiguration faster and more safely than unaided LLMs. They are not a replacement for the engineer who knows why that one static route exists. The benchmark’s contribution is giving operators a framework for deciding which is which, backed by numbers rather than vendor demos.

Frequently Asked Questions

What does a team need in place before deploying an agent loop like this?

The agentic architecture presupposes a machine-readable model of the network’s intended state and policy constraints. Without a formal verification backend (tools like Batfish or a custom reachability checker), the agent loop has no ground truth to validate against, and you are back to trusting raw LLM output. Teams also need their topology data in a structured format the retrieval tool can query, which is a nontrivial prerequisite for networks documented only in spreadsheets and tribal knowledge.
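To make "machine-readable intended state" concrete, here is a toy sketch of a structured topology plus a custom reachability checker. It is a stand-in for illustration only and does not use Batfish's API; the node names, the `Policy` tuple shape, and the function names are all assumptions.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Topology = Dict[str, Set[str]]   # node -> nodes it can forward to directly
Policy = Tuple[str, str, bool]   # (source, destination, should_be_reachable)


def reachable(topology: Topology, src: str) -> Set[str]:
    """Breadth-first search over the forwarding graph from src."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in topology.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def violations(topology: Topology, policies: List[Policy]) -> List[Policy]:
    """Return every policy whose expected reachability does not hold."""
    return [p for p in policies
            if (p[1] in reachable(topology, p[0])) != p[2]]


# Hypothetical four-node network: the guest segment should not reach r3.
topology = {"r1": {"r2"}, "r2": {"r3"}, "r3": set(), "guest": {"r2"}}
policies = [("r1", "r3", True), ("guest", "r3", False)]
failed = violations(topology, policies)
```

A network documented only in spreadsheets has neither the `topology` nor the `policies` structure available, which is why the verifier has nothing to check against.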

Do these findings transfer to cloud-native environments like VPCs and security groups?

The benchmark tested traditional network topologies with distributed routing and policy protocols. Cloud-native networking uses more declarative, centrally controlled configuration, so the cascading policy violations the papers found across multi-node topologies are less likely to take the same form. The formal verification approach should transfer, but the specific failure modes and regression patterns would need separate benchmarking against cloud control plane semantics.

Why can’t a larger context window solve the scale degradation problem?

The Cornetto results show performance degrading with topology size even when the model has access to the full configuration. The bottleneck is not information access but multi-hop policy reasoning: tracking how a route change at router A propagates through B and C to violate a constraint at D. Larger context windows give the model more text to process but do not inherently improve its ability to trace constraint interactions across hundreds of nodes. The verification loop is the load-bearing component, not the retrieval tool.
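The A-to-D propagation can be illustrated with a toy forwarding trace. The sketch assumes a hypothetical model in which each router has a single next hop toward one destination, and a waypoint policy requires traffic to cross a firewall node named FW; none of these names come from the papers.

```python
from typing import Dict, List

NextHops = Dict[str, str]  # router -> next hop toward a single destination


def trace_path(next_hops: NextHops, src: str, dst: str,
               max_hops: int = 64) -> List[str]:
    """Follow next hops from src until dst, a dead end, a loop, or the hop limit."""
    path = [src]
    while path[-1] != dst:
        nxt = next_hops.get(path[-1])
        if nxt is None or nxt in path or len(path) > max_hops:
            return path  # path stops before reaching dst
        path.append(nxt)
    return path


def traverses(path: List[str], waypoint: str) -> bool:
    return waypoint in path


# Hypothetical tables: before the change, A forwards via the firewall.
before = {"A": "FW", "FW": "C", "B": "C", "C": "D"}
# A one-line change at A still delivers traffic to D, but via B and C.
after = dict(before, A="B")

path_before = trace_path(before, "A", "D")
path_after = trace_path(after, "A", "D")
```

The edit touches only A's table, and D stays reachable, yet `traverses(path_after, "FW")` is false. The violation only shows up when the whole path is traced, which is the kind of check a verifier performs mechanically and a model has to reason through hop by hop.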

How can teams pressure-test a vendor’s self-healing networking claims against these numbers?

Cisco and Juniper have both published AI-driven network automation materials pitching automated repair capabilities. The academic benchmarks provide the kind of data these materials omit: how repair rates change as topologies grow and how often automated repairs introduce new errors. A team evaluating a vendor product could use the Cornetto problem set (231 misconfigurations across 20 to 754 nodes) as a standardized test suite to benchmark the vendor’s tool against published baselines.
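A harness for that kind of evaluation mostly needs to report two rates per topology-size bucket: how often the tool repaired the problem and how often it introduced a new error. The sketch below is one possible shape for it. The `Outcome` record, the bucket boundaries, and the sample rows are all invented placeholders, not results from either paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Outcome:
    nodes: int        # topology size for this test problem
    repaired: bool    # did the tool resolve the original misconfiguration?
    new_errors: bool  # did the tool's change violate a previously passing policy?


def size_bucket(nodes: int) -> str:
    # Bucket boundaries are arbitrary choices for illustration.
    if nodes <= 50:
        return "<=50"
    if nodes <= 200:
        return "51-200"
    return ">200"


def summarize(outcomes: Iterable[Outcome]) -> Dict[str, Tuple[float, float]]:
    """Return bucket -> (repair rate, new-error rate)."""
    grouped: Dict[str, List[Outcome]] = defaultdict(list)
    for o in outcomes:
        grouped[size_bucket(o.nodes)].append(o)
    return {
        b: (sum(o.repaired for o in g) / len(g),
            sum(o.new_errors for o in g) / len(g))
        for b, g in grouped.items()
    }


# Placeholder rows only, to show the output shape.
sample = [
    Outcome(20, True, False), Outcome(40, True, False),
    Outcome(120, True, True), Outcome(150, False, False),
    Outcome(754, False, True),
]
summary = summarize(sample)
```

Reporting the two rates separately matters because a tool can look strong on repair rate while quietly regressing other policies.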