DARPA's AIxCC Postmortem: What Autonomous Cyber Reasoning Systems Got Right and Wrong

A new SoK paper accepted at USENIX Security 2026 gives the first systematic postmortem of DARPA’s AI Cyber Challenge (AIxCC), the 2023-2025 competition that pitted autonomous cyber reasoning systems (CRSs) against real open-source codebases. The headline numbers from competition organizers are striking: a 77% discovery rate on synthetic vulnerabilities, 45-minute average patch times, and 18 real zero-days found across 53 challenge projects. The fine print is more interesting. All seven finalist CRSs were largely unusable outside their competition cloud sandbox after the event ended, and the systems patched none of the 6 C-codebase zero-days despite patching 11 of the 12 Java ones. The gap between contest conditions and production is where the actual engineering lessons live.

What AIxCC Was and What the SoK Paper Adds

AIxCC ran from 2023 through August 2025, when the seven-team final concluded at DEF CON. It is a spiritual successor to DARPA’s Cyber Grand Challenge (CGC, 2014-2016), which focused on binary exploitation on a custom OS called DECREE. AIxCC moved to real-world open-source software in C and Java, drawn from 24 OSS-Fuzz repositories, with LLM infrastructure provided by Anthropic, Google, Microsoft, and OpenAI.

The SoK paper (arXiv:2602.07666, v4 updated May 29, 2026) draws on all seven finalist CRS codebases, the complete competition database, and discussions with organizers and every finalist team. That access matters: most prior coverage relied on DARPA’s own press announcements and a handful of DEF CON reports. The paper documents not just what the systems achieved but where competition design distorted CRS behavior, where patch validation broke down, and why none of the systems could survive contact with a different infrastructure after the contest.

The Competition Design: Scoring, Phases, and a Mid-Course Correction

The AIxCC final ran approximately 143 hours across seven phases. Seven finalist teams analyzed 53 challenge projects (48 scored), covering 63 challenge-project vulnerabilities. Each team received $85,000 in Azure compute credits and $50,000 in LLM API credits, which sets the resource envelope for interpreting the results.

Scoring was weighted to reward patching above all else: patches earned the highest point values per submission, Proof of Vulnerability (PoV) submissions earned less, and SARIF vulnerability assessments earned the least. Bundle submissions were scored on a signed scale, so an incorrect bundle could cost a team points rather than merely earn none. An accuracy multiplier penalized low-accuracy teams while leaving high-accuracy teams nearly unaffected. The scoring function, in other words, rewarded volume and boldness from teams that were already accurate, while punishing scatter-shot submissions.

Between the semifinal and final rounds, DARPA corrected two design flaws documented in the paper. First, the semifinal required teams to submit a self-PoV, a proof of vulnerability for each bug they claimed to find, before they could submit a patch. This disadvantaged CRSs that were strong at patching but weak at exploit generation, so the final scored patches against all teams’ PoVs rather than requiring each team to produce its own. Second, the semifinal prepared challenges repository by repository, which demanded significant engineering effort from organizers. The final switched to adopting OSS-Fuzz projects directly, which simplified challenge creation at the cost of less control over the vulnerability landscape.

What the CRS Architectures Actually Did

The paper classifies finalist CRS architectures into four pipeline stages: discovery (finding vulnerabilities), triage (prioritizing findings), patching (generating fixes), and bundling (aggregating related findings). Not every team implemented all four stages, and the paper documents considerable variation in how teams composed LLM calls, static analysis tools, and dynamic testing into their pipelines.
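The four stages compose naturally as a chain of functions. The following is a minimal sketch of that shape only, not any finalist's design: the `Finding` record, the hard-coded discovery results, the confidence threshold, and the group-by-file bundling rule are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """Hypothetical record for one candidate vulnerability."""
    location: str                 # "file:line" of the suspected bug
    confidence: float             # triage score in [0, 1]
    patch: Optional[str] = None   # filled in by the patching stage


def discover(target: str) -> list:
    # Placeholder for fuzzing, static analysis, or LLM-driven search.
    return [
        Finding(f"{target}/parser.c:120", confidence=0.9),
        Finding(f"{target}/parser.c:124", confidence=0.4),
        Finding(f"{target}/util.c:33", confidence=0.7),
    ]


def triage(findings: list, threshold: float = 0.5) -> list:
    # Drop low-confidence findings, then rank the rest, highest first.
    kept = [f for f in findings if f.confidence >= threshold]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)


def patch(findings: list) -> list:
    # Placeholder: a real stage would generate and validate a source diff.
    for f in findings:
        f.patch = f"fix for {f.location}"
    return findings


def bundle(findings: list) -> dict:
    # Assumed grouping rule: findings in the same file are "related".
    groups = {}
    for f in findings:
        file_name = f.location.rsplit(":", 1)[0]
        groups.setdefault(file_name, []).append(f)
    return groups


def run_pipeline(target: str) -> dict:
    return bundle(patch(triage(discover(target))))


bundles = run_pipeline("libfoo")
for file_name, group in bundles.items():
    print(file_name, [f.location for f in group])
```

A team that skipped a stage would simply drop the corresponding function from `run_pipeline`, which is one way to read the variation the paper describes.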

Most challenge-project vulnerabilities were hand-crafted synthetics inspired by historical N-day issues, deliberately designed to avoid AI training contamination. A small number of genuine zero-days were surfaced during challenge development. The challenge corpus spanned repositories from 16,000 lines of code (libexif) to 4.9 million (Wireshark), giving the systems a realistic range of target complexity.

The Numbers: Synthetic Vulnerability Performance

In the final scored round, CRSs identified 77% of synthetic vulnerabilities, up from 37% at the semifinals. They patched 61% of those synthetic defects, up from 25% at semifinals.

The improvement from semifinal to final is large enough to warrant scrutiny. The paper attributes it to both the scoring redesign (removing the self-PoV gate) and the teams’ iteration on their CRS pipelines between rounds. Whether it also reflects the shift from per-repository challenges to OSS-Fuzz adoption, which may have changed the vulnerability distribution, is not separated out in the analysis.

The C-vs-Java Gap and Language-Specific Tooling

Teams also discovered 18 real zero-day vulnerabilities: 6 in C codebases and 12 in Java. They provided patches for 11 of them. The distribution by language is sharp: none of the 6 C-codebase zero-days were patched automatically, while 11 of the 12 Java zero-days were.

The paper does not conclusively attribute this disparity to a single cause. Java’s memory safety eliminates entire classes of vulnerabilities (buffer overflows, use-after-free) that remain common in C, and the Java ecosystem’s mature static analysis and testing tooling gives CRS pipelines more reliable signals to work with. Whether this gap reflects intrinsic CRS capability differences by language or differences in the surrounding tooling ecosystem remains an open question. For practitioners evaluating autonomous patching for their own stacks, the language of the target codebase is a significant variable.

The Post-Competition Portability Problem

Perhaps the most telling finding in the SoK paper is not about what the CRSs did during the competition but about what happened after. The OSS-CRS project (arXiv:2603.08566, March 2026) attempted to run all seven open-sourced finalist CRSs on real-world OSS-Fuzz projects. The authors’ conclusion: all seven remained “largely unusable outside their original teams, each bound to the competition cloud infrastructure that no longer exists.”

The OSS-CRS team ported the first-place system, Atlantis, and ran it against eight OSS-Fuzz projects. It discovered 10 previously unknown bugs, 3 rated high severity. That a competition-winning CRS needed months of porting work to run outside its original environment is itself a finding. The CRSs were not designed for portability; they were designed to score well under specific competition constraints on specific infrastructure. The porting effort is documented in the OSS-CRS paper as substantial engineering work, not a reconfiguration exercise.

Where CRS Pipelines Overfit and Patch Validation Failed

The SoK paper’s lessons-learned analysis identifies several specific failure modes that competition conditions either masked or incentivized:

Overfitting to seeded vulnerabilities. Because the challenge vulnerabilities were synthetic and designed to be findable, CRS pipelines that optimized for competition scoring could develop heuristics tuned to the characteristics of planted bugs rather than organic ones. The paper documents cases where CRSs produced high-confidence vulnerability reports that matched the competition’s expected finding patterns but did not correspond to actual security issues in production forks of the same codebases.

Patch validation brittleness. Validating that a patch actually fixes a vulnerability without introducing regressions is harder than generating the patch. CRS pipelines that relied on test-suite pass rates as their validation signal could produce patches that passed existing tests but did not fully close the vulnerability or introduced subtle behavioral changes. The competition’s scoring function rewarded patch submissions regardless of patch quality beyond a passing test suite, which may have incentivized quantity over precision.

The accuracy-scoring interaction. The accuracy multiplier’s larger penalty at low accuracy rates and near-zero penalty at high accuracy rates created a strategic incentive for already-accurate teams to submit aggressively. Teams that were less accurate had to be more conservative, which could suppress their discovery and patch numbers. This is a sound competition design for differentiating finalists, but it means the raw numbers do not represent what the systems could do without scoring constraints.
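The patch-validation brittleness described above can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the competition's validator: the `Crash` exception stands in for a sanitizer-detected memory error, and the two parsers, the test cases, the PoV, and its variants are all invented. It shows a patch that passes the test suite and blocks the original PoV while still failing on nearby inputs.

```python
class Crash(Exception):
    """Stand-in for a sanitizer-detected memory error (hypothetical)."""


def parse_weak_patch(length: int, payload: bytes) -> bytes:
    # Rejects only the exact value used by the original PoV.
    if length == 255:
        raise ValueError("bad length")
    if length > len(payload):
        raise Crash("out-of-bounds read")   # the bug is still reachable
    return payload[:length]


def parse_full_patch(length: int, payload: bytes) -> bytes:
    # Rejects every declared length that exceeds the payload.
    if length > len(payload):
        raise ValueError("bad length")
    return payload[:length]


def run(parse, length, payload):
    """Return ('ok', value), ('rejected', None), or ('crash', None)."""
    try:
        return ("ok", parse(length, payload))
    except ValueError:
        return ("rejected", None)
    except Crash:
        return ("crash", None)


def validate(parse, tests, pov, variants) -> dict:
    return {
        # Functional behavior on well-formed inputs is unchanged.
        "tests_pass": all(run(parse, n, p) == ("ok", want) for n, p, want in tests),
        # The original proof of vulnerability no longer crashes.
        "pov_blocked": run(parse, *pov)[0] != "crash",
        # Inputs near the PoV do not crash either.
        "variants_blocked": all(run(parse, *v)[0] != "crash" for v in variants),
    }


TESTS = [(2, b"abcd", b"ab"), (4, b"abcd", b"abcd")]
POV = (255, b"abcd")
VARIANTS = [(5, b"abcd"), (254, b"abcd")]

weak = validate(parse_weak_patch, TESTS, POV, VARIANTS)
full = validate(parse_full_patch, TESTS, POV, VARIANTS)
print("weak patch:", weak)
print("full patch:", full)
```

A validator that checked only the first two signals would accept both patches. The third check is what separates them, and in practice it would come from fuzzing around the PoV rather than a hand-written variant list.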

The Triage Economics Question

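The average patch submission time was 45 minutes, and the average cost per competition task was approximately $152 in compute and LLM API consumption. For comparison, bug bounties for similar vulnerability classes typically range from hundreds to hundreds of thousands of dollars.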

Those economics are the reason DARPA and ARPA-H added $1.4 million in follow-on prizes for integrating AIxCC technology into real-world critical-infrastructure software. If autonomous systems can do credible first-pass triage on vulnerability reports at the speeds the competition demonstrated, the bottleneck shifts from finding bugs to validating that autonomous findings are real and that autonomous patches do not cause regressions. The competition did not test this validation loop under realistic conditions.

Team Atlanta (Georgia Tech, Samsung Research, KAIST, POSTECH) won $4 million; Trail of Bits won $3 million; Theori won $1.5 million. The prize structure is DARPA’s standard incentive model. What matters for practitioners is not who won but what the winning and losing systems had in common: dependency on competition-specific infrastructure, brittleness outside curated vulnerability distributions, and a patch-validation gap that competition scoring did not penalize heavily enough to force solutions.

The SoK paper makes a credible case that autonomous vulnerability discovery and patching is past the proof-of-concept stage. The OSS-CRS porting effort makes an equally credible case that the distance from competition prototype to production tool remains substantial. Whether the performance demonstrated under competition conditions holds outside a curated corpus, subsidized compute credits, and a simplified validation regime is the question that determines whether CRS technology changes how OSS vulnerability triage works or remains a DARPA showcase.

Frequently Asked Questions

What did each autonomous triage task cost in AIxCC?

The average cost per competition task was approximately $152, combining compute and LLM API consumption across the $85,000 Azure and $50,000 LLM credit allocations per team. That figure covers the full discovery-to-patch pipeline. Production deployments would lack subsidized competition credits, so actual per-finding costs depend on negotiated cloud and API pricing, but $152 gives a rough reference point for autonomous vulnerability triage.

How did the scoring point values shape team strategy?

Patches earned 3 to 6 points each, Proof of Vulnerability submissions earned 1 to 2, and SARIF assessments earned 0.5 to 1. Bundle submissions ranged from minus 7 to plus 7. The accuracy multiplier imposed a 6% penalty at 50% accuracy and 13% at 40%, while leaving teams above 90% nearly unpenalized. This created a structural incentive for accurate teams to submit aggressively, since the marginal cost of a wrong answer dropped to near zero once accuracy cleared the high threshold.
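The quoted penalties are consistent with a multiplier of the form 1 - (1 - r)^4, where r is a team's submission accuracy. That functional form is an assumption made here because it reproduces the figures above; the text does not state the formula. A short sketch under that assumption:

```python
def accuracy_multiplier(accuracy: float, exponent: int = 4) -> float:
    """Assumed form: 1 - (1 - r)^exponent, with r in [0, 1]."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return 1.0 - (1.0 - accuracy) ** exponent


for acc in (0.4, 0.5, 0.9):
    penalty = 1.0 - accuracy_multiplier(acc)
    print(f"accuracy {acc:.0%}: penalty {penalty:.2%}")
# accuracy 40%: penalty 12.96%
# accuracy 50%: penalty 6.25%
# accuracy 90%: penalty 0.01%
```

Under this form the penalty shrinks with the fourth power of the error rate, which is why one more wrong submission costs a 90%-accurate team almost nothing.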

Would these CRS results transfer to Rust or Go codebases?

The competition tested only C and Java. Rust eliminates entire memory-safety vulnerability categories that CRSs failed to patch in C, but it introduces different bug classes (logic errors, concurrency issues) that may require different analysis tooling. Go carries its own memory model and static analysis ecosystem. The SoK paper does not extrapolate findings beyond C and Java, and the performance gap between those two languages alone suggests that language-specific tooling maturity is a first-order variable.

What integration targets did the $1.4M follow-on prizes fund?

DARPA and ARPA-H directed follow-on prizes at critical-infrastructure software, with healthcare as an explicit focus. Speaking at DEF CON, HHS deputy secretary Jim O’Neill cited a 491-day average patch time in healthcare as the motivating problem, compared to 60 to 90 days in most other industries. The follow-on work aims to close the distance between the competition’s 45-minute synthetic patch turnaround and the compliance, validation, and regression-testing constraints of production infrastructure deployments.