CVE-Factory Turns Published CVEs Into Security Agent Training Data. A 32B Model Beats Claude 4.5 Sonnet.

AI agents can reproduce known vulnerabilities at expert quality. A fine-tuned 32-billion-parameter model now beats Claude 4.5 Sonnet on a security benchmark built from real CVEs. The question is whether reproduction counts as finding, and what happens when the cost of generating vulnerability tasks drops to a few dollars per CVE.

From CVE metadata to executable exploit tasks

arXiv:2602.03012 introduces CVE-Factory, a six-stage multi-agent pipeline that takes published CVE metadata and produces fully executable vulnerability-reproduction packages: Dockerfiles, test scripts, and verified patches. The first three stages (decoupling) decompose the CVE into independent subtasks; the last three (coupling) verify the assembled environment works end-to-end.

Each agent operates as a full autonomous Claude Code session with a defined role, goal, and verification method, not a rigid tool-use loop. An Orchestrator manages agent activation, validates results, and routes feedback via continue/error/pause signals. The March 2026 update added four new agent types (Judger, Changer, Comparer, Expert) and switched tool access from an allowlist to a denylist, expanding what each agent can reach.
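The Orchestrator's routing can be pictured as a small control loop. The sketch below is illustrative only: the three signal names come from the description above, but their exact semantics, the `AgentResult` type, the `validate` rule, and the retry limit are all assumptions, not CVE-Factory's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    CONTINUE = "continue"  # assumed meaning: result accepted, activate next agent
    ERROR = "error"        # assumed meaning: result rejected, retry with feedback
    PAUSE = "pause"        # assumed meaning: halt and wait for intervention


@dataclass
class AgentResult:
    agent: str
    ok: bool
    needs_human: bool = False


def validate(result: AgentResult) -> Signal:
    """Hypothetical validation rule that maps a result to a routing signal."""
    if result.needs_human:
        return Signal.PAUSE
    return Signal.CONTINUE if result.ok else Signal.ERROR


def run_pipeline(agents, max_retries=2):
    """Activate agents in order; retry on ERROR, stop on PAUSE or exhausted retries."""
    log = []
    for name, run in agents:
        for attempt in range(max_retries + 1):
            signal = validate(run(attempt))
            log.append((name, signal))
            if signal is Signal.CONTINUE:
                break
            if signal is Signal.PAUSE:
                return log
        else:
            # Every attempt ended in ERROR: give up on the whole pipeline.
            return log
    return log


# Stub agents standing in for real sessions: the first fails once, then succeeds.
agents = [
    ("builder", lambda attempt: AgentResult("builder", ok=attempt >= 1)),
    ("verifier", lambda attempt: AgentResult("verifier", ok=True)),
]
log = run_pipeline(agents)
for name, signal in log:
    print(name, signal.value)
```

The point of the sketch is the separation of concerns: agents do open-ended work, while the Orchestrator only decides whether to advance, retry, or stop.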

The output is a complete task package, not just a reproduction script. That distinction matters because prior work like CVE-GENIE reproduced approximately 51% of 841 CVEs from 2024-2025 at $2.77 per CVE but produced task formats incompatible with agent training. CVE-Factory was designed to close that gap.

Expert parity, with a 24-point caveat

Cross-validated against human expert reproductions, CVE-Factory achieves 95% solution correctness and 96% environment fidelity. On 554 CVEs from 2025, the pipeline initially reproduced 499 (90.1%), according to the GitHub evaluation. After rigorous expert review of 471 successful cases, 312 (66.2%) were confirmed as complete and accurate reproductions. The roughly 24-point gap between the initial and verified success rates is the number that matters: initial reproduction counts anything that runs, while verified reproduction requires the exploit to match the original CVE conditions precisely.
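The rates above can be checked with a few lines of arithmetic. The sketch uses only the counts quoted in this section. The final "conservative" rate is not a figure from the paper: it assumes, purely for illustration, that every case outside the 312 confirmed ones counts as unverified.

```python
attempted = 554             # CVEs from 2025 fed into the pipeline
initially_reproduced = 499  # cases the pipeline reported as reproduced
reviewed = 471              # successful cases that experts reviewed
confirmed = 312             # cases confirmed complete and accurate

initial_rate = initially_reproduced / attempted * 100
verified_rate = confirmed / reviewed * 100
gap = initial_rate - verified_rate

# Assumption for illustration only: treat everything not confirmed as unverified
# and measure against all attempted CVEs.
conservative_rate = confirmed / attempted * 100

print(f"initial:      {initial_rate:.1f}%")
print(f"verified:     {verified_rate:.1f}%")
print(f"gap:          {gap:.1f} points")
print(f"conservative: {conservative_rate:.1f}%")
```

Note that the two headline rates use different denominators (554 attempted versus 471 reviewed), so the gap compares a pipeline-level rate with a review-level rate.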

A 32B model, 18,800 traces, and a commoditization signal

The most striking finding is economic, not technical. Fine-tuning Qwen3-32B (dubbed Abacus-cve) on CVE-Factory’s generated traces produces a model that scores 35.79% on LiveCVEBench, a benchmark of 190 tasks spanning 14 programming languages and 153 repositories. That is a 6.8× improvement over the base model’s 5.29%, and it surpasses Claude 4.5 Sonnet at 34.39%. The training data: approximately 18,800 agent traces from the Abacus-cve-v1.1 release in March 2026. An open-weight 32B model approaching frontier-model performance on security tasks, trained on synthetic data that costs dollars per trace to generate, is a commoditization signal. Offensive-security expertise is getting cheaper to distill.

Gains generalized beyond LiveCVEBench. The same model improved from 12.5% to 31.3% on Terminal Bench, according to the paper, suggesting the learned capability transfers to tasks outside the training benchmark rather than overfitting to it.
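For readers who want to verify the multipliers, the following sketch recomputes them from the scores quoted in this section. The helper name `fold_gain` is made up for this example.

```python
def fold_gain(before: float, after: float) -> float:
    """Return how many times larger the new score is than the old one."""
    return after / before


# Scores in percent, as quoted in the text.
livecve_gain = fold_gain(5.29, 35.79)    # Qwen3-32B base vs. Abacus-cve on LiveCVEBench
terminal_gain = fold_gain(12.5, 31.3)    # same model pair on Terminal Bench
margin_over_sonnet = 35.79 - 34.39       # percentage points on LiveCVEBench

print(f"LiveCVEBench gain:   {livecve_gain:.1f}x")
print(f"Terminal Bench gain: {terminal_gain:.1f}x")
print(f"Margin over Sonnet:  {margin_over_sonnet:.2f} points")
```

The lead over Claude 4.5 Sonnet is narrow in absolute terms, which is worth keeping in mind on a benchmark of 190 tasks.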

Reproduction is not discovery

CVE-Factory reproduces known vulnerabilities from published metadata. It does not discover new ones.

A separate benchmark, CVE-Bench, measures how well AI agents exploit unseen web vulnerabilities. State-of-the-art agent frameworks manage only 13% of critical-severity targets. The gap between 66% verified reproduction on known CVEs and 13% exploitation on novel ones is the distance between reading an answer key and writing one.

This is not a criticism of CVE-Factory, which explicitly targets benchmark construction and agent training. But conflating reproduction rates with discovery capability overstates what the system does. The pipeline proves agents can execute expert-level vulnerability tasks when given the structure. Whether they can find the structure themselves remains a different, harder problem.

The triage problem

If CVE-Factory can synthesize expert-grade vulnerability tasks by the thousand, and if fine-tuned smaller models can approach frontier performance on those tasks, the downstream pressure falls on open-source maintainers already drowning in low-signal bug reports.

When anyone can generate thousands of plausible vulnerability reports cheaply, the cost of producing noise drops toward zero while the cost of triage stays fixed. A maintainer reviewing 50 AI-generated bug reports to find one genuine finding is already operating at a loss. CVE-Factory’s 66% verified accuracy means roughly one in three synthetic tasks contains inaccuracies a human must catch.
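A back-of-the-envelope model makes the asymmetry concrete. The 50-to-1 ratio and the review counts come from the text; the 15 minutes per report is a hypothetical figure chosen only for illustration.

```python
def triage_hours(total_reports: int, minutes_per_report: float) -> float:
    """Hours a maintainer spends reading reports, whether or not they are genuine."""
    return total_reports * minutes_per_report / 60


MINUTES_PER_REPORT = 15  # hypothetical review time, not a measured value

# One genuine finding per 50 reports, as in the example above.
hours_per_finding = triage_hours(50, MINUTES_PER_REPORT)

# If report volume doubles at the same signal rate, triage time doubles too.
hours_if_volume_doubles = triage_hours(100, MINUTES_PER_REPORT)

# Share of reviewed CVE-Factory cases that experts did not confirm.
unconfirmed_share = 1 - 312 / 471

print(f"{hours_per_finding} h per genuine finding")
print(f"{hours_if_volume_doubles} h after volume doubles")
print(f"{unconfirmed_share:.0%} of reviewed cases unconfirmed")
```

Triage time scales linearly with report volume in this model, while the cost of generating each extra report keeps falling.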

What comes next

The CVE-Factory repo now contains 3,181 CVE task environments and serves as the backbone of LiveCVEBench, which is designed to be continuously updated to track emerging threats, including AI-tooling vulnerabilities in projects like PyTorch. The paper’s acceptance as an ICML 2026 Oral (v3 updated May 29, 2026) gives the work academic weight.

The roadmap points toward OneFactory, a unified pipeline covering terminal, SWE, and security tasks in a single framework. If the pattern holds, generating training data for every class of code-agent task becomes a mechanical process, not a research project.

The arms race is asymmetric. Generating offensive capabilities gets cheaper. Defensive triage does not. The CVE-Factory numbers are real and the technical contribution is sound. What the security ecosystem does with cheaply synthesized expertise is the part that has not been figured out yet.

Frequently Asked Questions

How does CVE-Factory relate to Anthropic’s Mythos vulnerability discovery work?

Anthropic’s Mythos system found 271 previously unknown security flaws in Firefox with a near-zero false positive rate, prompting Mozilla to state it had ‘completely bought in’ on AI-assisted bug discovery. CVE-Factory operates in the opposite direction: it reproduces already-disclosed vulnerabilities from published metadata rather than surfacing new ones. The two systems are complementary. Mythos generates novel findings; CVE-Factory converts disclosed CVEs into structured, executable training tasks.

What did the PatchEval cross-validation reveal beyond the 95% correctness figure?

The 95% figure reflects solution correctness measured against 215 human-expert-built PatchEval tasks. A stricter qualitative assessment found that 74% of CVE-Factory’s automated tests were rated equivalent or superior to the expert originals. The remaining 26% fell short primarily in edge cases where CVE metadata was ambiguous about the precise vulnerability trigger or affected version range. No amount of pipeline automation can fully compensate for incomplete upstream data.

How much distance remains between the fine-tuned 32B model and the strongest frontier model?

Claude Opus 4.5 leads LiveCVEBench at 41.27%, roughly 5.5 percentage points above Abacus-cve’s 35.79%. On PatchEval, Abacus-cve climbs from 5.66% to 23.58%, a 4.2× improvement that shows the gains carry over to a second benchmark. The frontier model retains its edge on the hardest tasks, but the open-weight model covers most common vulnerability patterns at a fraction of the inference cost.

What did 18,800 agent traces and 3,181 task environments actually cost to produce?

CVE-GENIE’s reference cost of $2.77 per CVE covers environment reconstruction alone, and CVE-Factory adds multi-stage verification overhead on top of that. Generating the full Abacus-cve-v1.1 training corpus likely cost somewhere in the low five figures (US dollars). For comparison, the 215 PatchEval tasks each required a domain expert to manually construct Dockerfiles, test scripts, and verified patches, a process that took months of specialized labor.
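The estimate can be sanity-checked with simple arithmetic. The sketch applies CVE-GENIE's per-CVE figure to CVE-Factory's environment count, which is a rough proxy rather than a reported cost, and then asks how much verification overhead would push the total into five figures.

```python
task_environments = 3_181        # environments in the CVE-Factory repo
reference_cost_per_cve = 2.77    # CVE-GENIE's figure, used here only as a proxy

# Lower bound: environment reconstruction alone, with no verification overhead.
env_floor = task_environments * reference_cost_per_cve

# Overhead multiplier at which the total crosses $10,000.
factor_to_five_figures = 10_000 / env_floor

print(f"environment floor: ${env_floor:,.2f}")
print(f"overhead factor needed for five figures: {factor_to_five_figures:.2f}x")
```

Under this proxy, environment reconstruction alone lands just below $9,000, so even a modest verification overhead is enough to reach the low five figures.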