When Bots and Agents Post CVEs in PRs, Reporters Inherit the Triage Burden

A pull request thread used to be a conversation among humans: the contributor and the reviewers. It now has three kinds of actor, because bots and coding agents post their own security commentary into the same space. When one of them drops a CVE reference, the discussion reads as already triaged, and the person who raised the concern inherits the job of proving it real. That inversion is the operational story, and it is playing out across thousands of popular repositories.

Who now posts security claims in a pull request?

A pull request’s security discussion is now a three-way conversation among humans, bots, and coding agents, each writing into the same PR titles, descriptions, review comments, commit messages, and timeline threads. A registered report published as arXiv:2606.28125v1 lays out a study that plans to distinguish explicit vulnerability references (CVE, CWE, GHSA) from implicit security signals such as “unauthorized access” or “SQL injection,” and to track all of them across the full lifecycle of a PR for each of the three actor types.
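The explicit-versus-implicit split can be sketched in a few lines of Python. This is only an illustration of the distinction, not the registered report's method: the regular expressions loosely approximate the public CVE, CWE, and GHSA identifier shapes, and the keyword tuple `IMPLICIT_TERMS` and the function `classify_mentions` are hypothetical names introduced here.

```python
import re

# Explicit identifiers. These patterns are loose approximations of the
# public ID shapes (assumption), not a validated grammar.
EXPLICIT = re.compile(
    r"\b(CVE-\d{4}-\d{4,}|CWE-\d+|GHSA(?:-[0-9a-z]{4}){3})\b",
    re.IGNORECASE,
)

# Implicit signals. A deliberately tiny, hypothetical keyword list built
# from the two examples the article quotes.
IMPLICIT_TERMS = ("unauthorized access", "sql injection")


def classify_mentions(text: str) -> dict[str, list[str]]:
    """Split security signals in a PR text field into explicit and implicit."""
    # Upper-case and de-duplicate so "cve-..." and "CVE-..." count once.
    explicit = sorted({match.upper() for match in EXPLICIT.findall(text)})
    lowered = text.lower()
    implicit = [term for term in IMPLICIT_TERMS if term in lowered]
    return {"explicit": explicit, "implicit": implicit}
```

Run against a title, description, review comment, or commit message, the function returns both kinds of signal for that field. Attributing each result to a human, bot, or agent would need the author metadata from the hosting platform, which this sketch leaves out.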

The three-way distinction matters operationally because the actors behave differently. Bots, with Dependabot as the dominant example, post dependency-bump pull requests and audit commentary. Coding agents author and review code changes. Humans contribute patches and moderate the thread. They all write into the same artifact, and a reader skimming a PR cannot always tell which actor originated a given CVE reference. That ambiguity is where the triage problem begins.

What does the corpus show about bots, agents, and humans?

Bots dominate vulnerability-identifier volume but contribute relatively few identifiers per mention, while humans and agents cite vulnerabilities more rarely but spread across more parts of the thread. That is the headline from “Who Said CVE?” (arXiv:2601.19636), an MSR 2026 paper reporting that bots account for 69.1% of all vulnerability-identifier mentions in pull requests, usually adding a few identifiers to PR descriptions for automated dependency updates and audits. Human and agent mentions are rarer but span more locations, including fixes, maintenance, and discussion. That figure measures identifier volume, not correctness or severity, and reading it as “bots handle most of the security work” would conflate presence with validation.

The agent-specific picture comes from “Security in the Age of AI Teammates” (arXiv:2601.00477), which identifies 1,293 confirmed security-related agentic PRs, roughly 4% of agent activity. Compared to non-security PRs, security-related agentic PRs exhibit lower merge rates and longer review latency, with variation across agents and programming ecosystems.

The latency gap cuts two ways. It suggests security PRs attract more reviewer attention, which is the right instinct. But rejection, the authors note, correlates more with PR complexity and verbosity (for example, longer titles) than with explicit security terminology, and the paper states outright that agents frequently produce false positives and claims that do not correspond to actual vulnerabilities. More attention is not the same as more scrutiny: reviewers are spending time, but not reliably on the question of whether the finding is real.

The corpus itself is substantial. The agentic-PR study draws on the AIDev dataset, comprising over 33,000 curated pull requests from popular GitHub repositories. A separate study, “Humans Integrate, Agents Fix” (arXiv:2604.04059), finds that humans initiate most references to agent-authored PRs, typically to build new features, while agents reference other PRs mainly to fix errors. Referencing and referenced PRs show substantially longer lifespans and review times than isolated PRs, which the authors read as signaling higher coordination and integration effort.

Why does a CVE comment end the conversation?

The moment an automated actor posts a CVE reference into a PR thread, the discussion reads as already handled, and the informal scrutiny that used to catch false positives dissolves. None of the empirical papers above measures this “perceived-handled” effect directly; it is an inference from the patterns they document. The bot share of mentions is large, the references land in the same thread humans read, and reviewers have finite attention. When the thread already contains a CVE identifier, the cost of double-checking the reference rises relative to the cost of moving on, and most reviewers move on.

The failure mode this produces is dramatized, not documented, by the CVE-2026-LGTM satire published 2026-06-26. It depicts an AI triage assistant closing a correct credential-exfiltration detection as a false positive while the human reporter is rate-limited. A ByteIota analysis of the same piece frames the root cause as a coordination failure: each agent in a chain assumes a predecessor already performed the actual analysis, so none of them does. That is the mechanism the perceived-handled inference proposes, repeated across many threads at once.

CVE-2026-LGTM is satire, not an incident report. Its value here is as a clear illustration of what the corpus papers describe in aggregate: when an automated actor has spoken, the human reporter carries the burden of proving the finding is real, in-scope, and correctly severity-rated. The reporter is the one now doing triage, and they are doing it against a default that already says the matter is closed.

How do Dependabot and Codex Security shape the false-positive flood?

Dependabot is the dominant bot channel by which automated CVE and GHSA references enter pull request threads, and OpenAI’s newly launched Codex Security is the most visible attempt to cut the noise those references generate. The two sit on opposite sides of the same problem.

Per GitHub’s documentation, Dependabot operates through three features: Dependabot alerts, Dependabot security updates (auto-PRs for known-vulnerable dependencies), and Dependabot version updates (auto-PRs to keep dependencies current). Each can inject CVE or GHSA references into a PR thread, and that is the mechanism behind the bot mention share documented above. Dependabot is not the cause of the triage problem; it is the largest, most familiar instance of automated references landing where humans used to be the only voice.

Codex Security, formerly Aardvark, launched in research preview on 2026-06-26. OpenAI frames the problem as most AI security tools flagging low-impact findings and false positives that force security teams into triage, and the company reports cutting noise by up to 84%, reducing over-reported severity by more than 90%, and lowering false-positive rates by more than 50% in beta. These are vendor-reported figures. That OpenAI is building a product around noise reduction is itself a signal that the false-positive flood is the binding constraint, and it is consistent with the agent security-PR corpus, which shows agents frequently producing false positives, a volume likely to rise as agent-authored PRs grow.

A tool that filters before a reference reaches the thread attacks one part of the perceived-handled problem. It does not change what happens once a reference lands and the thread reads as triaged. Filtering at the source helps; it does not relieve reviewers of the job of confirming severity.

What should maintainers and security reporters change?

Once the burden of validation shifts back onto the reporter, the controls that matter are the ones that re-establish who owns severity validation and how irreversible actions get gated. The corpus points to three concrete adjustments.

First, treat automated CVE and GHSA references as unverified signal until a human confirms them, not as completed triage. The bot mention share measures identifier volume, not correctness or severity, and the agent false-positive rate is non-trivial enough that a reviewer reading a reference as “handled” is reading it wrong. The review-latency gap suggests security PRs attract attention; the problem is that the attention is not reliably spent separating real findings from false ones.
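One way to make that default explicit is to record who posted a reference and who, if anyone, validated it. The sketch below is a hypothetical data model, not an existing GitHub or Dependabot feature: `VulnMention`, `confirmed_by`, and `triage_status` are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VulnMention:
    """A vulnerability reference plus its provenance (hypothetical model)."""
    identifier: str                     # e.g. a CVE or GHSA ID
    actor_type: str                     # "human", "bot", or "agent"
    confirmed_by: Optional[str] = None  # login of the human who validated it


def triage_status(mention: VulnMention) -> str:
    """Return a status that never equates presence with validation."""
    if mention.confirmed_by is not None:
        return "confirmed"
    if mention.actor_type in ("bot", "agent"):
        # An automated reference stays unverified until a human signs off.
        return "unverified"
    # A human raised it, but nobody has validated severity yet.
    return "human-reported"
```

Under this model a bot-posted identifier reports as "unverified" until someone fills in `confirmed_by`, so a thread view built on it could not show the reference as handled by default.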

Second, gate irreversible actions behind an explicit human confirmation. The CVE-2026-LGTM satire exists because no actor in the chain owned the final severity call, and closing or dismissing a security report is irreversible enough that it should not inherit a default from a predecessor’s silence.
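A minimal sketch of such a gate, assuming a hypothetical in-house triage service rather than any real GitHub API: `SecurityReport`, `dismiss_report`, and `HumanConfirmationRequired` are illustrative names, and the only rule enforced is that a dismissal must name a human approver.

```python
from dataclasses import dataclass
from typing import Optional


class HumanConfirmationRequired(Exception):
    """Raised when an irreversible action lacks a named human approver."""


@dataclass
class SecurityReport:
    title: str
    state: str = "open"
    closed_by: Optional[str] = None


def dismiss_report(
    report: SecurityReport,
    requested_by: str,
    human_approver: Optional[str] = None,
) -> SecurityReport:
    """Dismiss a report only when a human has explicitly signed off."""
    if human_approver is None:
        # No default is inherited from an earlier actor's silence.
        raise HumanConfirmationRequired(
            f"{requested_by} asked to dismiss {report.title!r} "
            "without a human approver"
        )
    report.state = "dismissed"
    report.closed_by = human_approver
    return report
```

The design choice is that the approver is a required fact about the action, not a flag an upstream agent can set on a human's behalf. How the approver's identity is authenticated is outside the scope of this sketch.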

Third, on the reporter side, the operative assumption is now that an automated reference will appear in the thread before a human reads the finding. That makes reproduction steps, severity justification, and scope notes load-bearing rather than optional. A report that relies on the reviewer to reproduce a vague claim is the report most likely to be dismissed under the perceived-handled default.

The deeper shift is structural. Open-source security review used to run on the assumption that a CVE reference in a thread meant a human had assessed it. That assumption no longer holds when most references are bot-authored and an increasing share are agent-authored. The fix is not to remove the automation; it is to stop treating the presence of a reference as evidence of triage. Until the tooling makes “who validated this” explicit, the reporter is the triager, whether anyone assigned them the role or not.

Frequently Asked Questions

Which coding agents does the corpus track, and how recent is the data?

The AIDev-pop corpus follows five agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) across 2,807 repositories with more than 500 GitHub stars, with collection cut off at August 1, 2025. Codex Security and the CVE-2026-LGTM satire, both dated 2026-06-26, postdate that window by nearly eleven months.

Which Dependabot pull request is most likely to smuggle a CVE past a reviewer?

A Dependabot version-update PR, opened on a schedule to keep dependencies current regardless of advisories, is the one reviewers already discount as routine churn. A security-update PR fires only against a known-vulnerable dependency and carries an implicit severity claim, so a CVE reference landing inside a version-update diff inherits the lower-scrutiny default.

Do some agents or ecosystems merge security pull requests more often than others?

The spread is wide. “Security in the Age of AI Teammates” reports security-PR merge rates from 49.60 percent for GitHub Copilot to 86.59 percent for OpenAI Codex, with Rust security PRs lowest at 51.16 percent. The gap tracks task mix and language norms, not agent competence, so treating it as a ranking overreads the data.

Claude Code has 14.6 percent of its activity classified as security-related, the highest proportion among the five tracked agents; the aggregate across all of them is 3.85 percent. A large security-PR share does not imply a low false-positive rate, because the same study notes agents routinely file claims that do not match real vulnerabilities.