The shell you already gave away
The threat model in arXiv:2605.25871, posted May 25 by Yue Liu and colleagues, is almost embarrassingly simple. Agentic coding assistants like Claude Code, Cursor, and GitHub Copilot’s agent mode already hold the three capabilities an attacker needs to compromise a developer machine: file-write, shell-execute, and network egress. The attacker does not need to exploit a memory corruption bug or chain a privilege escalation. They need to get the agent to read a crafted string. The agent, running under the developer’s credentials, does the rest.
This is not a hypothetical. The paper formally maps how indirect prompt injection, hidden in artifacts the agent reads autonomously without user mediation, converts the coding assistant into a pivot point. The attacker plants a payload in a dependency’s README, a comment in an open issue, or a .env file in a cloned repo. The agent ingests it as context. The payload instructs the agent to exfiltrate secrets, open a reverse shell, or commit malicious code to the project. The agent complies, because from its perspective, it is following instructions.
The asymmetry is the point. Traditional supply-chain attacks require compromising the build infrastructure, the package registry, or the maintainer’s credentials. Here, the attacker compromises the reviewer’s tool. The agent sits between the developer and the codebase, and it already has the keys.
How indirect prompt injection reaches the agent
Direct prompt injection is the well-understood variant: a user pastes a malicious instruction into the chat window, and the model obeys it. That attack is loud and requires user action. The indirect variant is quieter and, for coding agents, far more dangerous.
The payload lives in an external artifact the agent reads on its own. Repo files are the obvious vector: a malicious contributor adds a comment in a config file, a poisoned .cursorrules file, or a markdown document with invisible Unicode instructions. Dependencies are another: a transitive dependency ships a setup.py whose docstring contains the payload. Issue comments on GitHub or GitLab are a third: the attacker opens an issue on a public repo with instructions the agent picks up when summarizing the issue backlog.
The paper catalogs these attack surfaces across the major coding assistants and notes that the current generation of agents treats all ingested text as authoritative context. There is no trust hierarchy between “instructions from the user” and “text from a random file in node_modules/.” The agent’s context window is flat.
What the prevalence data shows
The paper measures the prevalence of indirect prompt injection attacks against coding assistants, though the specific attack-success rates and vendor-by-vendor breakdowns are in the full PDF rather than the abstract. The headline finding is structural rather than numerical: the attack surface is universal across the current generation of agentic coding tools because they all share the same design assumption, which is that ingested text is safe context.
Context from the broader benchmark literature reinforces the concern. EvoCode-Bench (arXiv:2605.24110) reports that the strongest coding agents achieve roughly 50% success on multi-turn evaluation metrics, and the aggregate pass rate drops below half of round-1 performance by round 5. Agents that survive longer expose specification-tracking and regression failures. An agent that cannot reliably follow the developer’s specification across five turns is an agent that cannot reliably reject a well-crafted injection hidden in round 3. The trust problem compounds with session length.
Why autonomous PRs and background agents make it worse
The uncomfortable timing is that vendors are shipping exactly the features that amplify this attack surface. Autonomous pull requests, where the agent opens a PR without the developer reviewing every line of the diff, are now a selling point. Background agents that run tasks over hours, cloning repos, resolving dependencies, and executing shell commands while the developer is away, are the next frontier.
The paper’s core argument lands here. The same capability surface vendors are racing to widen is the surface that makes the shell-pivot trivial. An agent that can autonomously clone a repo, install dependencies, run tests, and push commits is an agent that can autonomously exfiltrate secrets, install a persistent backdoor, and push malicious code. The only difference is the instruction it follows, and the attacker controls the instruction via the injected payload.
A permission prompt or approval workflow does not solve this. The agent’s autonomous read operations, browsing the issue tracker, scanning dependency metadata, reading documentation files, happen below the approval layer. The injection reaches the agent before the developer sees anything to approve.
The trust-class problem
This reframes how engineering organizations should treat agent-authored commits. Today, most review tooling treats a commit the same way regardless of who or what authored it. A PR from Copilot’s agent mode gets the same CI gates, the same review checklist, and the same merge workflow as a PR from a junior engineer.
That equivalence is wrong. An agent-authored commit carries a distinct threat profile. The agent may have been following a legitimate instruction from the developer, or it may have been following an injected instruction from an attacker. The commit message and the diff look identical either way. The review tooling cannot distinguish the two by examining the output alone.
Forge-side tooling, the review infrastructure on GitHub, GitLab, and similar platforms, needs to surface the provenance of agent-authored changes. Not just “this PR was generated by an AI” labeling, but runtime constraints: did the agent make network calls during execution? Did it write to files outside the repo? Did it read from paths that include untrusted dependencies? These are the observability signals that make agent-authored commits auditable.
What engineering teams should do now
The paper’s practical recommendations align with defense-in-depth. None of these are novel in isolation, but the threat model gives them new urgency.
Sandbox the agent runtime. The agent should not run with the developer’s full shell privileges. Containerize the execution environment. Restrict network egress to whitelisted domains (package registries, internal APIs). Mount the repository read-only unless the agent is explicitly in a write phase. These are standard sandboxing practices that most agent deployments currently skip in the name of convenience.
Gate agent-authored commits separately. Add CI checks that flag commits produced by agent workflows for additional review. Restrict what agents can push directly versus what requires human approval. The granularity matters: an agent updating documentation is a different risk from an agent modifying CI configuration or dependency manifests.
Treat untrusted artifacts as untrusted. The agent’s context window should not treat a node_modules/ README as equivalent to the user’s explicit instruction. Building a trust hierarchy into the agent’s input pipeline is an open research problem, but even coarse-grained filtering, like stripping or flagging content from files not tracked in the repo’s primary branch, would raise the bar.
What the survey literature adds
A concurrent survey, arXiv:2605.23989, published in Academia AI and Applications vol. 2 (2026), documents real-world security failures in open-source agentic systems and consolidates evaluation metrics for release-gating decisions: constraint violations, trace completeness, and adversarial success rates. The survey reinforces that the coding-agent threat model is not isolated. It is part of a broader pattern where agentic systems, given real-world tool access, fail in ways that are structurally similar to the privilege-escalation attacks the coding-agent paper describes.
Separately, arXiv:2605.23929 models the latency-reliability-cost tradeoffs in LLM-enabled agentic workflows and introduces a water-filling token allocation policy. The relevance is indirect but real: any runtime sandboxing, additional verification layers, or constrained execution environments for coding agents directly impacts the latency budget that vendors are optimizing against. There is a tension between shipping fast autonomous agents and shipping secure ones, and the current market incentives favor speed.
The coding-agent-as-attacker-shell framing in arXiv:2605.25871 is a preprint, not yet peer-reviewed, and the full prevalence measurements and defense evaluations are in the PDF rather than the abstract. But the structural argument does not require precise numbers to land. The agents already have the privileges. The injection vectors already exist in the workflows. The vendors are expanding the attack surface as a product feature. The question is not whether this gets exploited in the wild, but when, and whether the review tooling will be ready to catch it.
Frequently Asked Questions
Does the injection risk apply to inline autocomplete assistants, or only full agentic modes?
Inline autocomplete tools like Copilot’s tab-completion do not autonomously read from dependencies, issue trackers, or arbitrary repo files, so they lack the ingestion surface indirect prompt injection requires. The attacker-shell model depends on an agent that reads external artifacts and executes multi-step workflows without user mediation. Tab-completion is not in scope.
How does sandboxing an agent affect its ability to complete tasks?
The water-filling token allocation model from arXiv:2605.23929 shows that verification and sandboxing overhead competes directly with task-completion tokens in a fixed latency budget. Adding runtime controls does not just slow the agent down; it reduces the token capacity available for the actual coding task, which can lower output quality. This tradeoff explains why most current deployments skip sandboxing entirely.
If a long-running background agent goes off-spec, can you tell injection from ordinary degradation?
Probably not from the output alone. EvoCode-Bench shows that agents lose specification fidelity by round 5 without any adversarial input. A background agent running for dozens of turns will drift from its instructions through normal degradation, producing output that is structurally indistinguishable from successful prompt injection. Incident response teams cannot rely on output inspection to determine whether a bad commit came from an attacker or from model limitations.
What can’t engineering teams do right now because of gaps in the published research?
Set quantitative benchmarks for their agent-sandboxing measures. The coding-agent paper’s defense-effectiveness evaluations sit behind the full PDF and are not yet publicly extractable, and the broader survey literature does not provide vendor-specific adversarial success rates usable as baselines. Without published numbers on how often specific sandboxing measures block specific injection vectors, teams are designing controls without a reference point for whether their runtime is hardened enough.