Claude Code, Cursor, Copilot: How Agentic Coding Assistants Get Weaponized as Attacker Shells

The shell you already gave away

The threat model in arXiv:2605.25871, posted May 25 by Yue Liu and colleagues, is almost embarrassingly simple. Agentic coding assistants like Claude Code, Cursor, and GitHub Copilot’s agent mode already hold the three capabilities an attacker needs to compromise a developer machine: file-write, shell-execute, and network egress. The attacker does not need to exploit a memory corruption bug or chain a privilege escalation. They need to get the agent to read a crafted string. The agent, running under the developer’s credentials, does the rest.

This is not a hypothetical. The paper formally maps how indirect prompt injection, hidden in artifacts the agent reads autonomously without user mediation, converts the coding assistant into a pivot point. The attacker plants a payload in a dependency’s README, a comment in an open issue, or a .env file in a cloned repo. The agent ingests it as context. The payload instructs the agent to exfiltrate secrets, open a reverse shell, or commit malicious code to the project. The agent complies, because from its perspective, it is following instructions.

The asymmetry is the point. Traditional supply-chain attacks require compromising the build infrastructure, the package registry, or the maintainer’s credentials. Here, the attacker compromises the reviewer’s tool. The agent sits between the developer and the codebase, and it already has the keys.

How indirect prompt injection reaches the agent

Direct prompt injection is the well-understood variant: a user pastes a malicious instruction into the chat window, and the model obeys it. That attack is loud and requires user action. The indirect variant is quieter and, for coding agents, far more dangerous.

The payload lives in an external artifact the agent reads on its own. Repo files are the obvious vector: a malicious contributor adds a comment in a config file, a poisoned .cursorrules file, or a markdown document with invisible Unicode instructions. Dependencies are another: a transitive dependency ships a setup.py whose docstring contains the payload. Issue comments on GitHub or GitLab are a third: the attacker opens an issue on a public repo with instructions the agent picks up when summarizing the issue backlog.

The paper catalogs these attack surfaces across the major coding assistants and notes that the current generation of agents treats all ingested text as authoritative context. There is no trust hierarchy between “instructions from the user” and “text from a random file in node_modules/.” The agent’s context window is flat.

The attacks that already shipped [Updated June 2026]

The “not whether but when” framing aged in under a year. Every vector the preprint describes now has a named CVE or a public proof-of-concept against a shipping product.

Rules files were first. In March 2025, Pillar Security disclosed the rules-file backdoor: hidden Unicode (zero-width joiners and bidirectional text markers) embedded in a .cursor/rules or .github/copilot-instructions.md file steers Cursor and Copilot into emitting backdoored code, with the instruction invisible in the editor and propagating through forked or shared rule files. Neither vendor assigned a CVE; both framed it as user responsibility, and GitHub later added a hidden-character warning to its web UI. The payload sits exactly where the threat model predicts: in a config artifact the agent reads as authoritative context. Committed trust-toggling config is the same class as TrustFall, where a checked-in settings file flips workspace-trust dialogs across Claude Code, Gemini CLI, Cursor, and Copilot CLI into unsandboxed code execution.

MCP turned out to be the richer surface. Two Cursor CVEs landed within days of each other in August 2025. CurXecute (CVE-2025-54135), from Aim Labs, used an indirect injection (delivered through a Slack MCP server, among others) to rewrite Cursor’s ~/.cursor/mcp.json and trigger command execution before the user could reject the suggested edit, yielding remote code execution under the developer’s account; fixed in Cursor 1.3. MCPoison (CVE-2025-54136), from Check Point, exploited the inverse: once an MCP server config was approved for a project, Cursor trusted it by name rather than by content, so an attacker could swap in malicious commands after approval and have them run silently on every later session. The fix forces re-approval whenever a config changes. Invariant Labs had already shown in April 2025 that a poisoned tool description alone, with no code change, could make Cursor exfiltrate ~/.ssh/id_rsa; benchmarks since confirm agents will trust a tool manual that lies about what the tool does.

Claude Code shipped its own instance. CVE-2025-52882, disclosed by Datadog Security Labs at CVSS 8.8, came from Claude Code’s IDE extensions running an unauthenticated WebSocket MCP server on localhost. A malicious web page the developer happened to visit could connect to it, read local files, and execute code. Anthropic patched it in 1.0.24 (late June 2025) and pulled the vulnerable extension builds. It is the cleanest illustration of the paper’s argument: the privileged runtime already existed, and the only missing piece was a channel to reach it.

GitHub Copilot answered the egress question. CamoLeak (CVE-2025-59145), CVSS 9.6, hid instructions in a pull-request description that made Copilot Chat exfiltrate private-repo contents one character at a time through GitHub’s own Camo image proxy, encoding each character as a request to a pre-signed image URL. The detail that matters for defenders: it bypassed network egress controls by riding infrastructure the organization already trusts. GitHub neutralized it by disabling image rendering in Copilot Chat. Restricting egress to whitelisted domains, one of the recommendations below, does not help when the exfiltration path is the vendor’s own CDN.

The common thread is that none of these needed a memory-corruption bug or a stolen credential. Each one got the agent to read attacker-controlled text, and the agent’s existing privileges did the rest. A structural threat model surviving a year of real disclosures intact is rare; this one did.

What the prevalence data shows

The paper measures the prevalence of indirect prompt injection attacks against coding assistants, though the specific attack-success rates and vendor-by-vendor breakdowns are in the full PDF rather than the abstract. The headline finding is structural rather than numerical: the attack surface is universal across the current generation of agentic coding tools because they all share the same design assumption, which is that ingested text is safe context.

Context from the broader benchmark literature reinforces the concern. EvoCode-Bench (arXiv:2605.24110) reports that the strongest coding agents achieve roughly 50% success on multi-turn evaluation metrics, and the aggregate pass rate drops below half of round-1 performance by round 5. Agents that survive longer expose specification-tracking and regression failures. An agent that cannot reliably follow the developer’s specification across five turns is an agent that cannot reliably reject a well-crafted injection hidden in round 3. The trust problem compounds with session length.

Why autonomous PRs and background agents make it worse

The uncomfortable timing is that vendors are shipping exactly the features that amplify this attack surface. Autonomous pull requests, where the agent opens a PR without the developer reviewing every line of the diff, are now a selling point. Background agents that run tasks over hours, cloning repos, resolving dependencies, and executing shell commands while the developer is away, are the next frontier.

The paper’s core argument lands here. The same capability surface vendors are racing to widen is the surface that makes the shell-pivot trivial. An agent that can autonomously clone a repo, install dependencies, run tests, and push commits is an agent that can autonomously exfiltrate secrets, install a persistent backdoor, and push malicious code. The only difference is the instruction it follows, and the attacker controls the instruction via the injected payload.

A permission prompt or approval workflow does not solve this. The agent’s autonomous read operations, browsing the issue tracker, scanning dependency metadata, reading documentation files, happen below the approval layer. The injection reaches the agent before the developer sees anything to approve.

The trust-class problem

This reframes how engineering organizations should treat agent-authored commits. Today, most review tooling treats a commit the same way regardless of who or what authored it. A PR from Copilot’s agent mode gets the same CI gates, the same review checklist, and the same merge workflow as a PR from a junior engineer.

That equivalence is wrong. An agent-authored commit carries a distinct threat profile. The agent may have been following a legitimate instruction from the developer, or it may have been following an injected instruction from an attacker. The commit message and the diff look identical either way. The review tooling cannot distinguish the two by examining the output alone.

Forge-side tooling, the review infrastructure on GitHub, GitLab, and similar platforms, needs to surface the provenance of agent-authored changes. Not just “this PR was generated by an AI” labeling, but runtime constraints: did the agent make network calls during execution? Did it write to files outside the repo? Did it read from paths that include untrusted dependencies? These are the observability signals that make agent-authored commits auditable.

What engineering teams should do now

The paper’s practical recommendations align with defense-in-depth. None of these are novel in isolation, but the threat model gives them new urgency.

Sandbox the agent runtime. The agent should not run with the developer’s full shell privileges. Containerize the execution environment. Restrict network egress to whitelisted domains (package registries, internal APIs). Mount the repository read-only unless the agent is explicitly in a write phase. These are standard sandboxing practices that most agent deployments currently skip in the name of convenience. This is also where Vercel’s secure-agent guidance lands: treat prompt injection as unsolvable at the model layer and push every control into the runtime, with least privilege as the default.

Gate agent-authored commits separately. Add CI checks that flag commits produced by agent workflows for additional review. Restrict what agents can push directly versus what requires human approval. The granularity matters: an agent updating documentation is a different risk from an agent modifying CI configuration or dependency manifests.

Treat untrusted artifacts as untrusted. The agent’s context window should not treat a node_modules/ README as equivalent to the user’s explicit instruction. Building a trust hierarchy into the agent’s input pipeline is an open research problem, but even coarse-grained filtering, like stripping or flagging content from files not tracked in the repo’s primary branch, would raise the bar.

What the survey literature adds

A concurrent survey, arXiv:2605.23989, published in Academia AI and Applications vol. 2 (2026), documents real-world security failures in open-source agentic systems and consolidates evaluation metrics for release-gating decisions: constraint violations, trace completeness, and adversarial success rates. The survey reinforces that the coding-agent threat model is not isolated. It is part of a broader pattern where agentic systems, given real-world tool access, fail in ways that are structurally similar to the privilege-escalation attacks the coding-agent paper describes.

Separately, arXiv:2605.23929 models the latency-reliability-cost tradeoffs in LLM-enabled agentic workflows and introduces a water-filling token allocation policy. The relevance is indirect but real: any runtime sandboxing, additional verification layers, or constrained execution environments for coding agents directly impacts the latency budget that vendors are optimizing against. There is a tension between shipping fast autonomous agents and shipping secure ones, and the current market incentives favor speed.

The coding-agent-as-attacker-shell framing in arXiv:2605.25871 is a preprint, not yet peer-reviewed, and the full prevalence measurements and defense evaluations are in the PDF rather than the abstract. But the structural argument does not require precise numbers to land. The agents already have the privileges. The injection vectors already exist in the workflows. The vendors are expanding the attack surface as a product feature. The original framing here was whether this gets exploited in the wild, and when. As of mid-2026 that question is settled: CurXecute, MCPoison, the Claude Code WebSocket flaw, and CamoLeak each turned one of these vectors into a patched CVE against a shipping product. [Updated June 2026] The open question now is narrower and harder, which is whether review and forge tooling can tell an injected commit from a legitimate one before it merges.

Frequently Asked Questions

Does the injection risk apply to inline autocomplete assistants, or only full agentic modes?

Inline autocomplete tools like Copilot’s tab-completion do not autonomously read from dependencies, issue trackers, or arbitrary repo files, so they lack the ingestion surface indirect prompt injection requires. The attacker-shell model depends on an agent that reads external artifacts and executes multi-step workflows without user mediation. Tab-completion is not in scope.

How does sandboxing an agent affect its ability to complete tasks?

The water-filling token allocation model from arXiv:2605.23929 shows that verification and sandboxing overhead competes directly with task-completion tokens in a fixed latency budget. Adding runtime controls does not just slow the agent down; it reduces the token capacity available for the actual coding task, which can lower output quality. This tradeoff explains why most current deployments skip sandboxing entirely.

If a long-running background agent goes off-spec, can you tell injection from ordinary degradation?

Probably not from the output alone. EvoCode-Bench shows that agents lose specification fidelity by round 5 without any adversarial input. A background agent running for dozens of turns will drift from its instructions through normal degradation, producing output that is structurally indistinguishable from successful prompt injection. Incident response teams cannot rely on output inspection to determine whether a bad commit came from an attacker or from model limitations.

Has the attacker-shell model actually been exploited, or is it still theoretical?

It is no longer theoretical. By mid-2026 each major vector had a named CVE against a shipping product: CurXecute (CVE-2025-54135) and MCPoison (CVE-2025-54136) in Cursor’s MCP handling, CVE-2025-52882 in Claude Code’s IDE extensions, and CamoLeak (CVE-2025-59145) in GitHub Copilot Chat. The rules-file backdoor that Pillar Security disclosed for Cursor and Copilot got no CVE but was reproduced publicly. None required a software exploit in the classic sense; each got the agent to read attacker-controlled text and used the privileges it already held.

What can’t engineering teams do right now because of gaps in the published research?

Set quantitative benchmarks for their agent-sandboxing measures. The coding-agent paper’s defense-effectiveness evaluations sit behind the full PDF and are not yet publicly extractable, and the broader survey literature does not provide vendor-specific adversarial success rates usable as baselines. Without published numbers on how often specific sandboxing measures block specific injection vectors, teams are designing controls without a reference point for whether their runtime is hardened enough.