Three Labs Concede Browser Agents Cannot Stop Prompt Injection

Three labs, one conclusion

OpenAI, Anthropic, and Google DeepMind have each, independently, said roughly the same thing about prompt injection in browsing agents: it is an architectural property of the system, not a bug to patch. OpenAI stated publicly that prompt injection is “unlikely to ever be fully solved” and that agent mode in ChatGPT Atlas “expands the security threat surface.” Anthropic acknowledges that all agents processing untrusted content are subject to prompt injection risks, noting that “prompt injection is far from a solved problem, particularly as models take more real-world actions.” The UK’s National Cyber Security Centre echoed the position, advising cybersecurity professionals to reduce risk and impact rather than expect prompt injection to be stopped outright.

This is not three companies hedging for liability reasons. It is a convergence on a hard constraint of information theory: any system that processes untrusted input as instructions will, by definition, have an attack surface proportional to the expressiveness of that input. When the input language is natural language and the instruction parser is a large language model, the attack surface is the full generative capacity of the model.

The attack mechanics: one URL, multiple failure modes

Google DeepMind researchers published a six-category framework cataloguing attack types against autonomous agents: Content Injection Traps, Semantic Manipulation Traps, Cognitive State Traps, Behavioural Control Traps, Data Exfiltration Traps, and Sub-agent Spawning Traps. The numbers are stark. Human-written injection prompts commandeered agents in up to 86% of tested scenarios (source). Data exfiltration succeeded over 80% of the time across five tested agents (source).

Content Injection Traps exploit the gap between how humans perceive a webpage and how agents parse its code, hiding malicious instructions in invisible markup or steganographic image data. Semantic Manipulation Traps corrupt reasoning through biased phrasing and framing effects rather than overt commands. Cognitive State Traps target long-term memory via RAG Knowledge Poisoning, injecting fabricated statements into retrieval corpora. Behavioural Control Traps directly hijack agent actions. Data Exfiltration Traps coerce agents to transmit sensitive data to attacker-controlled endpoints. Sub-agent Spawning Traps exploit orchestrator-level privileges to spawn attacker-controlled child agents inside trusted workflows.

On the offensive side, OpenAI trained an RL-based “LLM automated attacker” that discovers novel attack strategies not found by human red teams. The bot tests attacks in simulation, studies the target AI’s internal reasoning, then iterates. OpenAI’s position is that this internal adversary, with access to the model’s reasoning traces, should find flaws faster than external attackers could.

The defense stack so far

Anthropic has been the most detailed about its defensive posture for Claude for Chrome. The company reports “significant progress” on prompt injection robustness since launching the extension in research preview, evaluated against an internal adaptive “Best-of-N” attacker that repeatedly tries and combines attack strategies. Claude Opus 4.5 “sets a new standard in robustness to prompt injections” according to Anthropic, representing a “major improvement over previous models.” The caveat: this is measured against Anthropic’s own internal attacker. Real-world success rates against novel attacks from independent adversaries are almost certainly higher, and Anthropic explicitly acknowledges that “prompt injection is far from a solved problem.”

On the academic side, the Adversarial Prompt Disentanglement (APD) framework, published at AAAI 2026, takes a different approach. APD uses mutual information-based semantic decomposition to separate instruction from injection, then applies graph-based spectral intent classification to identify malicious payloads. The reported result: over 85% reduction in harmful output generation with negligible impact on model performance. Whether this transfers from benchmark settings to production agent deployments remains an open question.

Why browser agents are the hardest case

Anthropic’s own analysis frames the core risk: browser agents face a vast attack surface where every webpage, embedded document, advertisement, and dynamically loaded script is a potential vector, combined with high-impact action capabilities like navigating to URLs, filling forms, clicking buttons, and downloading files. The DeepMind framework underscores the same point: agents that autonomously execute financial transactions, manage emails, and call external APIs operate in an environment where the information itself has become hostile. When a compromised agent carries out an unauthorized transaction or exfiltrates email, the question of who bears liability remains unresolved. No existing legal framework cleanly assigns responsibility across the agent operator, the model provider, and the attacker.

What practitioners should do now

The convergence from OpenAI, Anthropic, DeepMind, and the UK NCSC points to a clear operational posture: treat prompt injection as you would treat any untrusted-input vulnerability in a security-critical system. You do not wait for the vulnerability class to be “solved.” You layer compensating controls.

Scope agent permissions to the minimum required for the task. An agent that reads calendar entries does not need email send permission. An agent that compares product prices does not need access to payment methods. Reducing either autonomy or access reduces risk linearly.

Human-in-the-loop gates for irreversible actions. Any action that sends email, transfers money, modifies production systems, or commits to a binding agreement should require explicit human confirmation. The confirmation prompt itself is an injection target, so the gate should be implemented outside the agent’s control flow, in the runtime rather than in the model’s output.

Input sanitization at the runtime layer. The APD framework suggests that a preprocessing layer between untrusted web content and the model can catch a significant fraction of injections without requiring the model itself to be injection-proof. This is defense in depth, not a substitute for model-level resistance.

Audit logs for every agent action. If an agent is compromised, the first question is “what did it do?” If actions are not logged with enough context to reconstruct the agent’s reasoning chain, post-incident analysis is guesswork.

None of these mitigations eliminate the risk. The three labs that build these models have said as explicitly as corporate communications allow that elimination is not on the table. The question is whether the risk can be reduced to a level that the value justifies. For narrow, well-scoped agent tasks behind strict permission boundaries, the answer is probably yes. For a general-purpose browser agent with access to your email and your credit card, the answer, right now, is probably not.

Frequently Asked Questions

Does Dynamic Cloaking mean the attack is invisible to human auditors?

DeepMind documented that malicious servers can fingerprint incoming traffic to identify AI agents, then serve pages that look identical to what a human would see but carry different semantic content with injection payloads. A human reviewing the same URL on their own browser loads the clean version, making traditional URL-based auditing ineffective against this class of attack.

If an agent transfers funds after being hijacked, who is legally liable?

DeepMind researchers named this the Accountability Gap. No existing legal framework assigns responsibility across the three involved parties: the agent operator, the model provider, and the domain owner hosting the injection payload. All three have plausible arguments that another party bears fault, and none of the major AI lab terms of service have been tested in court for agent-initiated financial loss.

Anthropic reports roughly 1% attack success. Is that a safe threshold for deployment?

That 1% figure comes from Anthropic’s internal adaptive Best-of-N attacker, which has access to the model but not to real-world adversarial creativity or the full diversity of web content. It is not directly comparable to the 86% hijack rate in the DeepMind study because they measure different threat models against different agents. A practitioner treating 1% as a deployment floor should assume real-world rates could be an order of magnitude higher against strategies the internal evaluator does not cover.

What did OpenAI’s automated attacker actually demonstrate?

OpenAI’s RL-trained attacker found a strategy no human red team had devised: it inserted a crafted email into the agent’s inbox that caused the agent to send a resignation letter to the user’s employer instead of a routine out-of-office reply. The attacker had access to the target model’s reasoning traces, an advantage external attackers lack, but the novelty of the strategy relative to all human-generated test suites suggests the space of viable injections is larger than manual testing covers.