Web Agents Can Be Talked Into Abandoning Their Task: The TRAP Benchmark

What TRAP Measures, and Why the Distinction Matters

Most agent-safety benchmarks test whether a model can be commanded to do something it shouldn’t. The TRAP benchmark, accepted to ICML 2026, tests something else: whether a model can be argued into abandoning its task. The difference matters because the defenses built for one failure mode do not generalize to the other. Instruction-hierarchy guardrails are built to catch commands that arrive from an untrusted source. They are not designed to catch persuasive content planted in a page the agent was already instructed to read.

TRAP (Task-Redirecting Agent Persuasion Benchmark) constructs attacks using a modular social-engineering injection framework with controlled variations across multiple attack parameters, allowing the benchmark to decompose why an attack works rather than just reporting that it does. The testbed runs controlled experiments on high-fidelity website clones.

Crucially, TRAP defines attack success at the interaction boundary: a single agent action that transfers execution into attacker-controlled context. Prior benchmarks like InjecAgent, AgentDojo, and OS-Harm rely on LLM-judged multi-step outcomes, which conflate the initial redirect with everything that happens after. By isolating the critical decision point, TRAP can measure whether the agent took the bait before downstream consequences muddy the signal.
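A boundary-level check of this kind can be sketched in a few lines. This is a minimal illustration, not TRAP’s actual scoring code: the `Action` record, its fields, and the idea of identifying attacker-controlled context by hostname are all assumptions made for the example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Action:
    """One agent step. Fields are hypothetical, not TRAP's schema."""
    kind: str        # e.g. "click", "type", "goto"
    target_url: str  # where the action lands; "" if it stays on the page


def crosses_boundary(action: Action, attacker_hosts: set[str]) -> bool:
    """True if this single action lands on an attacker-controlled host."""
    host = urlparse(action.target_url).hostname or ""
    return host in attacker_hosts


def boundary_attack_success(trajectory: list[Action], attacker_hosts: set[str]) -> bool:
    """The attack counts as successful on the first crossing.

    Whatever the agent does afterwards, including navigating back,
    does not change the result.
    """
    return any(crosses_boundary(a, attacker_hosts) for a in trajectory)
```

Because the check looks at individual actions rather than the final state, it needs no LLM judge, and an agent that clicks through and then returns to the original site is still counted as redirected.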

The Numbers: 13% to 43% Across Six Frontier Models

Across six frontier models, TRAP measured an average attack success rate (ASR) of 25%, ranging from a low of 13% on GPT-5 to a high of 43% on DeepSeek-R1. The spread is wide enough that citing the average alone is misleading. A model that falls for roughly one in eight persuasion attempts and a model that falls for nearly half of them are in different risk categories, even though both are scored on the same benchmark.

The DeepSeek-R1 result deserves a caveat the paper’s framing doesn’t emphasize: its reasoning-chain architecture, which produces visible chain-of-thought traces, may make it particularly susceptible to persuasive context that embeds itself in the reasoning process. Whether that 43% generalizes to other open-weight models or is architecture-specific remains an open question.

Interface Design as an Attack Multiplier

One of TRAP’s more concrete findings is that interface design functions as an attack multiplier. The benchmark tests both button-based and hyperlink-based injection vectors. The mechanism behind the button vector is straightforward: buttons signal primary actions in web UIs. An agent trained to interact with pages by clicking prominent elements will treat a well-placed malicious button the same way it treats a legitimate one, because the semantics of the interface tell it to.

Light contextual tailoring, adjusting the injected content to match the page’s domain or the user’s apparent task, often doubled success rates. Small interface changes and minor contextual adjustments produced similar effects on their own. The modular framework makes this visible: you can isolate which attack parameters contributed to success, rather than treating the attack as a binary pass/fail event.
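To make the decomposition concrete, here is a minimal sketch of how success rates could be broken down by a single attack parameter. The record layout, the field names, and the toy trial data are all made up for illustration; they are not TRAP’s data format or its results.

```python
from collections import defaultdict

# Made-up trial records for illustration only; not results from the paper.
TOY_TRIALS = [
    {"vector": "button", "tailoring": "tailored", "success": True},
    {"vector": "button", "tailoring": "generic", "success": True},
    {"vector": "button", "tailoring": "tailored", "success": False},
    {"vector": "hyperlink", "tailoring": "generic", "success": False},
    {"vector": "hyperlink", "tailoring": "tailored", "success": True},
    {"vector": "hyperlink", "tailoring": "generic", "success": False},
]


def asr_by_dimension(trials, dimension):
    """Group trials by one attack dimension and return the success rate per value."""
    counts = defaultdict(lambda: [0, 0])  # value -> [successes, total]
    for trial in trials:
        bucket = counts[trial[dimension]]
        bucket[0] += int(trial["success"])
        bucket[1] += 1
    return {value: wins / total for value, (wins, total) in counts.items()}


if __name__ == "__main__":
    for dim in ("vector", "tailoring"):
        print(dim, asr_by_dimension(TOY_TRIALS, dim))
```

Grouping the same trials by different dimensions is what a pass/fail attack log cannot offer: each trial carries its parameter values, so any one of them can be held up against the outcome.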

This has direct implications for anyone deploying web agents in production. The attack surface isn’t limited to the content of the page. It includes the structure of the UI the agent navigates. A defense that screens page text for suspicious instructions but ignores the presentation layer is screening the wrong thing.

As of 2026, agent-safety work at the major labs converges on instruction-hierarchy approaches: system prompts that tell the model to prioritize developer instructions over user or page content, and training that reinforces that boundary. Anthropic and OpenAI have both shipped versions of this. The assumption is that the model can distinguish between “instructions I should follow” and “instructions I should ignore” based on their source.

TRAP’s results suggest that assumption breaks down when the external content doesn’t present as an instruction at all. A planted button that says “Complete your purchase now to avoid losing your cart” isn’t a command in the syntactic sense. It’s a contextual suggestion that exploits the agent’s existing task framework. The agent isn’t being told to do something new; it’s being nudged to do the same thing somewhere else. Persuasion exploits the model’s capacity for contextual reasoning, which is the same capacity that makes the agent useful.

The real-world precedents illustrate this gap. Perplexity’s Comet browser was misled by malicious directives hidden in Reddit posts. The Odin Bounty Program demonstrated that Gemini could be manipulated by invisible white text in Gmail. In both cases, the agent processed the content as part of its normal browsing task. The content wasn’t overriding instructions. It was embedding itself in the context the agent was already committed to processing.

What This Means for Agent Safety Certification

TRAP raises the cost of certifying an agent as safe. Before this benchmark, the evaluation path was relatively clear: run the agent against injection suites, confirm it refuses adversarial commands, ship. The persuasion surface adds a dimension that existing suites don’t cover, and the modular structure means the attack space is large enough that spot-checking a few configurations won’t suffice.

The benchmark’s design choice to evaluate at the interaction boundary rather than the outcome level is itself a contribution. Prior benchmarks that judge multi-step outcomes with LLM evaluators face two problems: the evaluator can be wrong, and the outcome conflates the redirect with the agent’s subsequent behavior. An agent that takes one step into attacker-controlled context and then recovers looks “safe” on an outcome metric. On TRAP’s boundary metric, that agent already failed, because the single redirect is the critical decision point that enables everything downstream.

For teams building web agents, the practical takeaway is that defense-in-depth needs to extend past instruction filtering. Monitoring the agent’s navigation path for unexpected domain transitions, rate-limiting actions on unfamiliar elements, and treating any page content that suggests task redirection as suspicious by default are all strategies consistent with what the benchmark measures. None of these suggestions is novel. TRAP’s contribution is showing that the existing defenses are insufficient without them.
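The first of those strategies, watching for unexpected domain transitions, can be sketched as a simple allowlist check over the agent’s navigation log. This is an illustrative assumption about how such a monitor might look, not something the paper specifies; the function name and the allowlist policy are hypothetical.

```python
from urllib.parse import urlparse


def unexpected_transitions(urls, allowed_hosts):
    """Return (step_index, host) for each navigation outside the allowlist.

    A host is allowed if it equals an allowlisted domain or is a subdomain
    of one. Matching on the dot-separated suffix keeps lookalike hosts such
    as "example.com.pay-now.test" from passing as "example.com".
    """
    flagged = []
    for step, url in enumerate(urls):
        host = (urlparse(url).hostname or "").lower()
        allowed = any(host == d or host.endswith("." + d) for d in allowed_hosts)
        if not allowed:
            flagged.append((step, host))
    return flagged


if __name__ == "__main__":
    log = [
        "https://shop.example.com/cart",
        "https://example.com.pay-now.test/checkout",
        "https://example.com/orders",
    ]
    print(unexpected_transitions(log, {"example.com"}))
```

A monitor like this only observes where the agent went. It would typically sit alongside, not replace, checks on what the page content asked the agent to do.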

Open Questions and the Reproducibility Gap

The benchmark uses website clones rather than live sites, which controls for confounds but may not capture the full complexity of real-world page structures. The authors note this limitation. Whether the ASR numbers translate directly to production deployments is an empirical question the paper doesn’t claim to answer.

The authors provide a modular framework designed for further benchmark expansion, which should allow replication and extension. The modular structure means other researchers can test new persuasion principles, new interaction vectors, or new contextual tailoring strategies without rebuilding the entire benchmark. That extensibility matters more than any single ASR number, because the persuasion surface is not static. As agents get better at reasoning about context, the attacks that work against them will shift.

The ICML 2026 acceptance gives the persuasion-vs-injection distinction institutional weight it didn’t have when the paper first appeared in December 2025. Whether that translates into changes in how labs evaluate agent safety is a separate question. The benchmark exists, the numbers are published, and the gap it identifies in current defenses is documented. What happens next depends on whether the agent-safety community treats persuasion as a first-class attack surface or continues to fold it into the injection bucket and hope the instruction hierarchy catches it.

Frequently Asked Questions

What specific attack dimensions does TRAP decompose, and how many combinations does that produce?

TRAP varies five independent parameters per attack: the human persuasion principle (drawn from Cialdini’s framework), the LLM manipulation method, the interaction vector, the injection location on the page, and the degree of contextual tailoring. Crossing these dimensions yields 630 distinct task-injection combinations across clones of Amazon, Gmail, Google Calendar, LinkedIn, DoorDash, and Upwork, all running on the REAL simulation environment. Prior suites treat an injection as a single monolithic block and cannot isolate which parameter drove a success.
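As a rough picture of what crossing the dimensions means, the sketch below enumerates every combination of five parameter lists. The specific values are placeholders chosen for the example (only a few persuasion principles are named, and the manipulation methods and locations are dummy labels), so the resulting count is not meant to reproduce the benchmark’s 630.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AttackConfig:
    """One point in the attack space; field names are illustrative."""
    principle: str
    manipulation: str
    vector: str
    location: str
    tailoring: str


# Placeholder values, not TRAP's actual parameter lists.
DIMENSIONS = {
    "principle": ["authority", "scarcity", "social_proof"],
    "manipulation": ["method_a", "method_b"],
    "vector": ["button", "hyperlink"],
    "location": ["location_a", "location_b"],
    "tailoring": ["generic", "tailored"],
}


def enumerate_attacks(dimensions):
    """Return one AttackConfig per element of the Cartesian product."""
    keys = list(dimensions)
    return [
        AttackConfig(**dict(zip(keys, combo)))
        for combo in product(*dimensions.values())
    ]


if __name__ == "__main__":
    print(len(enumerate_attacks(DIMENSIONS)))
```

Keeping each dimension as its own list is what allows a new principle or vector to be added in one place while every existing combination is regenerated automatically.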

Does the agent’s observation modality (screenshot, DOM, accessibility tree) change attack success rates?

Very little. The paper tested accessibility tree, screenshot, and DOM observation types and found only small differences among them in both benign task completion and attack success rates. TRAP defaults to accessibility tree because it supports the widest range of models and keeps per-run costs down. Teams relying on screenshot or DOM input should not assume their modality choice reduces exposure to task redirection.

Which production deployments face the highest exposure to task-redirection attacks?

Agents that navigate platforms heavy with user-generated content face the most direct risk. TRAP’s six target sites (Amazon, Gmail, Google Calendar, LinkedIn, DoorDash, Upwork) were chosen precisely because agents on these surfaces routinely process third-party listings, emails, messages, and job posts. Any deployment where the agent clicks through pages containing untrusted submissions inherits the same attack surface, regardless of instruction-hierarchy prompts.

Can other researchers extend TRAP with new attack strategies or target sites?

Yes. Each of the five attack dimensions is an independently swappable module. Researchers can introduce new persuasion principles, swap in different LLM manipulation methods, or add new website clones to the REAL simulation environment without rebuilding the benchmark from scratch. This extensibility matters because the persuasion surface shifts as models improve their contextual reasoning; a static benchmark would age out quickly.