Indirect Prompt Injection Benchmarks Were Too Easy: LivePI Adds Realism

Static prompt-injection benchmarks have been giving agent frameworks a false sense of security. LivePI, a benchmark that runs live multi-surface attacks on a real virtual machine rather than against canned payloads, reports attack success rates of 10.7% to 29.6% across five frontier models. Group-chat injection succeeded uniformly against every model evaluated. The gap between published defense scores and what happens under realistic conditions is wider than the compliance checklists suggest.

Why static benchmarks miss

Prior prompt-injection evaluations share a structural flaw: they test fixed payloads against narrow input channels, typically a single text field or a simulated API response. A defense that blocks a known payload under controlled conditions tells you little about what happens when the attack comes through a group-chat message, a file attachment, or a linked repository, all of which arrive with their own rendering context and tool-call surface.

LivePI replaces the simulation with a real virtual machine running live (test-controlled) email, chat, web, local-file, repository, and wallet interfaces. Seven input surfaces in total, twelve attack and rendering families, and five distinct malicious goals: protected-information exfiltration, unauthorized security-control changes, unsafe code execution, inbox-summary exfiltration, and cryptocurrency transfer. The coverage is not exhaustive, but it is closer to what an agent encounters in production than any prior published benchmark.

The numbers

Across five frontier models, LivePI reports total attack success rates ranging from 10.7% at the low end to 29.6% at the high. The models tested are GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5. The paper does not break out per-model rates in its abstract, so attributing the floor or ceiling to a specific model requires the full text.

Two vectors stand out. Group-chat injection succeeded against every model backbone evaluated, with no exception. Repository-link attacks produced high-severity failures as well, though the paper notes a small denominator for that category, so the severity ratio may shift with more samples.

Group-chat and repository vectors

Group-chat injection’s universal success deserves attention on its own. When an agent reads a shared channel, every message in that channel is an input surface, and messages from other participants carry implicit trust that the model has no principled way to verify. The attacker does not need to compromise the agent’s system prompt; they only need to post a message that the agent will ingest during its next polling cycle.

Repository-link attacks operate similarly: an agent fetching a URL for context gets the attacker’s payload along with the repository content. The small sample size in LivePI’s evaluation makes it premature to call this a confirmed weakness, but the high-severity outcomes in the data that exists are enough to warrant monitoring.

The defense that worked

The paper reports that a two-layer defense, prompt-level filtering combined with pre-execution tool-call authorization, intercepted all tested malicious-goal completions in the GPT-5.3-Codex setting while preserving benign utility on PinchBench-derived workloads.

The two-layer approach is worth spelling out because it is straightforward to replicate. The first layer scans incoming content for injection patterns before it reaches the model’s context window. The second layer intercepts outgoing tool calls and requires explicit authorization before any side-effecting action executes. Neither layer is novel in isolation; the contribution is the empirical confirmation that together they held under LivePI’s attack conditions, at least on one model.

What this means for agent-framework security claims

Frameworks like CrewAI, LangGraph, and AutoGen ship with tool-use defaults and documentation that often frame prompt-injection defenses as a configuration toggle. That framing is not wrong, but it is incomplete. The LivePI results suggest that toggling on a defense and passing a static red-team suite is not evidence of security under realistic conditions. The paper does not directly test these frameworks; the inference is editorial, based on the gap LivePI exposes between static and live evaluation.

Security teams treating prompt-injection benchmarks as a compliance checkbox now have a concrete reason to revalidate against adaptive, multi-surface payloads rather than fixed ones. The LivePI methodology (real VM, live interfaces, multiple malicious goals) is reproducible, and the two-layer defense architecture it validates gives practitioners a starting point. The benchmark reliability problem extends beyond prompt injection: AI safety benchmark rankings have been shown to flip entirely based on evaluation configuration, which means leaderboard positions without full eval-setup disclosure carry limited signal. Similarly, MCP tool description poisoning benchmarks demonstrate that the attack surface expands as agentic tool ecosystems grow.

Newer Anthropic models have added properties relevant to injection resistance. Claude Opus 4.8 (May 2026) increased honesty calibration; Claude Fable 5 (June 9, 2026), now Anthropic’s most capable widely released model, ships with cybersecurity classifiers that block offensive cyber tasks and achieved zero compliance across all 30 jailbreak techniques Anthropic tested. Neither property directly maps to the five LivePI malicious-goal categories, which include protected-information exfiltration and inbox-summary exfiltration; LivePI evaluated Claude Opus 4.6, and no analogous live-VM evaluation of Fable 5 has been published. The structural argument for architectural prompt-injection containment applies regardless of model generation: stronger classifiers and jailbreak resistance reduce one attack surface, but an adversary posting a legitimate-looking group-chat message never needs to jailbreak the model at all.

The broader signal is uncomfortable: if attack success rates of 10.7, 29.6% are what happens when researchers test in good faith on controlled infrastructure, the numbers in adversarial production environments are unlikely to be lower.

Frequently Asked Questions

Were specific agent frameworks tested in LivePI?

No. The paper evaluates bare model backbones on a custom VM, not any framework’s tool-use pipeline. A framework’s default tool-call routing, permission model, and input-sanitization layers all change the attack surface compared to a bare backbone, so teams running CrewAI or LangGraph would need to replicate LivePI’s methodology against their own stack to get grounded numbers.

What are the biggest caveats in the two-layer defense results?

The prompt-filtering plus tool-call-authorization defense was validated only on GPT-5.3-Codex, with benign-utility preservation measured against PinchBench-derived workloads. Effectiveness on the other four evaluated models remains untested. Because the filter and authorization layers are model-dependent (different models produce different tool-call formats and obey filter instructions differently), the 100% interception rate should not be assumed portable without replication.

What does implementing the two-layer defense require in practice?

The input-scanning layer must intercept content from every surface the agent reads before it enters the context window. The tool-call authorization layer needs to parse outgoing model actions and hold them for explicit approval before execution, which means every tool-call path needs an intercept point. Frameworks that execute tool calls inline, without a pre-execution hook, would require architectural changes to support this. The paper does not publish implementation code or latency measurements.

Will these attack success rates hold as models update?

Unlikely. The 10.7 to 29.6% range was measured against specific May 2026 model snapshots (the paper was submitted May 18 and revised May 23). Model providers routinely patch instruction-following behavior in response to published attack research. The structural finding that multi-surface attacks outperform single-channel ones should persist, but the specific percentages are point-in-time measurements that shift with each model release.

How reliable are the repository-link attack findings?

The paper itself flags a small denominator for repository-link attacks, meaning the high-severity failure rate comes from relatively few test cases. This contrasts with group-chat injection, which succeeded uniformly across all five backbones with a larger sample. The repository-link vector warrants monitoring but the current data is too thin to treat as a reproducible weakness at the same confidence level as the group-chat result.