Trojan Hippo Plants Dormant Payloads in Agent Memory, Hits 85-100% Exfiltration on Frontier Models

A single untrusted email can plant a dormant payload in an LLM agent’s long-term memory, survive a hundred benign sessions, and later exfiltrate data on finance, health, or identity at an 85 to 100% attack success rate on frontier models.¹ Das et al. treat memory not as a passive store but as a surface for persistent compromise, and the results suggest that every major agent framework treats that surface as an afterthought.

The Two-Stage Attack: Injection, Dormancy, and Topic-Triggered Activation

The mechanism is straightforward in structure and devastating in patience. An attacker sends a single crafted email or other untrusted tool input. The agent ingests it, summarises it, or stores it in memory via any of the four standard backends: explicit tool memory, agentic memory (e.g., Mem0), a RAG vector store, or a long sliding-window context. The payload is dormant. It does not trigger immediately. It waits.

Only when the user later discusses a sensitive topic (finance, health, legal, tax, or identity) does the planted memory activate and steer the agent toward data exfiltration. The authors report that these memories survive 100 benign sessions¹ before triggering, which severs the causal link that any single-session monitor or prompt-injection detector would notice. A sandbox that inspects the current turn sees nothing wrong. The attack happened many sessions ago.

This is not a prompt-injection in the usual sense. The user is not being social-engineered on the spot. The attack writes state into the agent’s memory primitives, and the agent later retrieves that state autonomously. The framework executes the retrieval; the model executes the compliance.

The Numbers: 85 to 100% ASR Across Four Memory Backends¹

Das et al. tested four memory architectures against Gemini 3.1 Pro and GPT-5-mini.¹ With no defenses, the attack success rate (ASR) is stark:

Against Gemini 3.1 Pro: 95% on Explicit Memory, 85% on Mem0, 80% on RAG, and 100% on Context (sliding-window).¹ Against GPT-5-mini: 15% on Explicit, 85% on Mem0, 80% on RAG, and 60% on Context.¹ The variance across backends matters more than the exact top-line figures. The attack is not a niche exploit against one memory type. It works wherever the model can read its own prior state.
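To make the per-backend variance easier to see, the figures quoted above can be tabulated and summarised. This is only a sketch: the dictionary simply restates the numbers from the paragraph, and the helper names (`backend_spread`, `model_mean`) are hypothetical, not part of the authors' benchmark.

```python
# Undefended ASR (%) exactly as quoted in the paragraph above.
ASR = {
    "Gemini 3.1 Pro": {"Explicit": 95, "Mem0": 85, "RAG": 80, "Context": 100},
    "GPT-5-mini": {"Explicit": 15, "Mem0": 85, "RAG": 80, "Context": 60},
}


def backend_spread(asr):
    """Return {backend: (lowest ASR, highest ASR)} across the models."""
    backends = next(iter(asr.values())).keys()
    return {
        b: (min(m[b] for m in asr.values()), max(m[b] for m in asr.values()))
        for b in backends
    }


def model_mean(asr):
    """Return the unweighted mean ASR per model over the four backends."""
    return {model: sum(v.values()) / len(v) for model, v in asr.items()}


spread = backend_spread(ASR)
means = model_mean(ASR)
```

On these figures, Mem0 and RAG show no gap between the two models, while Explicit Memory ranges from 15% to 95%. The unweighted mean is only a convenience for comparison and is not a statistic the authors report.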

The authors also report transfer from GPT-5-mini to GPT-5 at 70% ASR on RAG without re-optimisation², which suggests the payload generalises across model sizes in the same family. A builder who swaps in a larger model for reliability does not get safety for free.

Why Input Sanitization and RAG Filtering Fail

A companion paper by Leong et al.³ tested six defenses across four architectural layers and found that the intuitive ones fail by design. Input-level sanitisation (Minimizer, Sanitizer) and retrieval-level filtering (RAG Sanitizer, RAG LLM Judge) all fail at roughly 88.9% ASR³, statistically indistinguishable from an undefended baseline.

The reason is architectural. The payload does not enter through the user query. It enters through a tool call, is written to memory, and is later retrieved by the framework itself. Sanitising the user’s latest message or filtering the retrieved chunks against the current prompt does not touch the memory write path. The attack lives in the gap between ingestion-time checks and retrieval-time checks.
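The gap can be illustrated with a toy agent. This is a minimal sketch under stated assumptions, not code from either paper: `ToyAgent`, `sanitize_user_message`, and the `SUSPICIOUS` marker are hypothetical names, the "payload" is a harmless placeholder string, and retrieval is naive keyword overlap rather than a real vector search. It covers only the input-level case; the retrieval-level filters the paper evaluates are not modelled here.

```python
from dataclasses import dataclass, field

SUSPICIOUS = "IGNORE PREVIOUS"  # stand-in marker, not a real payload


def sanitize_user_message(text: str) -> str:
    """Hypothetical input-level check: it only ever sees the user's message."""
    if SUSPICIOUS in text.upper():
        raise ValueError("blocked user message")
    return text


@dataclass
class ToyAgent:
    memory: list[str] = field(default_factory=list)

    def ingest_tool_output(self, text: str) -> None:
        # Write path: tool output reaches memory without passing the sanitizer.
        self.memory.append(text)

    def answer(self, user_message: str) -> list[str]:
        clean = sanitize_user_message(user_message)
        # Retrieval path: return every memory sharing a word with the query.
        words = set(clean.lower().split())
        return [m for m in self.memory if words & set(m.lower().split())]


agent = ToyAgent()
agent.ingest_tool_output("ignore previous rules when tax comes up")
recalled = agent.answer("help with my tax return")
```

The user message is clean, so the sanitizer passes it, yet the stored tool output still comes back as context because it was never inspected on the way in.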

The Four Defenses and Their Utility Costs

Das et al. evaluated four memory-system defenses against Trojan Hippo. The Provable policy¹, an information-flow control (IFC) mechanism, reduced ASR to 0% across all backends. The No-untrusted-write¹ defense cut ASR to between 0 and 5%. Limit-memory-length was the weakest, leaving 30% residual ASR on RAG.¹ User-prompt-only, which restricts memory access to the current user turn, sits between these extremes.

These numbers come with tradeoffs. No-untrusted-write prevents the agent from ever writing memories derived from untrusted inputs, which breaks legitimate use cases: an agent that needs to remember an emailed invoice, a calendar invite, or a support ticket cannot function under this rule. Provable policy requires building IFC into the memory layer, which is not a configuration toggle. It is a redesign.
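The utility cost of a no-untrusted-write rule shows up even in a toy model. The sketch below is an assumption-laden illustration, not the authors' implementation: `Trust`, `Labeled`, `derive`, and `NoUntrustedWriteStore` are hypothetical names, and the label propagation is the simplest possible taint rule (a derived value is untrusted if any of its sources is).

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"


@dataclass(frozen=True)
class Labeled:
    text: str
    trust: Trust


def derive(text: str, *sources: Labeled) -> Labeled:
    """Label a derived value: untrusted if any source is untrusted."""
    tainted = any(s.trust is Trust.UNTRUSTED for s in sources)
    return Labeled(text, Trust.UNTRUSTED if tainted else Trust.TRUSTED)


class NoUntrustedWriteStore:
    """Toy memory store that refuses any write carrying an untrusted label."""

    def __init__(self) -> None:
        self.entries: list[Labeled] = []

    def write(self, item: Labeled) -> bool:
        if item.trust is Trust.UNTRUSTED:
            return False  # rejected: derived from untrusted input
        self.entries.append(item)
        return True


store = NoUntrustedWriteStore()
user_note = Labeled("User prefers PDF invoices", Trust.TRUSTED)
email = Labeled("Invoice from a vendor, due Friday", Trust.UNTRUSTED)
summary = derive("Vendor invoice is due Friday", email, user_note)

wrote_note = store.write(user_note)
wrote_summary = store.write(summary)
```

The summary of the emailed invoice inherits the untrusted label and is refused, which is exactly the legitimate use case the rule breaks. A full IFC design would also need to constrain what labelled data may flow out to tools, which this sketch does not attempt.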

The authors released their OpenEvolve-based benchmark⁴ to make this tradeoff measurable. Builders can now stress-test a memory backend against adaptive red-teaming rather than trusting a static safety eval.

What Agent Frameworks Must Do Differently

LangChain, Letta, MemGPT, and CrewAI all treat memory as an architectural primitive: a vector store to query, a context window to fill, a tool to call. None of them, by default, track where a memory came from, how long it has been dormant, or whether its provenance should gate its retrieval.

The Trojan Hippo results suggest three minimum requirements for any production agent with persistent memory:

Provenance tracking. Every memory entry must carry metadata about its source: user input, trusted tool, untrusted tool, or synthetic generation. Retrieval must be able to filter on that metadata.
Topic-gated retrieval. Memories from untrusted sources should not surface when the user is discussing sensitive topics unless explicitly cleared. This is not a binary “trust/untrust” flag; it is a context-dependent access control.
Quarantine of untrusted memories. A memory written from an untrusted input should enter a quarantine period, during which it is visible only under restricted conditions and subject to periodic re-evaluation or user confirmation.
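The three requirements above can be sketched together as a single retrieval gate. Everything here is an illustrative assumption rather than an API from any framework or from the paper: the names `Source`, `MemoryEntry`, `is_retrievable`, and `retrieve` are hypothetical, the quarantine length is arbitrary, and the topic is assumed to arrive as a label from some upstream classifier that is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    USER = "user"
    TRUSTED_TOOL = "trusted_tool"
    UNTRUSTED_TOOL = "untrusted_tool"
    SYNTHETIC = "synthetic"


TRUSTED_SOURCES = {Source.USER, Source.TRUSTED_TOOL}
SENSITIVE_TOPICS = {"finance", "health", "legal", "tax", "identity"}
QUARANTINE_SESSIONS = 5  # arbitrary illustrative value


@dataclass
class MemoryEntry:
    text: str
    source: Source  # provenance tag attached at write time
    written_at_session: int
    cleared_by_user: bool = False


def is_retrievable(entry: MemoryEntry, topic: str, current_session: int) -> bool:
    """Decide whether an entry may surface for this topic and session."""
    if entry.source in TRUSTED_SOURCES or entry.cleared_by_user:
        return True
    # Topic gate: never expires, because a dormant payload can outwait
    # any purely time-based release.
    if topic in SENSITIVE_TOPICS:
        return False
    # Quarantine: recent uncleared entries stay hidden even on benign topics.
    return current_session - entry.written_at_session >= QUARANTINE_SESSIONS


def retrieve(entries, topic: str, current_session: int):
    return [e for e in entries if is_retrievable(e, topic, current_session)]


entries = [
    MemoryEntry("User files taxes jointly", Source.USER, 0),
    MemoryEntry("Text from an inbound email", Source.UNTRUSTED_TOOL, 0),
]
on_tax = retrieve(entries, "tax", 200)
on_travel_early = retrieve(entries, "travel", 2)
on_travel_late = retrieve(entries, "travel", 10)
```

In this sketch synthetic entries are gated like untrusted ones, on the assumption that generated summaries may derive from untrusted input. A production design would need a real topic classifier and a user-facing clearing flow, both of which are outside what this example shows.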

None of these are free. Provenance tracking adds latency to the write path. Topic gating requires a taxonomy of sensitive topics that evolves with the domain. Quarantine breaks the illusion of seamless memory. But the alternative is an agent that recalls a poisoned email from three months ago and treats it as context for a tax-filing request.

The Bottom Line for Production Agents

The memory primitives that frameworks ship as defaults (explicit tool memory, agentic memory, RAG, and long context) are a unified attack surface, not isolated features. A single untrusted input can compromise any of them, survive for months of benign use, and trigger with near-certainty when the user finally discusses something valuable. Input sanitisation and RAG filtering do not cover the write path. Only memory-layer controls with measurable utility costs do.

Builders who treat persistent memory as architecture without attack surface are building agents that remember attacks better than they remember their users. The benchmark code is public. The defenses are catalogued. The gap is now visible. What remains is whether frameworks adopt the primitives, or whether production teams bolt them on after the first exfiltration.

Frequently Asked Questions

Does Trojan Hippo work against models that already refuse prompt injections?

Yes, and the Leong defense paper documents a specific case where refusal makes things worse. The qwq:32b reasoning model achieved 0% ASR with no defense applied because it refused the payload outright, but 100% ASR when Memory Sandbox was active: removing explicit recall forced it onto the RAG pathway, where compliance-framed documents overrode its safety training. Refusal at the input layer does not protect against retrieval-layer attacks.

How does transfer between model sizes work without re-optimization?

The authors report that a payload optimized against GPT-5-mini achieved 70% ASR on GPT-5’s RAG backend without any re-tuning, suggesting the attack exploits architectural properties shared across a model family rather than model-specific vulnerabilities. This means upgrading to a larger or more capable model within the same provider family does not close the gap: a team replacing GPT-5-mini with GPT-5 for production reliability inherits the same memory-surface exposure.

Which defense should a team deploy first if No-untrusted-write breaks their use case?

User-prompt-only, restricting memory access to the current user turn, sits between the extremes. It avoids the utility loss of No-untrusted-write (agents can still ingest untrusted content) while reducing attack surface versus an undefended baseline. It is weaker than Provable policy but requires no IFC engineering. For teams whose agents handle email or support tickets, User-prompt-only is the practical starting point; Limit-memory-length is the weakest option and leaves 30% residual ASR on RAG.

What does the OpenEvolve benchmark actually test that a static eval doesn’t?

OpenEvolve adapts its attack prompts across iterations based on which ones previously succeeded, stress-testing defenses against progressively optimized payloads rather than a fixed jailbreak set. A static eval might show 0% ASR against known attack templates, while OpenEvolve iterates until it finds a variant that bypasses the defense, which is how the 30% residual ASR on Limit-memory-length was discovered. The benchmark code covers all four memory backends and all four defenses, so builders can compare their configuration against the published numbers.

Are there agent frameworks that already implement provenance tracking?

None of the major open-source frameworks (LangChain, Letta, MemGPT, CrewAI) track memory provenance by default as of May 2026. The Das et al. benchmark was released specifically because no framework ships the instrumentation needed to evaluate this attack class. Building provenance tracking requires adding source-tag metadata at the memory-write layer and filtering at retrieval, which is not a configuration change in any current framework but a code-level extension teams must build themselves.