Hijacking AI Agent Memory: One Conversation Can Plant a Persistent Trojan

A trojan planted through conversation

A paper submitted to arXiv on May 28 demonstrates that an attacker can inject a persistent, malicious payload into an AI agent’s long-term memory using only ordinary multi-turn conversation, with no access to model weights, system prompts, or memory databases. The payload then activates in a later session, possibly against a different user. The paper, MemPoison (arXiv:2605.29960), reports attack success rates of up to 95% against selective agent memory systems. That result breaks the assumption that the extraction and rewriting pipelines in modern agents would filter out adversarial content.

Why earlier memory-poisoning attacks stalled

Prior work on poisoning agent memory, notably AgentPoison and MINJA, operated under a convenient assumption for attackers: that user inputs get stored verbatim. Real agent systems do not work that way. Production agents run selective extraction pipelines that decide what to remember, then rewriting pipelines that compress or rephrase stored memories before committing them. These filters shredded the naive injection strategies of earlier attacks, which depended on writing raw adversarial strings directly into a memory store. As a result, prior approaches achieved negligible success rates against any agent with a non-trivial memory pipeline, according to the MemPoison paper’s analysis of baselines.

MemPoison’s contribution is solving for that pipeline. The attack is entirely black-box: the attacker interacts with the agent through normal dialogue and crafts inputs that survive both extraction and rewriting, landing in long-term memory as coherent, retrievable entries.

Three mechanisms, one optimization loop

MemPoison works by jointly optimizing three components that target distinct failure points in the memory pipeline.

Semantic relational bridging binds the trigger phrase and the malicious payload into a single coherent statement. The goal is to make the memory extraction step store them as one unit rather than splitting them. An extraction model that sees one factually coherent sentence tends to preserve it intact, so the attacker’s trigger and payload become inseparable at the storage boundary.

Entity masquerading disguises the trigger as a named entity (a person, product, or location) so that the rewriting model preserves it verbatim during compression. LLMs tend to treat named entities as atomic units that should not be paraphrased. By wrapping the trigger in entity-like syntax, the attack exploits that preservation instinct to keep the trigger string unmodified through the rewrite stage.

Joint embedding optimization clusters the poisoned memory entries tightly together in the embedding space used for retrieval, while pushing them away from benign entries. The mechanistic analysis in the paper shows this exploits embedding-space anisotropy, the tendency of transformer-derived embeddings to cluster along dominant directions. Poisoned entries that occupy a tight, isolated cluster bias the retrieval step toward selecting them whenever a sufficiently nearby query arrives.
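The geometric idea (tight inside the cluster, far from everything else) can be made concrete with a toy measurement. The sketch below is not the paper’s optimization procedure; it only computes the quantity the paragraph describes, using invented 2-D vectors and a hypothetical helper named `cluster_isolation`. A defender could monitor a score like this as one signal among several.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cluster_isolation(suspect, benign):
    """Mean pairwise similarity inside `suspect` minus mean similarity to `benign`.

    Hypothetical metric for illustration: values near 1 mean the suspect
    entries sit close to each other and far from the benign entries.
    """
    intra = [cosine(a, b) for a, b in combinations(suspect, 2)]
    cross = [cosine(s, b) for s in suspect for b in benign]
    return sum(intra) / len(intra) - sum(cross) / len(cross)

# Toy 2-D embeddings, invented for this example.
benign = [[0.0, 1.0], [0.1, 1.0], [-0.1, 1.0]]
tight_group = [[1.0, 0.02], [1.0, 0.0], [1.0, -0.02]]   # tight and isolated
loose_group = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]      # spread out, overlaps benign

tight_score = cluster_isolation(tight_group, benign)
loose_score = cluster_isolation(loose_group, benign)
print(f"tight group: {tight_score:.3f}, loose group: {loose_score:.3f}")
```

Real retrieval embeddings have hundreds of dimensions and the anisotropy the paper describes, so any threshold on such a score would need calibration against the deployed embedding model.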

Evaluation under controlled conditions

Across diverse agent domains and memory mechanisms, MemPoison achieves attack success rates of up to 95% while maintaining benign-task accuracy, significantly outperforming prior baselines. The paper evaluates against multiple agent architectures and reports consistent results. Section 5.5 of the paper tests the attack on a real-world agent system, not only on synthetic benchmarks.

The attack succeeds through two activation pathways. In the user-triggered path, the trigger is embedded in external content (a webpage or a document) that an innocent user asks the agent about; the victim has never interacted with the attacker. In the attacker-triggered path, the attacker directly issues a query containing the trigger in a later session. Per the paper’s threat model, the user-triggered variant is the more consequential one operationally, because its blast radius extends to users who never chose to interact with the attacker.

Why the obvious defenses do not work

The paper tests two natural defenses: perplexity-based filtering and paraphrasing. Perplexity filters flag inputs whose token-level probability under a language model is unusually low (that is, whose perplexity is unusually high), a common heuristic for detecting adversarial or garbled text. MemPoison evades this because the semantic bridging step produces fluent, coherent sentences. The poisoned entries are not anomalous in terms of language model probability; they are anomalous only in terms of what they cause the agent to do downstream.
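To see why fluent text passes, it helps to look at how such a filter scores input. The sketch below assumes per-token natural-log probabilities are already available from some scoring language model; the probabilities, the `max_perplexity` threshold of 50, and the function names are all invented for illustration and do not come from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def passes_filter(token_logprobs, max_perplexity=50.0):
    """Accept the input unless its perplexity exceeds the (arbitrary) threshold."""
    return perplexity(token_logprobs) <= max_perplexity

# Invented log-probabilities standing in for a real model's output.
fluent = [math.log(0.2)] * 12     # every token is moderately likely
garbled = [math.log(0.001)] * 12  # every token is very unlikely

print(passes_filter(fluent))   # fluent text passes
print(passes_filter(garbled))  # garbled text is flagged
```

The filter only measures how predictable the wording is. A sentence that reads naturally scores well regardless of what it instructs the agent to do later, which is the gap the article describes.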

Paraphrasing (rewording stored memories to strip adversarial structure) fails because entity masquerading protects the trigger. A paraphrase model that preserves named entities will preserve the disguised trigger by design, according to the defense evaluation in the paper.

The authors argue that these defenses are structurally insufficient because they operate on surface-level text properties while the attack operates on embedding-space geometry. Filtering text that looks normal, or rewriting text that was designed to survive rewriting, is a losing proposition unless the retrieval mechanism itself is addressed.

The production threat is not hypothetical

A related attack, eTAMP (arXiv:2604.02623), demonstrates cross-session, cross-site memory poisoning through malicious webpages rather than direct conversation. eTAMP achieves a 32.5% success rate on GPT-5-mini under environmental stress conditions, reported as an 8x amplification over baseline. The vector is different (webpage injection rather than dialogue), but the outcome is the same: a durable payload in persistent memory that fires against a different user in a different context.

The eTAMP work explicitly names ChatGPT Atlas and Perplexity Comet as targets. These are AI browsers that maintain cross-session memory while ingesting untrusted web content. The attack surface is not a lab curiosity; it is the default operating mode of shipping products.

Together, MemPoison and eTAMP cover the two main input channels for persistent-memory agents: conversation and web content. Neither channel has a robust, default-on defense in production deployments as of mid-2026.

What to do about it

The practical takeaway is that persistent agent memory is security infrastructure, not a convenience feature. Three defensive primitives address the known attack surface:

Provenance tagging. Every entry committed to memory should carry metadata about its origin: which user session created it, what input channel it came from, whether the source was an authenticated user or anonymous content. Without provenance, there is no way to audit which entries an attacker planted, or to scope a cleanup operation after a discovered compromise.
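A minimal sketch of what provenance tagging could look like, assuming an in-memory store. The class names, fields, and session identifiers below are hypothetical and chosen only to mirror the metadata listed above; a production system would persist this alongside the embedding index.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    session_id: str       # which user session created the entry
    channel: str          # input channel, e.g. "chat", "web", "document"
    authenticated: bool   # authenticated user vs. anonymous content

@dataclass
class MemoryEntry:
    text: str
    provenance: Provenance
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class MemoryStore:
    """Toy store (hypothetical) that refuses to hold an entry without provenance."""

    def __init__(self):
        self.entries = []

    def commit(self, text, provenance):
        entry = MemoryEntry(text, provenance)
        self.entries.append(entry)
        return entry

    def entries_from_session(self, session_id):
        """Audit helper: everything a given session wrote."""
        return [e for e in self.entries if e.provenance.session_id == session_id]

    def purge_session(self, session_id):
        """Scoped cleanup: drop one session's entries, return how many were removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if e.provenance.session_id != session_id]
        return before - len(self.entries)

store = MemoryStore()
store.commit("User prefers metric units.", Provenance("sess-alice", "chat", True))
store.commit("Planted entry one.", Provenance("sess-attacker", "chat", False))
store.commit("Planted entry two.", Provenance("sess-attacker", "chat", False))
```

With this in place, a cleanup after a discovered compromise becomes `store.purge_session("sess-attacker")` rather than a full flush of the memory store.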

Write-rate limits. MemPoison’s embedding optimization works best when the attacker can plant multiple poisoned entries that form a tight cluster. Rate-limiting how many new memories a single session can create, or how many memories can be committed per unit time, raises the cost of clustering attacks. It does not prevent single-entry payloads, but it degrades the optimization strategy that drives the highest success rates.
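One way to sketch a per-session write limit is a sliding window over commit timestamps. The class name, the limit of two writes, and the 60-second window are illustrative assumptions; timestamps are passed in explicitly (as seconds) so the behavior is easy to follow.

```python
from collections import defaultdict, deque

class WriteRateLimiter:
    """Hypothetical sliding-window limiter for memory commits, keyed by session."""

    def __init__(self, max_writes, window_seconds):
        self.max_writes = max_writes
        self.window_seconds = window_seconds
        self._writes = defaultdict(deque)  # session_id -> timestamps of recent commits

    def allow(self, session_id, now):
        """Return True and record the write if the session is under its limit."""
        recent = self._writes[session_id]
        # Forget commits that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_writes:
            return False
        recent.append(now)
        return True

limiter = WriteRateLimiter(max_writes=2, window_seconds=60)
decisions = [
    limiter.allow("s1", now=0),    # first write in the window
    limiter.allow("s1", now=10),   # second write, still allowed
    limiter.allow("s1", now=20),   # third write inside 60 s, refused
    limiter.allow("s2", now=20),   # a different session has its own budget
    limiter.allow("s1", now=61),   # the write at t=0 has expired
]
print(decisions)
```

An attacker who rotates sessions sidesteps a purely per-session budget, so a limit like this is better paired with a global or per-account cap.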

Friction-aware commit thresholds. Memory systems that commit aggressively, storing anything the extraction pipeline passes, are maximally vulnerable. Friction can take several forms: cross-referencing new entries against existing memories, flagging entries that shift retrieval patterns anomalously, or requiring higher confidence before committing content from low-trust sessions. Each reduces the attack surface at the cost of making the agent slightly less responsive to new information. A separate defense study (arXiv:2601.05504) found that pre-existing legitimate memories reduce attack effectiveness, which suggests that systems with richer, well-established memory stores are inherently harder to poison. That is an argument for conservative commit policies that favor durable, high-confidence entries.
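The “higher confidence for low-trust sessions” variant can be sketched as a commit gate whose bar rises as trust falls. This assumes the extraction pipeline exposes some confidence score between 0 and 1, which is a hypothetical input here; the base threshold and the increments are arbitrary illustrative values, not recommendations.

```python
def required_confidence(authenticated, channel):
    """Return the confidence a candidate memory must reach before it is committed.

    All numbers are illustrative assumptions and would need calibration.
    """
    threshold = 0.6
    if not authenticated:
        threshold += 0.2   # anonymous sessions face a higher bar
    if channel == "web":
        threshold += 0.1   # untrusted web content faces a higher bar still
    return threshold

def should_commit(extraction_confidence, authenticated, channel):
    """Gate a commit on the trust-adjusted threshold."""
    return extraction_confidence >= required_confidence(authenticated, channel)

print(should_commit(0.75, authenticated=True, channel="chat"))    # clears the base bar
print(should_commit(0.75, authenticated=False, channel="chat"))   # same score, refused
print(should_commit(0.95, authenticated=False, channel="web"))    # very high score passes
```

The same candidate memory is accepted from an authenticated chat session and refused from an anonymous one, which is the friction the paragraph describes: less responsiveness in exchange for a smaller attack surface.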

None of these are drop-in solutions. The warning about threshold calibration in the defense study (arXiv:2601.05504) applies to all of them. But operating a persistent-memory agent without any of them, in mid-2026, means running a system where a single conversation can plant a backdoor that outlives the session, survives the memory pipeline, and fires against users who never interacted with the attacker. That is the threat model now.

Frequently Asked Questions

Are freshly deployed agents more vulnerable than ones with established memory stores?

Yes. The defense study at arXiv:2601.05504 found that pre-existing legitimate memories dramatically reduce poisoning effectiveness. A newly shipped agent with a sparse memory bank gives poisoned entries less competition during retrieval, so the clustering strategy that MemPoison relies on is more effective against early-stage deployments than mature ones with hundreds of legitimate entries.

Why does eTAMP achieve 32.5% success compared to MemPoison’s 95%?

eTAMP operates under a stricter constraint: it must poison memory through a single webpage load rather than a multi-turn conversation, giving the attacker far less control over how the extraction pipeline processes the payload. The 32.5% figure is also measured on GPT-5-mini under environmental stress conditions, a harder target than the controlled evaluation environments used for the MemPoison benchmarks. The two numbers are not directly comparable on method alone; the cross-site, single-interaction vector is inherently noisier than sustained dialogue.

What does temporal decay in memory sanitization cost the agent?

Temporal decay reduces the retrieval weight of older entries over time, so a poisoned entry planted weeks ago gradually becomes inert. The cost is that legitimate but infrequently accessed memories (a user’s project preferences, a rare technical constraint they mentioned once) also decay, forcing the agent to re-learn information it already knew. The defense paper warns that threshold calibration is the hard part: decay too fast and the agent forgets useful context, decay too slow and the window for payload activation stays open.
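The calibration trade-off is easy to see with numbers. The sketch below assumes an exponential half-life form for the decay, which is one common choice and not necessarily the one the defense paper uses; the function name, similarity value, and half-lives are illustrative.

```python
def decayed_score(similarity, age_days, half_life_days):
    """Retrieval score after decay: the weight halves every `half_life_days`."""
    return similarity * 0.5 ** (age_days / half_life_days)

# The same 28-day-old entry with raw similarity 0.9, under two decay settings.
fast = decayed_score(0.9, age_days=28, half_life_days=7)    # four half-lives elapsed
slow = decayed_score(0.9, age_days=28, half_life_days=90)   # under a third of one

print(f"7-day half-life:  {fast:.3f}")
print(f"90-day half-life: {slow:.3f}")
```

With a 7-day half-life the month-old entry has almost no retrieval weight, whether it is a planted payload or a preference the user stated once. With a 90-day half-life both remain strongly retrievable. The decay function cannot tell the two apart, which is why calibration is the hard part.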

Can a poisoned memory be removed once an attack is discovered?

Removing a single poisoned entry does not necessarily break the attack. MemPoison’s joint embedding optimization plants multiple entries that form a tight cluster in the retrieval space, so deleting one may leave enough cluster density for the trigger to still fire. Without provenance tags on every memory entry, operators cannot identify the full set of entries planted during a single attack session, forcing a choice between targeted removal (risking incomplete cleanup) and flushing the entire memory store (losing all legitimate user context).