AI agents use four distinct memory tiers—working, episodic, semantic, and procedural—stored across context windows, vector databases, knowledge graphs, and model weights. Getting that architecture right is the difference between an agent that accumulates knowledge across sessions and one that starts from scratch every time. In production, the gap between these outcomes is measured in user trust, token costs, and cascading failure rates.
What Is AI Agent Memory?
An AI agent’s “memory” describes every mechanism by which past information influences future behavior. This is not a single system. It is a hierarchy of storage types with different read/write speeds, capacities, and failure modes—analogous to the memory hierarchy in operating systems, a parallel that shaped how the most influential agent frameworks were designed.
IBM’s technical documentation distinguishes four types practitioners should know1:
- Working memory — the agent’s active context window, fast but bounded (typically 8K–200K tokens depending on model)
- Episodic memory — records of specific past interactions or events, useful for case-based reasoning and learning from outcomes
- Semantic memory — structured factual knowledge: definitions, rules, relationships between entities
- Procedural memory — learned skills and behavioral patterns, often encoded in model weights or tool schemas
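As a rough mental model (not any framework's actual API), the four tiers can be sketched as one agent-state object. The class, field names, and the 50-item window cap below are all illustrative:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Illustrative four-tier memory layout; names are invented for this sketch."""
    # Working memory: bounded and fast; items fall off when the window overflows
    working: deque = field(default_factory=lambda: deque(maxlen=50))
    # Episodic memory: append-only, time-stamped interaction records
    episodic: list = field(default_factory=list)
    # Semantic memory: durable facts keyed for lookup
    semantic: dict = field(default_factory=dict)
    # Procedural memory: tool schemas / skills, rarely updated at runtime
    procedural: dict = field(default_factory=dict)

    def observe(self, timestamp: str, event: str) -> None:
        self.working.append(event)                # visible to the model now
        self.episodic.append((timestamp, event))  # recorded permanently


mem = AgentMemory()
mem.observe("2026-01-15T10:00", "user asked for a refund")
mem.semantic["user.plan"] = "enterprise"
mem.procedural["issue_refund"] = {"args": ["order_id", "amount"]}
```

The point of the separation is that each field has a different write path and lifetime: `working` evicts automatically, `episodic` only grows, and `procedural` changes only at deployment time.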
Most production agents use only working memory by default, which explains why commercial chat assistants showed roughly a 30% accuracy drop on long-horizon memory tasks in the LongMemEval benchmark when no memory augmentation was applied.2
How Each Memory Type Works
Working Memory: The Context Window
The context window is the only memory an LLM has natively. Every token the model can “see” fits here; everything outside it doesn’t exist to the model. Modern frontier models have pushed this from 4K to 1M tokens (Google Gemini 1.5), but capacity alone doesn’t solve the problem.
Research published in 2025 identified a phenomenon called transcript replay failure: when full conversation history accumulates without compression, attention selectivity degrades, early errors persist and re-emerge as hallucination carryover, and the model drifts from constraints established in earlier turns.3 Bigger context windows delay this failure; they don’t prevent it.
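One common mitigation is to fold the oldest turns into a running summary once a token budget is exceeded, rather than replaying the full transcript. A minimal sketch, with two loud simplifications: token counting is approximated by word count, and string concatenation stands in for a real summarizer model:

```python
def compress_history(turns: list[str], summary: str, budget: int = 100) -> tuple[list[str], str]:
    """Fold the oldest turns into the summary until the transcript fits the budget.

    Word count approximates tokens; a production system would call a
    summarizer LLM instead of concatenating the evicted text.
    """
    def size(ts: list[str]) -> int:
        return sum(len(t.split()) for t in ts)

    while len(turns) > 1 and size(turns) > budget:
        oldest = turns.pop(0)
        summary = f"{summary} | {oldest}" if summary else oldest  # stand-in for an LLM summary
    return turns, summary


turns = [f"turn {i}: " + "word " * 30 for i in range(10)]
turns, summary = compress_history(turns, "", budget=100)
```

This keeps the live window bounded, which is exactly what transcript replay failure says bigger windows alone do not.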
Episodic Memory: What Happened, When
Episodic memory stores time-stamped records of specific interactions—what the user said, what the agent did, what the outcome was. Unlike semantic memory, it preserves the sequence and context of events rather than abstracting them into general facts.
Letta (the production evolution of MemGPT from UC Berkeley) implements episodic memory as recall memory: an external store of conversation history the agent can search by date or content using dedicated tool calls. The MemGPT architecture, described in the 2023 paper “MemGPT: Towards LLMs as Operating Systems,” treats the context window as RAM and external stores as disk—with explicit paging operations when memory exceeds the window limit.4
```python
# Letta agent memory architecture (simplified)
# Core memory     = in-context "RAM"
# Archival memory = external "disk" (vector DB)
# Recall memory   = conversation history store

agent.archival_memory_insert("User prefers concise responses, confirmed 2026-01-15")
results = agent.archival_memory_search("user preferences")
agent.memory_replace("persona", "You are a concise assistant...")
```
Semantic Memory: What Is True
Semantic memory holds generalized facts, rules, and relationships—knowledge that persists independent of when it was learned. In agent systems, this maps to vector databases (for unstructured retrieval) and knowledge graphs (for structured relational reasoning).
Mem0, published as a production-ready architecture in April 2025, uses a two-tier approach: a vector store for fast similarity search and an optional graph layer (Mem0ᵍ) that represents memories as a directed labeled graph G=(V,E,L), where nodes are entities, edges are relationships, and labels assign semantic types.5 A conflict detector and LLM-powered resolver handle contradictions automatically—deciding whether to merge, invalidate, or keep competing facts.
The results are significant: Mem0 achieves a 26% relative improvement in LLM-as-a-Judge scores over full-context approaches, while cutting p95 latency by 91% and token costs by over 90%.5
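The conflict-resolution step can be illustrated with a toy store that keys facts by (entity, attribute) and supersedes the old value on contradiction. This is a stand-in for the idea, not Mem0's actual API; a real resolver would ask an LLM whether to merge, invalidate, or keep both facts:

```python
from datetime import datetime, timezone


class FactStore:
    """Toy semantic store with explicit conflict handling (not Mem0's API)."""

    def __init__(self):
        self.facts = {}        # (entity, attribute) -> current value
        self.invalidated = []  # superseded facts, kept for auditing

    def write(self, entity: str, attribute: str, value: str) -> str:
        key = (entity, attribute)
        old = self.facts.get(key)
        if old is None:
            self.facts[key] = value
            return "added"
        if old == value:
            return "kept"  # duplicate write: nothing to do
        # Contradiction detected. Here we always supersede; Mem0's resolver
        # makes this decision with an LLM instead of a fixed rule.
        self.invalidated.append((key, old, datetime.now(timezone.utc)))
        self.facts[key] = value
        return "invalidated-old"


store = FactStore()
store.write("alice", "city", "Paris")
action = store.write("alice", "city", "Berlin")  # conflicting fact arrives later
```

Keeping superseded values in `invalidated` rather than deleting them preserves an audit trail, which matters once memory writes come from untrusted inputs.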
Procedural Memory: How to Do Things
Procedural memory encodes skills and behavioral patterns. For LLM-based agents, this lives primarily in model weights (learned during training), tool schemas, and system prompts. It’s the least dynamic memory tier—updating it requires fine-tuning or explicit schema changes rather than runtime writes.
Some architectures blur this boundary by allowing agents to write new tool definitions or update their own system prompt via function calls, but this remains an active research area rather than a stable production pattern.
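To make the “tool schemas as procedural memory” point concrete, here is a minimal function-calling-style tool definition. The `lookup_order` tool and its fields are hypothetical; the shape follows the common JSON Schema convention used by function-calling APIs:

```python
import json

# A tool schema is procedural memory the model reads on every call: it
# encodes *how* to perform a skill, and it changes via deployment or an
# explicit schema edit, not via runtime memory writes.
lookup_order_tool = {
    "name": "lookup_order",
    "description": "Fetch an order's status and line items by order ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Internal order ID"},
        },
        "required": ["order_id"],
    },
}

serialized = json.dumps(lookup_order_tool, indent=2)
```

An architecture that lets the agent write new entries like this at runtime is exactly the boundary-blurring described above.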
Architecture Comparison
Different production systems make different trade-offs across the memory hierarchy:
| Architecture | Working Memory | Episodic | Semantic | Update Mechanism | Token Cost |
|---|---|---|---|---|---|
| Plain RAG | Context window | None | Vector DB (static) | Manual re-index | Medium |
| Letta/MemGPT | Compressed core memory | Recall store | Archival memory | Agent-controlled paging | Medium-High |
| Mem0 | Compressed context | Session history | Vector + Graph DB | LLM extraction + conflict resolution | Low (−90%) |
| Observational Memory (Mastra) | Observation log + raw history | Observer/Reflector agents | Stable prefix cache | Background compression at 30K tokens | Very Low (−10x) |
| Zep (Temporal KG) | Context window | Temporal event graph | Knowledge graph | Automated extraction | Low |
| LangGraph + Checkpointer | Thread state | Checkpoint store | External DB optional | State reducers | Variable |
What Practitioners Are Actually Building
The Observational Memory Breakthrough
In early 2026, Mastra’s observational memory system demonstrated that a text-based, cacheable approach could outperform RAG while cutting token costs by up to 10x. The system uses two background agents: an Observer that compresses conversation history into a dated observation log, and a Reflector that consolidates older observations into higher-level summaries.6
The key insight is architectural: by keeping the observation log as a stable, append-only prefix, the system enables aggressive prompt caching at the provider level—a 4–10x cost multiplier that RAG-based systems generally can’t exploit because their retrieval results change with each query.
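The cache-friendliness claim is easy to see in a sketch: because the observation log is append-only, each turn's prompt is a byte-identical extension of the previous one, so a provider-side prefix cache hits on everything before the newest entry. The formatting below is illustrative, not Mastra's actual prompt layout:

```python
def build_prompt(observation_log: list[str], current_turn: str) -> str:
    # Stable prefix: the log is append-only, so this section is
    # byte-identical across turns and eligible for prompt caching.
    prefix = "OBSERVATIONS\n" + "\n".join(observation_log)
    return prefix + "\n\nCURRENT TURN\n" + current_turn


log = ["2026-01-10: user prefers concise answers"]
prompt_a = build_prompt(log, "turn 1")

log.append("2026-01-11: user works in fintech")  # Observer appends, never rewrites
prompt_b = build_prompt(log, "turn 2")

# Everything prompt_a contained before its current turn is still a literal
# prefix of prompt_b, so a cache built on turn 1 applies on turn 2.
shared = prompt_a.split("\n\nCURRENT TURN")[0]
```

Contrast this with RAG, where retrieved chunks differ per query and land in the middle of the prompt, breaking the prefix on every turn.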
On LongMemEval, Mastra’s observational memory scored 84.23% on GPT-4o, compared to its own RAG baseline at 80.05%, and 94.87% using GPT-5-mini.6
Knowledge Graphs for Relational Reasoning
For agents that need to track relationships between entities over time—customer support agents tracking an account’s full history, research agents mapping connections across documents—knowledge graphs outperform pure vector stores on multi-hop queries.
Zep’s temporal knowledge graph architecture extracts entity-relationship triplets from conversations automatically and versions them with timestamps, enabling queries like “what did this user believe about X before we corrected it?”7 Their evaluation showed 18.5% accuracy improvements on LongMemEval while cutting response latency by 90%.
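The “what did this user believe before we corrected it” query pattern can be sketched with triplets carrying validity intervals. This is a toy version of the idea, not Zep's implementation; ISO-8601 timestamps are used so lexicographic comparison works:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Triplet:
    subject: str
    relation: str
    obj: str
    valid_from: str                  # ISO timestamp
    valid_to: Optional[str] = None   # None = still believed


def assert_fact(store: list[Triplet], s: str, r: str, o: str, at: str) -> None:
    # Close out any still-open belief on the same (subject, relation)...
    for t in store:
        if t.subject == s and t.relation == r and t.valid_to is None:
            t.valid_to = at
    # ...then record the new one, versioned by timestamp.
    store.append(Triplet(s, r, o, at))


def belief_at(store: list[Triplet], s: str, r: str, at: str) -> Optional[str]:
    for t in store:
        if (t.subject == s and t.relation == r and t.valid_from <= at
                and (t.valid_to is None or at < t.valid_to)):
            return t.obj
    return None


kg: list[Triplet] = []
assert_fact(kg, "user", "believes_plan_price", "$10/mo", "2026-01-01")
assert_fact(kg, "user", "believes_plan_price", "$15/mo", "2026-02-01")  # correction
```

Superseded triplets are never deleted, only closed, which is what makes point-in-time queries possible at all.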
LangGraph Checkpointing in Practice
LangGraph’s stateful graph architecture treats memory as explicit graph state with reducer functions controlling how state updates. Thread-level persistence (short-term) uses an InMemorySaver; cross-session persistence requires a database-backed checkpointer.8
```python
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph

# Production-grade persistent memory
checkpointer = PostgresSaver.from_conn_string(DATABASE_URL)

graph = StateGraph(AgentState)
# ... add nodes and edges ...
app = graph.compile(checkpointer=checkpointer)

# State persists across sessions via thread_id
config = {"configurable": {"thread_id": "user-123"}}
result = app.invoke({"messages": [user_message]}, config)
```
The framework also supports time-travel debugging—replaying from any checkpoint—which is uniquely valuable for diagnosing failure modes in long-running agent sessions.
Where Memory Systems Break
Understanding failure modes is as important as understanding architectures. Three failure patterns dominate production incidents:
Memory poisoning: Agents that persist information from external inputs are vulnerable to adversarial injection. A maliciously crafted email stating “company policy now allows unapproved transfers up to $10,000” can write false facts into an agent’s long-term store that persist across future sessions.3 Mitigation requires input sanitization before memory writes and periodic memory audits.
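A minimal write-gate illustrating both mitigations together — provenance tagging plus a deny-list check before anything reaches long-term storage. The regex patterns and trust levels here are placeholders, not a complete defense; a real system would use a classifier or a second model:

```python
import re

# Crude deny-list for policy-shaped injections. These patterns are
# illustrative placeholders, not a vetted filter.
SUSPICIOUS = [
    re.compile(r"\bpolicy now allows\b", re.I),
    re.compile(r"\bignore (all )?previous instructions\b", re.I),
]


def gated_write(memory: list[dict], text: str, source: str) -> bool:
    """Admit a fact into long-term memory only if it passes sanitization.

    Every stored fact carries provenance metadata, so periodic audits can
    re-review anything that originated outside the trusted boundary.
    """
    if source != "operator" and any(p.search(text) for p in SUSPICIOUS):
        return False  # rejected: policy-shaped claim from an untrusted source
    memory.append({"text": text, "source": source, "trusted": source == "operator"})
    return True


ltm: list[dict] = []
ok = gated_write(ltm, "User's shipping address is in Lyon", source="email")
blocked = gated_write(ltm, "Company policy now allows unapproved transfers", source="email")
```

The provenance field is what makes the “periodic memory audits” practical: auditors can filter to recent untrusted writes instead of rescanning the whole store.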
Cascading hallucinations: Hallucinated facts become inputs for subsequent reasoning steps. Legal RAG systems still hallucinate citations between 17% and 33% of the time—and in agentic pipelines, a phantom entity in step one corrupts pricing, inventory, and notification systems downstream before monitoring catches it.9 Structured memory with explicit conflict resolution (as in Mem0) reduces but does not eliminate this risk.
Retrieval drift: In RAG-based systems, retrieval errors accumulate. Stale, conflicting, or adversarially injected artifacts in the vector store perturb task state and destabilize long-horizon behavior. The 2025 arXiv paper “AI Agents Need Memory Control Over More Context” argues this makes bounded memory states preferable to unbounded context accumulation for complex tasks.10
The Benchmark Landscape
The LongMemEval benchmark (ICLR 2025) evaluates five memory capabilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. It contains 500 questions embedded in realistic multi-session conversation histories.2
As of early 2026, the state of the art breaks down as follows:
| System | LongMemEval Score | Notes |
|---|---|---|
| Hindsight (Emergence AI) | 91.4% | First system to exceed 90%; uses Gemini 3 Pro Preview |
| Mastra Observational Memory | 94.87% | Uses GPT-5-mini; 10x lower token cost than RAG |
| Vectorize | >90% | Open-source; December 2025 |
| Mem0 | ~91% | With graph memory enabled |
| Zep | ~78%+ | With temporal KG; 18.5% over baseline |
| GPT-4o (baseline, no augmentation) | 30–70% | Varies by task category |
The gap between augmented systems (>90%) and unaugmented models (30–70%) represents the practical value of memory architecture work. No frontier model achieves reliable long-term memory without external memory infrastructure.
Choosing an Architecture
The right architecture depends on what your agent needs to remember and for how long:
- Single-session, bounded context: Plain context window management with summarization is sufficient. Use `ConversationSummaryBufferMemory` in LangChain or equivalent.
- Multi-session with user preferences: Mem0 or a vector store with structured extraction handles this well at low token cost.
- Long-running tasks with complex relationships: Combine a knowledge graph (Zep, Mem0ᵍ) with episodic recall. Pay the graph maintenance cost upfront.
- High-volume agentic workflows: Observational memory’s caching advantage compounds at scale. 10x token cost reduction matters at millions of calls per day.
- Stateful workflows with human oversight: LangGraph’s checkpointing gives you time-travel debugging and resumable state—worth the orchestration overhead.
Frequently Asked Questions
Q: What’s the difference between RAG and agent memory? A: RAG retrieves from a static external corpus at query time; agent memory is dynamic, updating as the agent operates, persisting information across sessions, and managing contradictions. RAG is a retrieval tool; agent memory is a stateful infrastructure layer.
Q: How do I prevent memory poisoning in production agents? A: Treat every write from an external source (user input, web content, emails) as untrusted. Apply input sanitization before memory writes, implement provenance metadata on stored facts, and run periodic consistency checks that flag recently added memories from external sources for review.
Q: What benchmark should I use to evaluate my agent’s memory system? A: LongMemEval (ICLR 2025) is the standard benchmark for long-term interactive memory, covering information extraction, temporal reasoning, multi-session reasoning, knowledge updates, and abstention. Use it as your baseline before measuring task-specific performance.
Q: Is in-context memory ever sufficient, or do agents always need external stores? A: For tasks bounded within a single session where total tokens fit comfortably in the context window, in-context memory is sufficient. External stores become necessary when sessions span days or weeks, when information exceeds the context limit, or when multiple agent instances need shared memory state.
Q: How should vector stores and knowledge graphs be used together? A: Vector stores handle similarity-based retrieval over unstructured or semi-structured content; knowledge graphs handle structured relational queries over entities and their relationships. Use both in parallel—vector search to find relevant content, graph traversal to resolve relationships and update entity state. Mem0’s architecture is a practical reference implementation of this hybrid approach.
Footnotes
1. IBM. “What Is AI Agent Memory?” IBM Think Topics. https://www.ibm.com/think/topics/ai-agent-memory
2. Wu, X., et al. “LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.” ICLR 2025. https://arxiv.org/abs/2410.10813
3. Galileo AI. “7 AI Agent Failure Modes and How To Fix Them.” Galileo Blog, 2025. https://galileo.ai/blog/agent-failure-modes-guide; arXiv. “AI Agents Need Memory Control Over More Context.” arXiv 2601.11653. https://arxiv.org/abs/2601.11653
4. Letta. “Intro to Letta / MemGPT.” Letta Documentation. https://docs.letta.com/concepts/memgpt/
5. Mem0 Research Team. “Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.” arXiv 2504.19413, April 2025. https://arxiv.org/abs/2504.19413
6. VentureBeat. “‘Observational memory’ cuts AI agent costs 10x and outscores RAG on long-context benchmarks.” 2026. https://venturebeat.com/data/observational-memory-cuts-ai-agent-costs-10x-and-outscores-rag-on-long
7. Rasmussen, Preston. “Zep: A Temporal Knowledge Graph Architecture for Agent Memory.” Zep Blog, January 2025. https://blog.getzep.com/content/files/2025/01/ZEP__USING_KNOWLEDGE_GRAPHS_TO_POWER_LLM_AGENT_MEMORY_2025011700.pdf
8. LangChain. “Memory - LangGraph Documentation.” https://docs.langchain.com/oss/python/langgraph/add-memory
9. Weaviate. “Context Engineering - LLM Memory and Retrieval for AI Agents.” Weaviate Blog, 2025. https://weaviate.io/blog/context-engineering
10. Letta. “Benchmarking AI Agent Memory: Is a Filesystem All You Need?” Letta Blog. https://www.letta.com/blog/benchmarking-ai-agent-memory