How On-Device AI Agents Keep Learning by Forgetting on Purpose

Can an agent with frozen weights keep learning? Yes, but only by forgetting on purpose. The CURATOR paper, posted to arXiv on 23 June 2026, shows a frozen on-device language-model agent that keeps learning new tasks while cutting its memory footprint by 2.7× and dropping a prompt-injection attack’s success rate from 0.75 to zero, all without reducing accuracy. The agent’s weights never change; what changes is which experiences it keeps. (arXiv:2606.25115)

How does CURATOR redefine continual learning for on-device agents?

Continual learning used to mean updating weights without catastrophically forgetting old tasks. CURATOR redefines it as memory curation: the agent is a frozen LLM planner calling fixed low-level primitives, so adaptation “moves from parameter updates to memory access.” (arXiv:2606.25115) For a tool-using agent, the claim is that you do not need to touch the model at all. You need to decide which stored traces survive the next round.

That reframing matters because on-device agents cannot afford the parameter-update loop. LoRA-style adapters and replay buffers ship gradients, and their compute and energy costs, to whatever chip is running the agent, and the round-trips back to a server are exactly what an on-device deployment is trying to avoid. CURATOR’s bet is that a well-governed memory store substitutes for most of what fine-tuning bought, at a fraction of the cost.

CURATOR scores every stored memory entry by net value per byte: expected benefit minus expected harm (including provenance risk), normalized by the storage it consumes. That single ruler then governs three decisions. KEEP evicts low-value bytes when RAM or energy pressure rises. SHARE sends an insight to a peer only when its value exceeds the uplink cost of transmitting it. TRUST gates incoming peer entries by their provenance before admitting them. (arXiv:2606.25115)

The unification is the point. Three engineering problems that usually get three separate heuristics (an LRU cache, a sync policy, an allowlist) collapse into one value computation. That also makes the tradeoffs visible: a byte kept is a byte not available for something else, and a byte shared costs real bandwidth.
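The single-ruler idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the MemoryEntry fields, the greedy packing in keep, the thresholds, and the example numbers are all hypothetical, and the paper's exact scoring formula is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    # All fields are hypothetical illustration values, not from the paper.
    name: str
    benefit: float          # expected benefit of keeping the entry
    harm: float             # expected harm if the entry is wrong or stale
    provenance_risk: float  # extra expected harm from an untrusted source
    size_bytes: int


def net_value_per_byte(e: MemoryEntry) -> float:
    """Expected benefit minus expected harm (incl. provenance risk), per byte."""
    return (e.benefit - e.harm - e.provenance_risk) / e.size_bytes


def keep(entries, budget_bytes):
    """KEEP: retain the highest-scoring entries that fit; evict the rest.

    Greedy packing is an assumption made for this sketch.
    """
    kept, used = [], 0
    for e in sorted(entries, key=net_value_per_byte, reverse=True):
        if net_value_per_byte(e) > 0 and used + e.size_bytes <= budget_bytes:
            kept.append(e)
            used += e.size_bytes
    return kept


def share(e: MemoryEntry, uplink_cost_per_byte: float) -> bool:
    """SHARE: send only when per-byte value exceeds per-byte uplink cost."""
    return net_value_per_byte(e) > uplink_cost_per_byte


def trust(e: MemoryEntry, max_provenance_risk: float) -> bool:
    """TRUST: admit a peer entry only if its provenance risk is acceptable."""
    return e.provenance_risk <= max_provenance_risk


entries = [
    MemoryEntry("insight", benefit=8.0, harm=0.5, provenance_risk=0.0, size_bytes=200),
    MemoryEntry("raw_trace", benefit=3.0, harm=2.0, provenance_risk=0.0, size_bytes=4000),
    MemoryEntry("peer_tip", benefit=9.0, harm=0.5, provenance_risk=10.0, size_bytes=250),
]

kept_names = [e.name for e in keep(entries, budget_bytes=1000)]
print(kept_names)
```

With these made-up numbers, only the compact insight survives a 1000-byte budget: the raw trace is too large for its value, and the peer tip scores negative once provenance risk is subtracted. The same score function feeds all three decisions, which is the unification described above.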

What numbers does the paper report?

Across language-model-agent task-drift benchmarks and a real heterogeneous Jetson testbed with two robot-arm nodes and a hub, CURATOR reduces memory by 2.7× and uplink by 2.4×, drives injection success from 0.75 to zero, and raises accuracy on cases corrupted by poison or stale memory. (arXiv:2606.25115) The paper frames the combined result as reducing footprint, energy, uplink, and injection success together without reducing accuracy.

The security result is the more striking one. Injection attack success falls from 0.75 to zero, and the paper isolates why: a value-only filter leaves the attack success rate high because poisoned entries look useful, and only the provenance term pushes it down. (arXiv:2606.25115)

Why would forgetting improve an agent?

Because on a frozen backbone, the only thing that can go wrong with memory is keeping the wrong bytes. A raw trajectory kept verbatim can carry negative value on hard cases, which is why CURATOR distills such traces into abstract insights and scores them by net value rather than retaining them as-is. (arXiv:2606.25115) Forgetting, in this framing, is not loss. It is the compression step that turns a bloated replay into a useful rule.

This is the result that is easy to misread. The “forgetting improves accuracy” headline is conditional on two things the paper is explicit about: the planner is frozen, and the score includes provenance. Strip either condition and the win disappears. A team that fine-tunes weights, or that keeps everything by recency or raw success, will not reproduce it.

Why is provenance load-bearing?

Because usefulness is exactly what a poisoned entry mimics. A larger memory widens the attack surface rather than closing it: every stored entry the agent writes is a place where an attacker can plant something the retrieval path will surface. The naive intuition that more memory means more robustness is backwards here.

That is why CURATOR’s TRUST gate is not a nice-to-have. A value-only score ranks a well-crafted malicious entry near the top, because malicious entries are designed to look maximally useful. Only by tracking where an entry came from, and penalizing untrusted provenance, does CURATOR drive the attack success rate from 0.75 to zero. (arXiv:2606.25115) Any production port that drops the provenance term for latency inherits the 0.75 attack success rate as a baseline.
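A toy comparison makes the argument concrete. The sketch below assumes a hypothetical scoring function and invented numbers; it is not the paper's scoring code. It only shows why a filter that ignores provenance ranks a poisoned entry first, while subtracting a provenance penalty pushes the same entry below the admission line.

```python
def value_only_score(entry):
    """Apparent usefulness only: benefit minus visible harm."""
    return entry["benefit"] - entry["harm"]


def provenance_aware_score(entry, risk_weight=1.0):
    """Same score with a penalty for untrusted origin (hypothetical weight)."""
    return entry["benefit"] - entry["harm"] - risk_weight * entry["provenance_risk"]


# Invented example entries. The poisoned entry is crafted to look maximally
# useful, so its apparent benefit is high and its visible harm is zero.
entries = [
    {"name": "local_insight", "benefit": 6.0, "harm": 0.5, "provenance_risk": 0.0},
    {"name": "trusted_peer_tip", "benefit": 5.0, "harm": 0.5, "provenance_risk": 0.5},
    {"name": "poisoned_entry", "benefit": 9.0, "harm": 0.0, "provenance_risk": 12.0},
]


def top_entry(entries, score):
    """Name of the entry that retrieval would rank first under this score."""
    return max(entries, key=score)["name"]


def admitted(entries, score):
    """Names of entries with a positive score, i.e. those allowed into memory."""
    return [e["name"] for e in entries if score(e) > 0]


print(top_entry(entries, value_only_score))
print(top_entry(entries, provenance_aware_score))
print(admitted(entries, provenance_aware_score))
```

Under the value-only score the poisoned entry wins the ranking. Under the provenance-aware score it goes negative and is never admitted, while the moderately trusted peer tip still gets in.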

How does CURATOR sit next to BudgetMem, All-Mem, and Metis?

CURATOR is one answer in a crowded June 2026 cluster, and the answers differ on a real axis: do you evict, tier, recover, or split your memory? CURATOR evicts by net value. BudgetMem (accepted to ICML 2026) instead structures runtime memory into modules, each offered in three tiers (Low/Mid/High), with a compact RL-trained neural router performing query-aware budget-tier routing across LoCoMo, LongMemEval, and HotpotQA. All-Mem casts lifelong memory as a budgeted maintenance-and-recovery problem with non-destructive Split/Merge/Update operators on a topology graph, preserving immutable evidence for traceability. Metis shows that text memory and code memory have complementary trade-offs; on AppWorld its hierarchical dual-representation memory improved task accuracy by up to 20.6% over ReAct while cutting execution cost by up to 22.8%.

System	Governance strategy	What it optimizes
CURATOR	Net-value eviction	Footprint vs. accuracy, with provenance as the security gate
BudgetMem	Query-aware tier routing	Per-query cost via three-tier modules
All-Mem	Non-destructive topology recovery	Traceability and reversibility over compactness
Metis	Dual text/code representation	Accuracy and cost via complementary stores

CURATOR’s distinctive move is that eviction and trust are the same decision. BudgetMem and All-Mem both keep everything and spend compute deciding what to surface; Metis changes the representation, not the retention policy. Of the four, CURATOR is the only one that throws bytes away by net value and treats that act as both a cost win and a security win.

A complementary thread comes from the cognitively-grounded value model in arXiv:2606.12945, which scores memory across seven factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, usage history). On LongMemEval it retained 0.770 of gold evidence versus 0.368 for recency. (arXiv:2606.12945) That result sits on a different benchmark and is not directly comparable to CURATOR’s footprint-accuracy numbers, but it reinforces the same underlying claim: recency is a poor proxy for value, and a richer value function recovers more signal per byte.

Why does on-device adaptation change personalization economics?

Because the adaptation loop is the expensive loop, and CURATOR keeps it off the network. Every round of server-side fine-tuning is a round-trip: data off the device, gradients computed in a datacenter, weights shipped back. CURATOR’s per-round uplink cut of 2.4× (arXiv:2606.25115) moves the cost of personalization onto the device rather than the network. If the device curates its own memory, the server’s role shrinks from adaptation provider to model-and-primitive supplier.

This lands hard against the edge’s actual constraints. On an Apple M4 Pro with a 10.2 GB cache budget, only three agents fit at 8K context in FP16, and Q4 persistence reduces time-to-first-token by up to 136×. (arXiv:2603.04428v1) Memory budgeting on the edge is not an optimization; it is the difference between running one agent and running several. A governance rule that holds accuracy while cutting footprint by 2.7× is, in that regime, the difference between a personalization feature shipping or not.

There is a second-order data consequence. If adaptation happens on-device and the raw traces never leave, the privacy story is straightforward, but a new failure mode appears: the same budget pressure that improves accuracy forces an explicit forgetting decision, and there is no longer a server-side audit log of everything the agent ever saw. Forgetting becomes a security and compliance property, not just an efficiency one.

What are the limits and open questions?

The headline numbers are bounded by their setup. CURATOR’s 2.7× memory and 2.4× uplink cuts were measured on task-drift benchmarks and one Jetson testbed, not on phone-scale workloads with real user data. (arXiv:2606.25115) The “forgetting improves the agent” result holds only for a frozen planner; the paper does not show what happens if the backbone is also being fine-tuned, and the interaction between memory curation and weight updates is an open question. The comparison to BudgetMem, All-Mem, Metis, and the seven-factor value model is apples-to-oranges, since each is evaluated on a different suite with different metrics.

The honest read is that CURATOR makes a specific, narrow claim extremely well: for a frozen tool-using agent under a hard memory budget, net-value eviction with a provenance term beats keep-all on footprint, energy, uplink, accuracy, and security simultaneously. The broader claim, that this is how all on-device agents should learn forever, is what the next paper has to prove.

Frequently Asked Questions

How few poisoned entries does it take to compromise a keep-all agent?

In retrieval-augmented memory, as few as five crafted entries can raise the injection attack success rate to 90 to 99 percent, and a larger store widens the attack surface rather than closing it. Keep-all retention is itself the vulnerability that CURATOR’s net-value eviction plus provenance term is engineered against.

How does CURATOR’s accuracy compare to a full-memory oracle?

On the task-drift benchmarks CURATOR reaches 97 percent of a full-memory oracle’s accuracy at 37 percent of its footprint, alongside the 2.4× per-round uplink cut. That figure is measured on CURATOR’s own suite, separate from BudgetMem’s tier-routing scores on LoCoMo or All-Mem’s recovery metrics.
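The 37 percent footprint figure and the 2.7× memory reduction are two views of the same ratio, and the conversion is simple arithmetic. The helper name below is made up for illustration; the only inputs are the reduction factors quoted above.

```python
def remaining_fraction(reduction_factor: float) -> float:
    """Fraction of the baseline left after an N-times reduction (1 / N)."""
    return 1.0 / reduction_factor


memory_fraction = remaining_fraction(2.7)  # share of keep-all memory still used
uplink_fraction = remaining_fraction(2.4)  # share of keep-all uplink still used

print(f"memory: {memory_fraction:.2f} of baseline")
print(f"uplink: {uplink_fraction:.2f} of baseline")
```

A 2.7× cut leaves about 0.37 of the baseline, consistent with the 37 percent footprint quoted above, and a 2.4× uplink cut leaves about 0.42.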

What is the quantified payoff from distilling a raw trajectory before storing it?

The authors report that a raw trajectory kept verbatim carried a net contribution of minus 26.1 on hard cases, and distilling the same trace into an abstract insight moved it to plus 3.3. That swing is the mechanism behind the forgetting-improves-accuracy result, and it is why CURATOR scores distilled insights rather than retaining transcripts.

Has CURATOR been validated beyond single-agent task-drift benchmarks?

The hardware validation is a heterogeneous Jetson testbed with two robot-arm nodes and a hub, where measured energy, memory, and uplink each fell to between 0.38 and 0.64 of the keep-all baseline under the same governance rule. That is a controlled robotics rig rather than a phone-scale multi-user deployment, so consumer-device personalization claims remain unproven.