Can You Rewind an AI Agent Mid-Run? Reversible Traces Say Yes

Yes. Shepherd, a 2026 arXiv preprint, records an agent’s execution as a reversible, Git-like trace, so a meta-agent can rewind the run to any prior step, edit the state, and replay only the suffix rather than restarting from scratch. The authors posted a v3 revision on 2026-06-24 with a refreshed Lean-mechanized calculus.

Why does restarting an agent cost so much?

The economics of long-horizon agents are dominated by the re-run tax. When a coding agent makes a bad tool call deep into a task (overwriting a file, deleting a branch, committing to the wrong abstraction), the standard fix is to start over. The context window refills from zero, the model re-derives the plan, and you pay for every token of the offending run a second time. On a multi-hour task that penalty compounds.

The deeper issue is structural. Current agent substrates treat execution as append-only: a log you can read and attribute blame to, but not a structure you can rewind and re-drive. Observability stacks, eval harnesses, and debugging tooling are all built around that assumption. Shepherd’s contention is that append-only is the problem itself, not a neutral default.

What is Shepherd, exactly?

Shepherd is a Python substrate, built at Northeastern and Stanford by Simon Yu, Derek Chong, Ananjan Nandi, Dilara Soylu, Jiuding Sun, Christopher D. Manning, and Weiyan Shi, that treats an agent’s execution as a first-class object a meta-agent can inspect, fork, revert, and rewrite (arXiv:2605.10913). The mental model is Git, applied to agent-environment state rather than source code: every model action, tool call, and environment change becomes a commit; every fork becomes a branch; any past state can be checked out as a byte-identical copy.
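The Git analogy can be made concrete with a small persistent history. The following is a minimal sketch, not Shepherd’s API: the names Commit, Trace, commit, checkout, and fork are hypothetical, the state is a plain dictionary, and a deep copy stands in for the byte-identical checkout the paper describes.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Commit:
    """One recorded step: an immutable snapshot linked to its parent."""
    step: int
    action: str
    state: dict[str, Any]
    parent: Optional[Commit] = None


class Trace:
    """A branchable history (hypothetical names, for illustration only)."""

    def __init__(self, head: Optional[Commit] = None) -> None:
        self.head = head

    def _find(self, step: int) -> Commit:
        # Walk parent links from the head until the requested step is found.
        node = self.head
        while node is not None and node.step != step:
            node = node.parent
        if node is None:
            raise KeyError(f"no commit at step {step}")
        return node

    def commit(self, action: str, state: dict[str, Any]) -> Commit:
        step = 0 if self.head is None else self.head.step + 1
        # Deep-copy so later mutation of the live state cannot alter history.
        self.head = Commit(step, action, copy.deepcopy(state), self.head)
        return self.head

    def checkout(self, step: int) -> dict[str, Any]:
        # Return a private copy of a past state.
        return copy.deepcopy(self._find(step).state)

    def fork(self, step: int) -> Trace:
        # A branch is a new head pointing at an existing commit; history is shared.
        return Trace(self._find(step))


main = Trace()
live = {"files": {"app.py": "v1"}}
main.commit("write app.py", live)
live["files"]["app.py"] = "broken"
main.commit("overwrite app.py", live)

branch = main.fork(0)            # rewind to before the bad write
fixed = branch.checkout(0)
fixed["files"]["app.py"] = "v1-patched"
branch.commit("patch app.py", fixed)
```

The original line of history is untouched by the branch: both heads share the step-0 commit as an ancestor, which is the property that makes forking cheap in a persistent structure.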

Under that model sit four primitives (Task, Effect, Scope, and Trace), each grounded in a functional-programming construct (Shepherd v3). The calculus behind them is mechanized in Lean. That is not decoration. A formal substrate is what lets the authors claim byte-identical checkout and replay without hand-waving about state consistency; the proof obligation is explicit.

Can a meta-agent improve a run, or only record it?

It can improve it, and the paper demonstrates that three ways. Each meta-agent attacks a different failure mode of long-horizon agents.

Runtime supervision. A meta-agent watches parallel coding workers and intercepts before they conflict. On CooperBench, a pair-coding benchmark, this lifts pass rate from 28.8% to 54.7% (Shepherd abstract), roughly doubling it. The mechanism matters: the supervisor forks the workers back to a pre-conflict state and re-drives them rather than merely flagging the conflict. That is undo-as-control-flow, not undo-as-notification.
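To illustrate undo-as-control-flow, here is a toy supervisor that decides which worker to rewind and to which step. It is a sketch under stated assumptions: the source does not describe Shepherd’s conflict-detection policy, so the "two workers wrote the same path" rule, the event format, and the function name supervise are all hypothetical.

```python
from typing import Optional

# Each event is (step, worker, path written), ordered by step. Hypothetical format.
Event = tuple[int, str, str]


def supervise(events: list[Event]) -> Optional[tuple[str, int]]:
    """Return (worker_to_rewind, step_to_rewind_to) at the first conflict, else None."""
    owner: dict[str, str] = {}      # path -> worker that first wrote it
    last_safe: dict[str, int] = {}  # worker -> its most recent non-conflicting step
    for step, worker, path in events:
        holder = owner.get(path)
        if holder is not None and holder != worker:
            # Conflict: the later writer is sent back to its last safe step
            # (-1 means "before its first step") instead of merely being flagged.
            return worker, last_safe.get(worker, -1)
        owner[path] = worker
        last_safe[worker] = step
    return None


events = [
    (0, "A", "api.py"),
    (1, "B", "db.py"),
    (2, "A", "models.py"),
    (3, "B", "api.py"),  # B touches a file A already owns
]
decision = supervise(events)
```

The return value is an instruction to rewind rather than a warning, which is the distinction the paragraph above draws between control flow and notification.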

Counterfactual replay. A meta-optimizer branches a run at the first commit where a proposed edit changes behavior, then replays only the affected suffix. On Terminal-Bench 2.0 it outperforms MetaHarness by 12.8% with 58% lower wall-clock time (Shepherd abstract). The earlier framing reported gains of up to 11 points across four benchmarks (Hugging Face paper page). The point of branching at the first behavior-changing commit is causal isolation: replay only the suffix the edit could have touched, and you measure the edit’s actual contribution instead of re-running noise.
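The branch-at-first-divergence idea can be sketched with a toy deterministic agent. Everything here is an assumption made for illustration: states and actions are integers, the environment just adds the action to the state, and the helper names record, first_divergence, and replay_suffix are hypothetical rather than taken from the paper.

```python
from typing import Callable, Optional

Policy = Callable[[int], int]  # maps a state to an action (toy model)
Step = tuple[int, int]         # (state before the action, action taken)


def record(policy: Policy, start: int, n_steps: int) -> list[Step]:
    """Run the policy and record each (state, action) pair."""
    trace: list[Step] = []
    state = start
    for _ in range(n_steps):
        action = policy(state)
        trace.append((state, action))
        state = state + action  # toy environment transition
    return trace


def first_divergence(trace: list[Step], edited: Policy) -> Optional[int]:
    """Index of the first recorded step where the edited policy would act differently."""
    for i, (state, action) in enumerate(trace):
        if edited(state) != action:
            return i
    return None


def replay_suffix(trace: list[Step], edited: Policy) -> tuple[list[Step], int]:
    """Keep the unaffected prefix and re-run only from the first divergence."""
    k = first_divergence(trace, edited)
    if k is None:
        return list(trace), 0  # the edit changes nothing, so nothing is replayed
    suffix = record(edited, trace[k][0], len(trace) - k)
    return trace[:k] + suffix, len(suffix)


original = record(lambda s: 1, start=0, n_steps=6)
new_trace, replayed = replay_suffix(original, lambda s: 1 if s < 3 else 2)
```

Because the edited policy agrees with the recording on every step before the divergence, the prefix is reused as-is and only three of the six steps are re-executed. That reuse is valid only when the agent and environment are deterministic over the prefix, which is an assumption of this sketch.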

Tree-RL training. A trainer forks rollouts at meta-agent-chosen turns to improve credit assignment. On Terminal-Bench 2.0, training Qwen3.5-35B-A3B with Tree-RL lifts performance from 34.2% to 39.4%, doubling GRPO’s uplift on the benchmark (Hugging Face paper page; Shepherd abstract). Reinforcement learning struggles to assign credit across long rollouts because an end-of-run reward is ambiguous about which turn earned it; branching at chosen turns gives the trainer finer-grained material to learn from.
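One way to see why forking helps credit assignment is to compare sibling branches that share a prefix. The sketch below is an illustration of that intuition, not the paper’s Tree-RL algorithm: it uses a simple mean-centered, group-relative baseline over siblings, and the function names are hypothetical.

```python
from statistics import mean


def branch_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sibling branch relative to the mean of its group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def per_turn_credit(n_turns: int, fork_turn: int, advantage: float) -> list[float]:
    """Assign the branch advantage only to turns at or after the fork.

    Siblings share every turn before the fork, so a reward difference between
    them cannot be attributed to those shared turns.
    """
    return [0.0 if t < fork_turn else advantage for t in range(n_turns)]


# Four rollouts forked from the same state at turn 3 of a 5-turn task.
advantages = branch_advantages([1.0, 0.0, 0.0, 1.0])
credit = per_turn_credit(n_turns=5, fork_turn=3, advantage=advantages[0])
```

With a single end-of-run reward and no fork, every turn would receive the same signal. Forking narrows the blame to the turns that actually differ between branches.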

Why does KV-cache reuse matter?

Cheap replay is the economic mechanism that matters, and it rests on two self-reported claims. Shepherd forks agent-environment state roughly 5x faster than docker commit, and reuses over 95% of the LLM provider’s KV cache on replay (Shepherd v3).

The cache number is the load-bearing one. Replay in a naive implementation means re-feeding the entire prompt history to the model and paying for every token again. If 95% of the KV cache survives a fork-and-replay, replaying the suffix costs a fraction of a fresh run rather than the full price. That is what makes the debugging workflow viable: rewind to step 47, edit the state, re-run from step 48 onward, and pay only for the suffix.
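A back-of-the-envelope model makes the arithmetic explicit. This is a sketch under assumptions, not a measurement: it treats cached prefix tokens as free, counts only tokens the model must process from scratch, and uses invented token counts. Real provider pricing usually discounts cached tokens rather than zeroing them.

```python
def replay_tokens(prefix_tokens: int, suffix_tokens: int, cache_reuse: float) -> float:
    """Tokens processed from scratch when replaying a suffix after a fork.

    cache_reuse is the fraction of the prefix served from the KV cache
    (0.0 means a fully cold replay, 1.0 means the whole prefix is cached).
    """
    if not 0.0 <= cache_reuse <= 1.0:
        raise ValueError("cache_reuse must be between 0 and 1")
    return prefix_tokens * (1.0 - cache_reuse) + suffix_tokens


# Hypothetical run: 100k tokens of history before the edit, 20k tokens after it.
fresh = replay_tokens(100_000, 20_000, cache_reuse=0.0)
warm = replay_tokens(100_000, 20_000, cache_reuse=0.95)
ratio = warm / fresh
```

Under these made-up numbers the warm replay processes about 25,000 tokens against 120,000 for a cold one, roughly a fifth of the work. The ratio depends heavily on how long the suffix is relative to the prefix.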

How is this different from Docker, OpenHands, or Saga-style rollback?

Rollback tooling in 2026 is converging on three production layers, and Shepherd sits above all of them rather than inside one.

The distinction is scope and intent. Infrastructure-layer checkpointers (Docker commits, container snapshots, filesystem copy-on-write) undo destructive actions: restore the file, roll back the container, revert the commit. Undo operators reverse an individual agent action under a no-regression guarantee. Compensation patterns undo tool calls that have already hit external systems. All three answer “how do you undo one bad action in production.”

Shepherd answers a different question: how do you make the entire execution a programmable, branchable object a meta-agent can reason over? Undo-one-action restores a single state; Shepherd exposes the whole run. The existing substrates Shepherd positions itself against expose only transcripts and environment snapshots, forcing meta-agents to build ad hoc tooling rather than getting observe/fork/revert/modify as primitives (Shepherd abstract).

This is also where Shepherd separates from step-attribution research. Attribution work finds the broken step in a trace but cannot replay past it: you know which call went wrong, and then you restart anyway. Reversible traces close that gap. Find the step, rewind to it, edit the state, re-drive.

What breaks when execution stops being append-only?

A surprising amount of current tooling assumes a run is a write-once log. Observability stacks index traces forward-only. They show you what happened, not what would have happened if a state edit had landed at step 47. Eval harnesses score a run as a single linear trajectory; they have no vocabulary for “branch here, replay there, score both.” Debugging workflows are built around reading the log and restarting. All of these inherit the append-only assumption, and reversible traces quietly invalidate it.

The second-order effect lands on cost and iteration speed for agent developers. If replay is cheap, which is the whole bet, then debugging a long-horizon agent stops being a full re-run each time. You can try five counterfactual edits at step 47 and pay for five suffixes, not five complete runs. That changes how teams build, test, and tune agents. It also changes how they attribute failures: instead of “this run failed, here’s the bad step,” you get “this run failed, here’s the bad step, and here are three edits that fix the suffix.”

That is the practical claim, and it is contingent on the cache-reuse number holding up outside the authors’ setup. If 95% reuse is real and generalizes, the debugging economics shift. If it degrades on longer contexts or larger states, Shepherd is still a cleaner substrate than reinventing fork/revert per agent, but the headline cost story weakens.

Is this ready to use, or a research artifact?

Research artifact, clearly. The v3 revision landed on 2026-06-24 with a refreshed Lean-mechanized calculus and tightened benchmark claims (Shepherd v3); this is a preprint, not a shipped product. The benchmarks run against academic baselines (MetaHarness, GRPO) on standard evals (CooperBench, Terminal-Bench 2.0), and the key performance numbers are self-reported.

What travels now is the framing. The claim that agent execution should be reversible and programmable rather than append-only and log-shaped is durable regardless of whether Shepherd specifically wins. The infrastructure vendors building copy-on-write snapshots and undo operators are converging on the same intuition from the production side. The piece that is largely unclaimed is the practitioner angle: reversible traces change the cost of debugging long-horizon agents, not just the cost of undoing a bad tool call. That is the gap this substrate is built to fill.

Frequently Asked Questions

What are Shepherd’s four primitives, and which functional-programming constructs back them?

Shepherd names them Task, Effect, Scope, and Trace, each mapped to a functional-programming construct. Task is a typed function for modifying agent behavior, Effect is an algebraic effect for observing and intercepting, Scope is a scoped handler for fork, and Trace is a persistent data structure for revert and replay. Mapping meta-agent operations onto well-studied FP constructs is also what makes the Lean mechanization tractable: the proof obligation reduces to showing these four compose, rather than verifying an ad hoc state model.
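The "observe and intercept" role of an effect handler can be approximated in plain Python with generators. This is a loose illustration, not Shepherd’s implementation: generators give only one-shot resumption, a restricted form of algebraic effects, and the effect encoding, the toy agent, and the handler below are all hypothetical.

```python
from typing import Any, Callable, Generator

Effect = tuple[str, Any]  # (effect name, payload), a hypothetical encoding


def agent() -> Generator[Effect, Any, str]:
    """A toy agent that requests effects instead of performing them directly."""
    listing = yield ("tool", "ls")
    result = yield ("tool", "rm -rf build")
    return f"{listing}|{result}"


def run(gen: Generator[Effect, Any, str], handler: Callable[[Effect], Any]) -> str:
    """Drive the agent, letting the handler answer every effect it raises."""
    try:
        effect = next(gen)
        while True:
            effect = gen.send(handler(effect))
    except StopIteration as stop:
        return stop.value


log: list[Effect] = []


def guarded(effect: Effect) -> str:
    log.append(effect)                 # observe: every effect is recorded
    _name, payload = effect
    if payload.startswith("rm "):
        return "blocked"               # intercept: the call never executes
    return f"ran {payload}"


outcome = run(agent(), guarded)
```

The agent code never changes; swapping the handler changes what its tool calls do, which is the separation that makes supervision possible without editing the agent itself.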

How does Shepherd differ from Replit’s snapshots or IBM STRATUS?

An April 2026 vendor-adjacent survey names the three production layers: Replit’s Snapshot Engine pairs copy-on-write filesystem checkpointing with Neon database branching, IBM STRATUS (NeurIPS 2025) adds undo operators under a Transactional No-Regression guarantee, and Saga-style compensating transactions handle state already pushed into external systems. Those layers exist to undo a single committed action. Shepherd operates one level up, treating the entire run as a branchable, programmable object rather than a checkpoint to restore.

Where does the 95% KV-cache reuse figure break down?

The number is measured on a 5.8 GB image with suffixes that share most of their prefix with the original run. Reuse collapses when the forked suffix diverges early, because each divergent token invalidates the cache from that point forward, and it also depends on the LLM provider exposing KV-cache persistence through a prompt-caching API. On long contexts with early divergence, replay approaches the cost of a fresh run, which undercuts the assumption the whole debugging-economics argument rests on.
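The sensitivity to the divergence point follows from prefix matching. The sketch below assumes the cache is reusable only up to the first differing token, which is how the paragraph above describes it. Token IDs are invented integers and reusable_fraction is a hypothetical helper.

```python
def reusable_fraction(original: list[int], replay: list[int]) -> float:
    """Fraction of the replay's tokens covered by the longest shared prefix."""
    if not replay:
        return 0.0
    shared = 0
    for a, b in zip(original, replay):
        if a != b:
            break  # everything after the first mismatch must be recomputed
        shared += 1
    return shared / len(replay)


original = list(range(100))
late_edit = original[:95] + [-1] * 5    # diverges at token 95
early_edit = original[:10] + [-1] * 90  # diverges at token 10

late = reusable_fraction(original, late_edit)
early = reusable_fraction(original, early_edit)
```

An edit near the end of the run keeps 95% of the tokens cacheable, while the same-sized context with an early edit keeps only 10%. The headline reuse figure therefore describes late-divergence replays, not replays in general.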

Can replay undo a tool call that has already hit an external system?

No. Replay reconstitutes agent-environment state inside the substrate, but a tool call that wrote to an external database, sent an email, or charged a card has already committed a side effect the trace cannot reverse. That gap is what Saga compensating transactions and STRATUS-style undo operators target at the infrastructure layer; Shepherd’s revert covers the agent’s internal trajectory, not mutations already absorbed by outside systems.
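For contrast, the compensation approach to external side effects looks like this. It is a generic sketch of the Saga pattern, not code from STRATUS or Shepherd: each step pairs an action with a compensating action, and the names run_saga, fail, and the event strings are hypothetical.

```python
from typing import Callable

Action = Callable[[], None]
SagaStep = tuple[Action, Action]  # (do, compensate)


def run_saga(steps: list[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Action] = []
    for do, compensate in steps:
        try:
            do()
        except Exception:
            for undo in reversed(completed):
                undo()  # a new forward action that offsets the earlier one
            return False
        completed.append(compensate)
    return True


events: list[str] = []


def fail() -> None:
    raise RuntimeError("inventory service unavailable")


ok = run_saga([
    (lambda: events.append("charge card"), lambda: events.append("refund card")),
    (lambda: events.append("send email"), lambda: events.append("send retraction")),
    (fail, lambda: None),
])
```

Nothing is rewound here: the charge and the email still happened, and the refund and retraction are additional external actions layered on top. That is the difference from reverting state inside the substrate.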

Which counterfactual replay numbers should practitioners cite?

The meta-optimizer gain varies by source: 12.8% over MetaHarness on Terminal-Bench 2.0 per the abstract, up to 11 points across four benchmarks per the Hugging Face paper page, and up to 27.5% across LiveCodeBench and Terminal-Bench 2.0 per the v3 HTML. LiveCodeBench itself does not appear in the abstract framing. Practitioners should cite per-benchmark figures with the source document, because the baselines, benchmarks, and wall-clock claims do not line up across the abstract, v3 HTML, and Hugging Face pages.