Can You Pinpoint Which Step Broke a Long-Horizon AI Agent?

SAFARI probes long agent runs to localize the failing step, beating prior methods by 20% and shifting triage cost from engineer hours to inference spend.

Barely. On the Who&When benchmark, the best prior method identifies which agent caused a failure 53.5% of the time but pinpoints the exact failing step just 14.2% of the time. Even reasoning models such as OpenAI’s o1 and DeepSeek’s R1 never reached practical usability on the task. A workshop paper posted to arXiv yesterday, SAFARI, argues the bottleneck is not model intelligence but the assumption that an operator should read the whole trace in the first place.

Why can’t you just read the whole trace?

You can, until the trace stops fitting. Multi-step agent trajectories now routinely exceed even the 1M-token context windows that frontier models expose, at which point the log physically cannot sit in the context you would use to inspect it. The standard debugging posture, loading the full trajectory into an LLM and asking it to find the bad step, breaks on exactly the runs most likely to fail: long runs are long because the agent took many steps, and long runs are where failures cluster.

The passive-inspection frame is brittle for a second reason, and it is structural rather than architectural. HORIZON’s analysis of 3,100-plus trajectories across four representative agentic domains, run on GPT-5 variants and Claude models, finds that long-horizon failure is a different kind of failure, not merely a lower success rate. As the horizon grows, failure modes shift and per-step error rates compound across dependent steps. The implication for anyone reading a log is uncomfortable: the cause is not necessarily the last step. An early error can propagate across many dependent steps, and reading the trace linearly biases you toward the end, where the symptoms are loud but the originating error need not be.

Most agent-debugging tooling shipped in the last two years operates on the same passive-inspection assumption: instrument the run, capture the trace, hand it to a human to read. These stacks make traces easier to collect and prettier to scan. They do not change the economics. The human is still doing the reading, and on a run past the context window the human is the hard ceiling on triage time. SAFARI’s bet is that the reading itself should be automated, not merely the collection.

How does SAFARI’s investigation loop work?

SAFARI, published at the Second Workshop on Agents in the Wild at ICML 2026 (arXiv:2606.24626, submitted 23 June 2026 by Chenyang Zhu, Jiayu Yao, Kushal Chawla, Youbing Yin, and Erin Babinsky), swaps the load-the-whole-trace posture for a tool-augmented diagnostic loop. An LLM is given a specialized toolbox for reading and searching segments of the trajectory rather than the full text, plus a persistent Short-Term Memory that carries reasoning across turns. The model iterates: pull the segments it suspects, update the hypothesis, probe again.

The design choice worth flagging is that diagnostic accuracy is decoupled from the model’s architectural context limit. A run can be ten times the context window and the model never holds it whole, because it does not need to. That is the move. Everything else is implementation detail.
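The loop described above can be sketched in a few lines. This is a minimal illustration, not SAFARI's implementation: the `TraceTools` and `ShortTermMemory` classes, the `investigate` function, and the sample trace are all hypothetical, and a keyword heuristic stands in for the LLM's judgment. What it shows is the shape of the idea: every read is bounded, and the hypothesis lives in memory rather than in one large context.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TraceTools:
    """Hypothetical toolbox: reads and searches slices of a trace, never all of it."""
    steps: List[str]
    max_steps_per_read: int = 3  # stand-in for a per-call context budget

    def read(self, start: int) -> Dict[int, str]:
        """Return at most max_steps_per_read steps beginning at index start."""
        end = min(start + self.max_steps_per_read, len(self.steps))
        return {i: self.steps[i] for i in range(start, end)}

    def search(self, needle: str) -> List[int]:
        """Return indices of steps containing needle, not their text."""
        return [i for i, text in enumerate(self.steps) if needle in text]


@dataclass
class ShortTermMemory:
    """Hypothetical memory: the current hypothesis plus notes carried across turns."""
    suspect: Optional[int] = None
    notes: List[str] = field(default_factory=list)


def investigate(tools: TraceTools, memory: ShortTermMemory,
                max_turns: int = 5) -> Optional[int]:
    """Probe, update the hypothesis, probe again. A keyword rule replaces the LLM."""
    hits = tools.search("ERROR")
    if not hits:
        return None
    memory.suspect = hits[0]
    memory.notes.append(f"symptom at step {memory.suspect}")

    for _ in range(max_turns):
        # Read only the bounded window immediately before the current suspect.
        start = max(0, memory.suspect - tools.max_steps_per_read)
        window = tools.read(start)
        earlier = [i for i, text in window.items()
                   if i < memory.suspect and "WARN" in text]
        if not earlier:
            break  # nothing upstream looks suspicious; keep the hypothesis
        memory.suspect = min(earlier)
        memory.notes.append(f"moved suspect upstream to step {memory.suspect}")
    return memory.suspect


# Illustrative trace; the contents are invented for this example.
trace = [
    "step 0: plan task",
    "step 1: WARN parsed date as MM/DD, source uses DD/MM",
    "step 2: fetch records",
    "step 3: WARN zero records matched filter",
    "step 4: summarise results",
    "step 5: ERROR final answer empty",
]
tools = TraceTools(trace)
memory = ShortTermMemory()
suspect = investigate(tools, memory)
print(suspect, memory.notes)
```

On this toy trace the loop starts from the loud symptom at the end and walks the suspect back to step 1 without ever reading more than three steps at once. A real system would replace the keyword rule with model calls and would need far richer tools.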

The shape will be familiar to anyone watching adjacent fault-localization work. A separate effort, LLM4FL, applies the same investigation pattern to ordinary code bugs: a three-agent Context-Extraction, Debugger, and Reviewer pipeline with graph-RAG navigation beat AutoFL by 18.55% Top-1 on Defects4J. The convergence is worth pausing on. Both code-fault and agent-trace diagnosis are moving away from “give the model everything” toward “give the model the tools to look.” That is the same arc that turned passive log readers into interactive debuggers in traditional software, and it lands on roughly the same tradeoff.

What did SAFARI actually score?

On two fixed-budget benchmarks, SAFARI clears prior state-of-the-art by 20% on Who&When (1M-token budget) and 19% on the TRAIL GAIA subset (25K-token budget), per the paper. The more revealing figure is precision in the overflow regime: SAFARI holds 0.58 precision when the target fault sits five times beyond the model’s native context window, the regime where a full-context evaluator fails outright because the trace will not fit.

How hard is the baseline it beats?

Hard enough that “20% over SOTA” needs the floor it stands on. Who&When, built from failure logs of 127 LLM multi-agent systems, exposes a sharp gap between two tasks that sound alike. The best prior method reaches 53.5% accuracy identifying the failure-responsible agent but only 14.2% pinpointing the failing step, and some methods score below random. Naming who broke it is nearly four times easier than naming the step where it broke.

Attribution task               | Best prior accuracy
-------------------------------|--------------------
Identify the responsible agent | 53.5%
Pinpoint the failing step      | 14.2%

That gap is why SAFARI is a paper at all. If agent-level attribution were the hard part, the field would already sit at a passable 53.5%. The step-level figure of 14.2% is where the work actually is, and it is also why o1 and DeepSeek R1 never reached practical usability: a reasoning model strong on the agent-level question does not automatically close the step-level gap. SAFARI’s contribution has to be read against that 14.2% baseline, not against 53.5%.

When does automated root-causing beat a senior engineer reading the log?

The paper does not answer this directly, but its framing forces the question, and it is the one a buyer should ask before any benchmark number. Passive observability scales the human cost roughly linearly with trace length. A 200-step run takes a senior engineer materially longer to triage than a 20-step run, and on a run past the context window there is no single-pass model read to shortcut the work, because the trace will not load into one window. SAFARI’s loop shifts that cost off the operator and onto inference spend. The model pays in tokens to do the reading the human would otherwise do.

Whether the inference bill is cheaper depends on three things the abstract does not pin down. First, volume: how many failed runs you triage per week, because the investigation setup amortizes across many runs and does not across one. Second, trace length: longer traces widen the human-time gap faster than they widen the inference gap, since the human reads sequentially and the model pays per token probed. Third, the seniority of the human you are displacing, because that is the hourly rate the inference spend is being measured against. For a team fielding many failed long-horizon runs weekly, a loop that localizes the step at 0.58 precision is almost certainly cheaper than pulling a senior engineer off other work to read a million tokens. For a one-off failure on a short trace, it almost certainly is not.
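The three factors can be put into a back-of-the-envelope comparison. The sketch below is an assumption-laden illustration, not a model from the paper: every field name and every number (reading speed, token price, setup cost, share of the trace probed) is hypothetical, and it ignores the cost of acting on a wrong localization.

```python
from dataclasses import dataclass


@dataclass
class TriageAssumptions:
    """All fields are hypothetical inputs chosen for illustration."""
    runs_per_week: int               # volume of failed runs to triage
    trace_tokens: int                # length of each failed trace
    engineer_hourly_rate: float      # USD per hour for the human displaced
    human_tokens_per_hour: float     # assumed sequential reading speed
    probe_fraction: float            # share of the trace the loop actually reads
    price_per_million_tokens: float  # USD, assumed inference price
    weekly_setup_cost: float         # fixed cost of running the tooling, USD


def human_weekly_cost(a: TriageAssumptions) -> float:
    """Human cost grows linearly with trace length and with volume."""
    hours_per_run = a.trace_tokens / a.human_tokens_per_hour
    return a.runs_per_week * hours_per_run * a.engineer_hourly_rate


def loop_weekly_cost(a: TriageAssumptions) -> float:
    """Loop cost is a fixed setup charge plus tokens probed per run."""
    tokens_per_run = a.trace_tokens * a.probe_fraction
    per_run = tokens_per_run / 1_000_000 * a.price_per_million_tokens
    return a.weekly_setup_cost + a.runs_per_week * per_run


# Scenario 1: many long failed runs each week.
busy_team = TriageAssumptions(
    runs_per_week=20, trace_tokens=1_000_000, engineer_hourly_rate=100.0,
    human_tokens_per_hour=50_000.0, probe_fraction=0.2,
    price_per_million_tokens=10.0, weekly_setup_cost=200.0,
)

# Scenario 2: a single failure on a short trace.
one_off = TriageAssumptions(
    runs_per_week=1, trace_tokens=20_000, engineer_hourly_rate=100.0,
    human_tokens_per_hour=50_000.0, probe_fraction=0.2,
    price_per_million_tokens=10.0, weekly_setup_cost=200.0,
)

for name, scenario in (("busy team", busy_team), ("one-off", one_off)):
    print(name, human_weekly_cost(scenario), loop_weekly_cost(scenario))
```

Under these made-up inputs the loop wins easily for the busy team and loses for the one-off, because the fixed setup cost has nothing to amortize across. The crossover point moves with every input, so the function matters more than the figures.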

What breaks when no single step is to blame?

The gap SAFARI does not address, and the one a skeptical reader should hold it to: the entire framing assumes the failure has a localizable culprit, one action at one step, identifiable by investigation. Plenty of agent failures resist that framing. A planning choice that compounds across fifty steps, an early error that silently corrupts state and only surfaces much later, a context-driven misstep that is nobody’s single fault. These are diffuse, multi-cause failures with no responsible step, only a responsible process.

HORIZON’s compounding-error result is what lifts this beyond a hypothetical. The benchmark finds that long-horizon degradation is “not merely additive,” that even a small per-step error rate compounds across dependent steps into near-systematic failure. If errors propagate that way, then the “failing step” a fault-attribution method returns may be the symptom rather than the originating cause, and single-step attribution is already an approximation. A 0.58 precision figure in the overflow regime tells you SAFARI is good at the failures that have a clean culprit. It does not tell you how it behaves on failures whose cause is distributed across many steps, and the abstract is silent on it.
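To see why a small per-step error rate becomes near-systematic failure, consider a deliberately simple model. The assumption here is mine, not HORIZON's: each step fails independently with the same probability, and a run succeeds only if every step does. The function name and the 2% rate are illustrative.

```python
def clean_run_probability(per_step_error: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent, identical steps."""
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_error) ** steps


# Illustrative horizons at an assumed 2% per-step error rate.
for horizon in (10, 50, 200):
    print(horizon, round(clean_run_probability(0.02, horizon), 3))
```

Even under this optimistic independence assumption, a 2% per-step error rate leaves less than a 2% chance of a clean 200-step run. Real agent steps depend on one another, so the first error also degrades what follows, which is the case where blaming a single step is least well defined.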

A definitional question sits underneath. “Attributable” assumes a step can be blamed. For a failure driven by accumulated drift, the defensible answer is that no single step is to blame and the system’s horizon itself is the bug. A fault-attribution method contracted to return a step will return one anyway, and a precision score cannot tell you whether that step is the cause or merely the nearest plausible one. Before any of these numbers get quoted as a capability, the gap between “attributable” and “true cause” is the one to press on.

Frequently Asked Questions

How does SAFARI differ from LangSmith or Arize for agent debugging?

LangSmith, Arize, and Helicone are observability stacks that capture and render agent traces for a human to read; SAFARI automates the reading itself by handing an LLM the tools to probe the trace. They sit at different layers, since SAFARI’s investigation loop consumes the same trace data those platforms collect rather than replacing them.

Does SAFARI apply to SWE-bench or τ-bench agent runs?

Not directly. SWE-bench and τ-bench are capability benchmarks that score whether an agent solved a task, not where it failed. SAFARI is measured on Who&When and TRAIL GAIA, which ship failure-attribution ground truth, so diagnosing a SWE-bench run would require constructing that attribution label yourself.

What does the TRAIL GAIA subset test that Who&When does not?

GAIA is a general-assistant benchmark of real-world reasoning tasks, and TRAIL wraps those tasks in agent-trajectory form, so the subset tests fault attribution on single-agent runs under a tight 25K-token budget. Who&When, built from 127 multi-agent system logs, stresses the inter-agent blame problem instead of deep single-agent traces.

What failure regime is Who&When itself biased toward?

Who&When is built from failure logs of 127 multi-agent systems, and its ground truth presumes each failure has an attributable agent and a specific failing step. That makes it a clean-culprit benchmark by construction, so SAFARI’s strong scores there do not certify how it behaves on failures where blame is genuinely distributed and no single step is the cause.

If a frontier model doubled its context window tomorrow, would SAFARI’s approach still matter?

Yes, because fitting the trace is only part of what the investigation loop buys you. The structural problem is separate: even with the full run in one window, a reader is tempted to weight recent steps heavily, so an originating error stays buried and the symptom gets blamed. Doubling the window loads more without addressing that bias.
