When an AI Agent's Tools Break, Can It Recover? A New Benchmark

ToolMaze, a new arXiv benchmark, shows LLM agents' recovery rates drop 37% when tools return corrupted data, exposing a gap in how agent reliability is measured.

Most agent benchmarks assume tools work. ToolMaze, submitted to arXiv on June 4, tests what happens when they don’t, and the results are unflattering: when tools return plausible-looking but corrupted data, agents’ recovery rates drop roughly 37%. The benchmark exposes a gap that task-completion scores on clean inputs were never designed to catch.

What ToolMaze tests that others don’t

Existing agent benchmarks, from SWE-bench to WebArena, measure whether an LLM can call the right tool and get the right answer under ideal conditions. ToolMaze flips the premise. It injects perturbations into tool outputs and measures whether the agent notices, replans, and recovers. The perturbations follow a 2×2 taxonomy: explicit failures (the API returns an error) versus implicit failures (the API returns corrupted data that looks valid), crossed with transient failures (temporary) versus permanent failures (persistent across retries) (arXiv:2606.05806).
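One way to picture the 2×2 taxonomy is as a wrapper that sits between the agent and a tool. The sketch below is illustrative only and is not the ToolMaze harness: `PerturbedTool`, `ToolError`, `transient_calls`, and the toy `get_price` and `halve_price` functions are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class Visibility(Enum):
    EXPLICIT = "explicit"  # the tool raises an error the agent can see
    IMPLICIT = "implicit"  # the tool returns valid-looking but wrong data


class Persistence(Enum):
    TRANSIENT = "transient"  # fault clears after a few calls
    PERMANENT = "permanent"  # fault persists across every retry


class ToolError(Exception):
    """Stand-in for an explicit failure such as an HTTP 500."""


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that injects one cell of the 2x2 grid into a tool."""
    tool: Callable[..., Dict[str, Any]]
    visibility: Visibility
    persistence: Persistence
    corrupt: Callable[[Dict[str, Any]], Dict[str, Any]]
    transient_calls: int = 1  # assumption: a transient fault lasts this many calls
    calls: int = 0

    def __call__(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        self.calls += 1
        faulty = (
            self.persistence is Persistence.PERMANENT
            or self.calls <= self.transient_calls
        )
        if not faulty:
            return self.tool(*args, **kwargs)
        if self.visibility is Visibility.EXPLICIT:
            raise ToolError("injected failure")
        # Implicit fault: same shape and field names, wrong content.
        return self.corrupt(self.tool(*args, **kwargs))


def get_price(symbol: str) -> Dict[str, Any]:
    """Toy tool used only for illustration."""
    return {"symbol": symbol, "price": 100.0}


def halve_price(response: Dict[str, Any]) -> Dict[str, Any]:
    """Toy corruption: keeps the schema, changes the value."""
    return {**response, "price": response["price"] / 2}
```

Wrapping `get_price` with `Visibility.IMPLICIT` and `Persistence.TRANSIENT` yields a tool whose first response is silently wrong and whose second is correct, which is the case a retry would fix only if the agent thought to retry at all.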

The distinction matters because production agentic workflows do not run on happy paths. APIs rate-limit, databases return stale reads, third-party services degrade silently. A benchmark that only tests tool-calling accuracy on clean data is measuring the wrong thing for anyone shipping agents to environments where infrastructure is imperfect.

The implicit-failure blind spot

The most striking finding from ToolMaze is not that agents fail when tools throw errors. Explicit failures are obvious: the API returns a 500, the agent knows something went wrong, and it can retry or switch strategies. The problem is implicit failures, where tools return data that is semantically corrupted but superficially plausible.

Under these conditions, the Perturbation Recovery Rate (PRR) drops approximately 37%, according to the ToolMaze paper. The mechanism is straightforward: agents exhibit systemic over-trust in tool outputs. If the response has the right shape and field names, the agent treats it as ground truth and builds downstream reasoning on top of a corrupted foundation. No error signal means no replanning trigger.

This is the failure mode that matters in production. A stale price feed, a partially updated database row, or a truncated API response does not come with a banner announcing itself. The agent that trusts without verifying is the agent that produces confidently wrong output.
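The gap between "right shape" and "right content" can be made concrete with a small sketch. It assumes a price-feed response with `symbol`, `price`, and `as_of` fields; those field names, the function names, and the thresholds are hypothetical choices for illustration, not anything prescribed by the paper.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def has_expected_shape(response: Dict[str, Any]) -> bool:
    """Schema-level check: the kind of test an over-trusting agent stops at."""
    return (
        isinstance(response.get("symbol"), str)
        and isinstance(response.get("price"), float)
        and isinstance(response.get("as_of"), datetime)
    )


def plausibility_issues(
    response: Dict[str, Any],
    last_price: float,
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # assumed freshness budget
    max_move: float = 0.2,                      # assumed max relative change
) -> List[str]:
    """Semantic checks that can flag a response the schema check accepts."""
    issues = []
    if now - response["as_of"] > max_age:
        issues.append("stale timestamp")
    if last_price > 0:
        move = abs(response["price"] - last_price) / last_price
        if move > max_move:
            issues.append("implausible price move")
    return issues
```

A response that is two hours old and half the last known price passes `has_expected_shape` yet returns two issues from `plausibility_issues`. A non-empty list is the replanning trigger that an implicit failure otherwise never produces.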

Scaling helps, but not enough

Larger models recover from tool failures better than smaller ones. That is expected. What is not expected is the rate at which the improvement arrives.

According to ToolMaze, agentic fault-tolerance improves with model scale at 3.66× the rate of basic task execution improvement. Stated from the other direction: the capability gap between “can this model use tools when they work?” and “can this model recover when they don’t?” widens as models get bigger, because replanning under uncertainty scales much more slowly than straightforward tool invocation. Neither model scaling nor prompting alone closes this gap.

The paper also reports that complex DAG-based task topologies exacerbate the problem. When an agent’s workflow involves dependencies between tool calls, a failure in one node can trap the agent in futile retry loops rather than prompting systematic replanning. The more interconnected the task graph, the harder recovery becomes.
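A minimal sketch of the alternative to a futile retry loop, assuming the workflow is available as a dependency map: cap retries per node, then mark the failed node and everything downstream of it as needing a new plan. The `run_dag` function and its arguments are hypothetical; `graphlib.TopologicalSorter` is from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Set, Tuple


def run_dag(
    deps: Dict[str, Set[str]],
    tools: Dict[str, Callable[[Dict[str, Any]], Any]],
    max_retries: int = 2,
) -> Tuple[Dict[str, Any], Set[str]]:
    """Run tools in dependency order with a bounded retry budget.

    deps maps each node to the set of nodes it depends on.
    Returns (results, needs_replan).
    """
    results: Dict[str, Any] = {}
    needs_replan: Set[str] = set()
    for node in TopologicalSorter(deps).static_order():
        # A node whose upstream failed cannot run; hand it to the replanner.
        if any(dep in needs_replan for dep in deps.get(node, ())):
            needs_replan.add(node)
            continue
        for _ in range(max_retries + 1):
            try:
                results[node] = tools[node](results)
                break
            except Exception:
                continue  # retry, but only within the budget
        else:
            # Budget exhausted: stop retrying and escalate instead.
            needs_replan.add(node)
    return results, needs_replan
```

The returned `needs_replan` set is the part a planner would act on. Independent branches of the graph still complete, so a single dead node does not stall the whole task.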

What this means for production agent reliability reporting

The ToolMaze findings suggest a basic accountability gap in how agent systems are evaluated and marketed. Pass rates on benchmarks like SWE-bench or WebArena say nothing about behavior under degradation. An agent that passes a clean benchmark at high rates but collapses when an API returns stale data is a different engineering proposition than those scores imply.

For teams shipping agentic workflows, the practical takeaway is to inject failures into evaluation pipelines. Run the same task suite with perturbed tool outputs and measure the delta. If the recovery rate is dramatically lower than the clean pass rate, the agent’s reliability ceiling is lower than reported metrics suggest. ToolMaze’s 2×2 taxonomy (explicit/implicit × transient/permanent) provides a workable starting framework for structuring those tests.
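Measuring that delta can be as simple as the sketch below. It assumes each run yields a mapping from task ID to a pass/fail boolean, and it computes recovery only over tasks the agent already solves on clean inputs. That definition is an assumption made here for illustration; the article does not give the paper's exact PRR formula.

```python
from typing import Dict


def robustness_report(
    clean: Dict[str, bool], perturbed: Dict[str, bool]
) -> Dict[str, float]:
    """Compare a clean run with a perturbed run of the same task suite."""
    clean_pass_rate = sum(clean.values()) / len(clean)
    # Assumption: recovery is judged only on tasks solved under clean conditions,
    # so a perturbed failure reflects the perturbation rather than the task.
    solvable = [task for task, passed in clean.items() if passed]
    if solvable:
        recovery_rate = sum(perturbed[task] for task in solvable) / len(solvable)
    else:
        recovery_rate = 0.0
    return {
        "clean_pass_rate": clean_pass_rate,
        "recovery_rate": recovery_rate,
        "delta": clean_pass_rate - recovery_rate,
    }
```

Running the report separately for each cell of the 2×2 taxonomy shows whether the drop is concentrated in implicit failures, as the ToolMaze results suggest it would be.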

A week of failure-mode benchmarks

ToolMaze is not an isolated data point. The same week saw a cluster of benchmarks that test AI systems under conditions most evaluations ignore.

PortBench, a full-pipeline benchmark for LLM-driven portfolio management, found that 90% of model-profile combinations failed to outperform a basic equal-weight allocation. Models that satisfied every procedural constraint still suffered catastrophic drawdowns under stress. The benchmark tests not whether an LLM can reason about finance, but whether that reasoning survives contact with volatile, real-world data.

SubtleMemory targets a different failure mode: relational memory in long-horizon agents. Across 1,522 evaluation instances spanning 10 long histories, current memory systems remained weak at fine-grained relational discrimination. Agents that can recall a fact but cannot distinguish which entity it attaches to in a long context are agents that will hallucinate relationships.

When Gradients Collide, accepted at ACL 2026, examines multi-objective prompt optimization for LLM judges. The paper identifies two separable failure modes. Optimization-time gradient dilution occurs when the gradient LLM must provide feedback on multiple criteria jointly, which cuts task focus by 59% (from 9.0 to 3.7 out of 10). Inference-time instruction interference occurs when single-objective optimized instructions are naively combined into a single prompt, which degrades Spearman’s rho from 0.305 to 0.220. Together, these constrain the design space for multi-objective judge optimization using textual feedback.
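The 59% figure is a relative drop, not a difference in points. The short check below reproduces it from the two reported scores and applies the same formula to the Spearman figures; the `relative_drop` helper is a name introduced here, and the second percentage is derived arithmetic rather than a number quoted from the paper.

```python
def relative_drop(before: float, after: float) -> float:
    """Fraction of the original value that was lost."""
    return (before - after) / before


# Task focus: 9.0 -> 3.7 out of 10, roughly a 59% relative drop.
task_focus_drop = relative_drop(9.0, 3.7)

# Spearman rho: 0.305 -> 0.220, roughly a 28% relative drop (derived here).
spearman_drop = relative_drop(0.305, 0.220)

print(f"task focus: {task_focus_drop:.1%}, Spearman rho: {spearman_drop:.1%}")
```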

These benchmarks share a structural pattern. They do not test whether AI systems work under ideal conditions; they test where they break. That shift, from capability measurement to failure-mode characterization, is where evaluation is heading. ToolMaze’s contribution is making that shift explicit for the tool-use layer, the point where agents touch external systems and where most real-world failures originate.

Frequently Asked Questions

Can chaos-engineering practices be applied to the agent-tool interface?

Yes. Netflix’s Chaos Monkey injects failures into production infrastructure to surface resilience gaps; ToolMaze applies the same principle to the agent-tool interface. Teams already running chaos experiments on their APIs could extend those fault-injection campaigns to cover implicit semantic corruption (plausible but wrong responses), not just latency spikes and error-rate surges. The 2×2 taxonomy maps roughly onto established fault-injection categories: explicit perturbations resemble kill/fault injections, while implicit ones resemble Byzantine faults where the component lies rather than fails.

What do the PortBench and ToolMaze results share structurally?

Both find that passing procedural checks predicts nothing about behavior under stress. PortBench models that satisfied every stated constraint still suffered catastrophic drawdowns in volatile markets. ToolMaze agents that call tools correctly on clean inputs collapse when those tools return corrupted data. The shared pattern: constraint satisfaction on happy paths is a poor proxy for robustness, and benchmarks that only measure the former overstate real-world dependability.

How do SubtleMemory and ToolMaze target different failure modes?

SubtleMemory tests whether agents correctly attribute recalled facts to the right entity across long contexts: a discrimination problem. ToolMaze tests whether agents detect corrupted tool outputs at all: a trust-calibration problem. An agent failing SubtleMemory will conflate entities in its response. An agent failing ToolMaze will build plans on fabricated data. Both produce confident errors, but the interventions diverge: better retrieval architectures for the former, output-validation scaffolding for the latter.

What would narrow the 3.66× scaling asymmetry between tool use and recovery?

The asymmetry persists because replanning under uncertainty requires a different capability profile than correct tool invocation. Narrowing it likely demands architectural changes beyond bigger models: dedicated verification sub-agents that cross-check tool outputs before the planner acts on them, or training regimens that penalize over-trust in external responses. The ToolMaze authors found that prompting strategies alone did not close the gap, suggesting the fix is structural rather than instructional.
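As one illustration of what a verification step might look like, the sketch below gates a numeric tool behind a second, independent source and raises when the two disagree. This is a hypothetical design, not something the ToolMaze authors propose or evaluate; `cross_checked`, `UnverifiedOutput`, and the tolerance are assumptions made here.

```python
from typing import Callable


class UnverifiedOutput(Exception):
    """Raised when two independent sources disagree beyond tolerance."""


def cross_checked(
    primary: Callable[..., float],
    secondary: Callable[..., float],
    rel_tol: float = 0.01,  # assumed acceptable relative disagreement
) -> Callable[..., float]:
    """Wrap a tool so its output reaches the planner only if a second source agrees."""

    def gated(*args: object) -> float:
        a = primary(*args)
        b = secondary(*args)
        scale = max(abs(a), abs(b))
        if scale > 0 and abs(a - b) / scale > rel_tol:
            # Disagreement becomes a visible error instead of silent bad data.
            raise UnverifiedOutput(f"sources disagree: {a} vs {b}")
        return a

    return gated
```

The design choice is to convert an implicit failure into an explicit one, since explicit failures are the kind agents already handle by retrying or switching strategies. The cost is a second call per tool use, and the check only helps when the two sources fail independently.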

Sources (4 cited)

  1. When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents (primary source, accessed 2026-06-07)
  2. PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management (primary source, accessed 2026-06-07)
  3. SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents (primary source, accessed 2026-06-07)
  4. When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges (primary source, accessed 2026-06-07)