Do Retrieval Metrics Predict Tool-Use Agent Success? A Paper Says No

Standard retrieval metrics like recall@k do not reliably predict whether a tool-use agent picks the right action, according to a June 2026 arXiv paper. On tau-bench, clauses from a retriever with only 7% rank-1 recall still produced near-gold policy classification (0.58 vs 0.60 macro-F1), which suggests that a retriever that looks broken by the usual yardstick can still feed an agent the context it needs. The proxy and the outcome it stands in for are only loosely coupled, and the proxy can mislead in both directions: it understated the retriever on this benchmark, and elsewhere it can overstate one that ranks the right chunk too low to be read.

Why a better retriever doesn’t guarantee a better agent

The default assumption in RAG-backed agent tuning is that lifting retrieval recall lifts downstream task quality, and the tau-bench result is precisely the stress test that puts that link under strain. Teams measure the retriever because it is the cheap thing to measure. You hold out a set of queries with known gold chunks, and recall@k or nDCG runs offline, deterministically, in seconds. The policy head, the tool-selection step, and end-to-end task success are expensive because each evaluation requires running the full agent and labeling the outcome against a rubric. So the retriever metric drifts into the role of a stand-in for the number a team actually cares about: did the agent do the right thing.

The paper names this directly. Exact-match retrieval recall is “often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model” arXiv:2606.23937. The proxy is only as good as the correlation between two things: rank-1 presence of the exact gold clause, and the downstream decision being correct. That correlation can be weak, can be noisy, and can run in the opposite direction from what an engineer optimizing recall would assume. The tau-bench setup is a clean place to measure that correlation because the benchmark ships a designated gold policy clause per decision step, which lets the authors swap it out and watch what happens to the classifier.
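One way to check that coupling on your own data is to log, for each decision state, whether the gold clause landed at rank 1 and whether the downstream decision was correct, then correlate the two flags. The sketch below computes the phi coefficient (Pearson correlation for two binary variables) in plain Python. The function name and the toy data are illustrative assumptions, not anything taken from the paper.

```python
from math import sqrt


def phi_coefficient(rank1_hit, decision_correct):
    """Pearson correlation between two equal-length binary sequences.

    rank1_hit[i]        -> truthy if the gold clause was at rank 1 for state i
    decision_correct[i] -> truthy if the policy head chose the gold action
    """
    if len(rank1_hit) != len(decision_correct):
        raise ValueError("sequences must have the same length")

    # Build the 2x2 contingency table.
    n11 = n10 = n01 = n00 = 0
    for hit, correct in zip(rank1_hit, decision_correct):
        if hit and correct:
            n11 += 1
        elif hit and not correct:
            n10 += 1
        elif not hit and correct:
            n01 += 1
        else:
            n00 += 1

    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # One variable is constant, so the correlation is undefined.
        # Returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical log: one rank-1 hit in ten states, seven correct decisions.
toy_hits = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_correct = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(phi_coefficient(toy_hits, toy_correct))
```

A value near zero on real logs would indicate that rank-1 presence tells you little about decision correctness, which is the situation the paper describes.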

What the tau-bench experiment measured

On tau-bench’s airline domain, the exact governing policy clause was retrieved at rank 1 for only 7% of states, yet the Qwen2.5-3B classifier fed those retrieved clauses reached macro-F1 0.58, against 0.60 with the gold clauses arXiv:2606.23937. The gap is two points of macro-F1 despite a retriever that, by the rank-1-recall metric, is failing on 93% of states. The full spread of conditions is more telling than the headline pair:

Context fed to the 3B classifier	macro-F1	Notes
Gold policy clauses	0.60	Benchmark-designated correct clause
Top-ranked retrieved clauses	0.58	Rank-1 recall was only 7% of airline states
Mismatched-policy clauses	0.32	Wrong-domain policy as a control
No policy (baseline)	0.21	Classifier with no policy context at all

The two controls do the load-bearing work of the argument. Mismatched policy scores 0.32 and no policy scores 0.21, so retrieved context is not just a little better than nothing: it lifts a classifier from 0.21 to 0.58, which is most of the way to the gold-clause ceiling of 0.60 arXiv:2606.23937. A retriever that recall@k ranks as near-useless is, measured by the decision it feeds, supplying substantial policy signal.
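The "most of the way" claim can be made concrete by normalizing each condition against the no-policy floor and the gold-clause ceiling. The sketch below does that arithmetic with the four macro-F1 values from the table; the helper name is an illustrative choice, and the calculation uses point estimates only, so it ignores the uncertainty around each score.

```python
def fraction_of_gap_recovered(score, floor, ceiling):
    """Share of the floor-to-ceiling gap that a condition closes."""
    if ceiling <= floor:
        raise ValueError("ceiling must be greater than floor")
    return (score - floor) / (ceiling - floor)


# Macro-F1 values reported for the 3B classifier on the airline domain.
NO_POLICY = 0.21
MISMATCHED = 0.32
RETRIEVED = 0.58
GOLD = 0.60

retrieved_share = fraction_of_gap_recovered(RETRIEVED, NO_POLICY, GOLD)
mismatched_share = fraction_of_gap_recovered(MISMATCHED, NO_POLICY, GOLD)

print(f"retrieved clauses recover {retrieved_share:.0%} of the gap")
print(f"mismatched clauses recover {mismatched_share:.0%} of the gap")
```

On these point estimates the retrieved clauses close about 95% of the distance from no policy to gold (0.37 of 0.39), while the mismatched control closes about 28% (0.11 of 0.39).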

The experiment also isolates a second variable the retriever-only view ignores entirely. Under gold-policy conditioning, a compact structured representation of state improved macro-F1 over raw trajectories by 0.13 to 0.17 after tuning arXiv:2606.23937. How the state is represented to the classifier moves the decision by more than the two-point gap between retrieved and gold clauses. State representation, not retrieval rank, is doing real work in the downstream number, and recall@k has nothing to say about it.

When recall@k misleads: the ordering trap and the underestimate

Recall@k fails as an agent proxy in two distinct ways. It can report a retriever as broken when it is feeding the model useful context, and it can ignore rank order even when order is the thing that matters.

The tau-bench result is the first failure mode: the metric understates the retriever. Seven percent rank-1 recall alongside 0.58 macro-F1 means exact-match recall flagged the retriever as close to useless while the policy head found the signal anyway arXiv:2606.23937. The most likely mechanism, though the figures cited here do not isolate it, is that the top-ranked clause, even when it is not the single exact gold clause, still carries relevant policy content the classifier can use. The model tolerates a near-miss. The practical consequence is uncomfortable for anyone who has been tuning a retriever on recall: pushing rank-1 recall upward may move the metric a lot and the decision not at all. The authors’ phrasing is the careful one: exact-match clause recall can underestimate downstream policy utility. On this benchmark it is not merely a noisy proxy; it leans toward declaring the retriever worse at its job than the downstream evidence shows.

The second failure mode is structural and older than this paper. A known RAG-evaluation pitfall is high Recall@K coexisting with low MRR or nDCG: the system finds the right chunk but ranks it poorly, and recall alone ignores ordering Tencent’s RAG metrics primer. For an agent that reads a fixed top-k window, position is existence. The correct clause at rank 9, when the agent reads only a five-chunk context, is functionally the same as a clause that was never retrieved. Recall@10 counts it as a hit; the agent never sees it. This is why the primer pairs recall with MRR or nDCG rather than reporting it alone, and it is the same reason a pure-recall proxy can look healthy while the agent it feeds starves for the right clause in the right slot.
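The ordering trap is easy to reproduce with a single ranked list. The sketch below assumes one gold clause per state, in which case recall@k reduces to a hit-or-miss flag, and compares it with reciprocal rank and with what a fixed context window exposes. The function names, the clause ids, and the window size of five are illustrative assumptions.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the single gold clause appears anywhere in the top k, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, gold_id):
    """1 / rank of the gold clause (rank starts at 1), or 0.0 if it is absent."""
    for rank, clause_id in enumerate(ranked_ids, start=1):
        if clause_id == gold_id:
            return 1.0 / rank
    return 0.0


def visible_to_agent(ranked_ids, gold_id, window):
    """True only if the gold clause falls inside the context the agent reads."""
    return gold_id in ranked_ids[:window]


# Hypothetical ranking: ten clauses, with the gold clause sitting at rank 9.
ranking = [f"clause_{i}" for i in range(1, 11)]
gold = "clause_9"

print("recall@10:", recall_at_k(ranking, gold, k=10))
print("reciprocal rank:", reciprocal_rank(ranking, gold))
print("visible in a 5-chunk window:", visible_to_agent(ranking, gold, window=5))
```

Recall@10 reports a hit, the reciprocal rank of 1/9 flags the poor position, and the five-chunk window check shows the agent never receives the clause. Averaging reciprocal rank across states gives MRR.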

What to measure instead: put the policy head in the loop

The paper’s prescription is to stop scoring the retriever in isolation and instead measure policy signal with retrieved context fed into the classification loop arXiv:2606.23937. Evaluate on the signal you care about, which is correct action selection, not retrieval rank. This is where 2026 agent-evaluation practice has independently landed: measure task success, tool-selection quality, and policy or constraint satisfaction directly, treating retrieval quality as one input among several rather than a stand-in for agent quality Tom V Saji’s 2026 agent-evaluation guide. Functional metrics such as task completion rate and tool selection precision are what define reliability, not recall Adam Bernard’s agent evaluation reference.

A workable evaluation loop follows directly from the tau-bench design. Build a held-out set of decision-time states with gold actions. Run the retriever and the policy head together as a unit. Score macro-F1 or action accuracy end-to-end. Then vary the retriever as the independent variable while holding the policy head fixed, which isolates how much retrieval rank actually moves the decision. If swapping a weaker retriever for a stronger one lifts recall@k by twenty points and macro-F1 by one, you have just measured that your retriever tuning was optimizing a number the agent barely reads. That is the result the paper is warning teams to expect, and it is only visible when the policy head is inside the eval rather than outside it.
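The loop described above can be sketched in a few dozen lines. Everything named here is a placeholder assumption: `retrievers` maps a label to a callable returning `(clause_id, text)` pairs, `policy_head` is whatever fixed classifier you are evaluating, and each state is a dict with `state`, `gold_action`, and `gold_clause_id` keys. It is not the paper's harness, only one way to hold the policy head fixed while varying the retriever.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def sweep_retrievers(states, retrievers, policy_head, k=5):
    """Score each retriever twice: in isolation and through the policy head.

    states      -> list of dicts: {"state", "gold_action", "gold_clause_id"}
    retrievers  -> dict: name -> callable(state, k) -> list of (clause_id, text)
    policy_head -> callable(state, clauses) -> predicted action label
    """
    gold_actions = [s["gold_action"] for s in states]
    results = {}
    for name, retrieve in retrievers.items():
        hits = 0
        predictions = []
        for s in states:
            clauses = retrieve(s["state"], k)
            # Retriever-only view: was the exact gold clause in the top k?
            if any(cid == s["gold_clause_id"] for cid, _ in clauses):
                hits += 1
            # In-the-loop view: what did the fixed policy head decide?
            predictions.append(policy_head(s["state"], clauses))
        results[name] = {
            "recall_at_k": hits / len(states),
            "macro_f1": macro_f1(gold_actions, predictions),
        }
    return results
```

Comparing the two columns across retrievers shows directly whether a gain in recall@k is matched by a gain in macro-F1, or whether the retriever work is moving a number the policy head barely reads.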

What the paper does not claim

The defensible claim is narrower than the title implies, and reading past the caveats would turn a precise result into a slogan.

First, the retrieved-versus-gold comparison is not a proven equivalence. The task-cluster 95% confidence interval on the macro-F1 difference is [-0.23, +0.21] arXiv:2606.23937. The authors “do not detect a macro-F1 difference” in this configuration, but an interval that wide cannot establish non-inferiority. The honest reading is that retrieved clauses were not detectably worse than gold on this run, not that they are as good. A larger sample could narrow that interval and reveal a real gap, or could confirm the near-tie; either is consistent with the data shown.
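The distinction between "no difference detected" and "non-inferior" can be written as two separate checks on the same interval. The sketch below assumes the interval is on retrieved minus gold macro-F1, which the quoted figures do not spell out, and uses a hypothetical non-inferiority margin of 0.05; a real margin would have to be chosen before looking at the data.

```python
def difference_detected(ci_lower, ci_upper):
    """True if the confidence interval on the difference excludes zero."""
    return not (ci_lower <= 0.0 <= ci_upper)


def non_inferior(ci_lower, margin):
    """True if the interval rules out a loss larger than `margin`.

    Assumes the difference is (retrieved - gold), so a loss is negative.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    return ci_lower > -margin


# Interval reported in the paper for the macro-F1 difference.
CI_LOWER, CI_UPPER = -0.23, 0.21
MARGIN = 0.05  # hypothetical tolerance, not from the paper

print("difference detected:", difference_detected(CI_LOWER, CI_UPPER))
print("non-inferior at 0.05:", non_inferior(CI_LOWER, MARGIN))
```

Both checks come back negative: the interval contains zero, so no difference is detected, and it extends well past the margin, so non-inferiority is not established either. If the paper's sign convention is gold minus retrieved, the same conclusion follows from the upper bound of +0.21.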

Second, the result is benchmark- and model-specific. It uses tau-bench and Qwen2.5 classifiers at the 3B and 7B scale. The same qualitative pattern, retrieved clauses tracking gold clauses despite low rank-1 recall, appears with a second retriever and at 7B, but it “varies across fine-tuning configurations” arXiv:2606.23937. Generalization to other benchmarks, to frontier-scale policy heads, or to policy domains denser than airline rules is not demonstrated. A retriever feeding a model with less policy prior, or a domain where the exact clause is the only useful one, could behave very differently.

Third, and most consequential for the title: the setup is pre-action policy classification, a single decision step with a labeled correct policy. It is not end-to-end long-horizon task success. Full multi-step agent runs compound retrieval noise, because a wrong clause at step three can steer every subsequent step. The paper’s own framing is “policy signal,” not “task success.” So “retrieval metrics don’t predict agent success” is a reasonable one-line gloss of the direction, but the measured claim is about one decision step on one benchmark, not about whether a worse retriever sinks a forty-step trajectory. The methodological lesson survives that narrowing: do not proxy agent quality through retrieval recall. Measure the policy signal with the agent in the loop, and let the retriever’s contribution show up, or fail to, in the decision it actually produces.

Frequently Asked Questions

Does this result transfer to standard document-QA retrieval systems?

Probably not, although the paper does not test document QA. The likely mechanism is the classifier’s existing policy knowledge: the Qwen2.5 model still scores 0.21 macro-F1 with no policy context at all, which suggests it carries a prior that retrieved clauses reinforce rather than supply from scratch. A document-QA system over a corpus the model has never seen should stay more tightly coupled to recall@k, because the retrieved chunk is the answer rather than reinforcement of a prior.

In what domains would recall@k actually track the agent’s decision?

Domains where a near-miss clause is dangerous rather than tolerable. Medical contraindication lookup, legal statute retrieval, and tax-rule citation all punish the semantic adjacency the Qwen2.5 classifier appears to exploit: a clause that is 90 percent correct can flip the action, so a neighboring clause at rank 1 is not a near-miss the model can absorb. There the coupling between recall@k and the decision would likely tighten, and the paper’s underestimate would be less likely to appear.

What is the real cost of putting the policy head inside the eval loop?

The labor is in labeling decision-time states with gold actions, not in running the model. Recall@k runs offline against a fixed gold-chunk set in seconds; the policy-in-loop eval needs a held-out set of states with a human-judged correct action per state. tau-bench ships those labels free, but for a proprietary agent a team must build the labeled set before any sweep, which is the actual barrier to adoption.

How would the per-step near-tie change across a forty-step agent run?

Under an independence assumption, single-step errors compound multiplicatively, not additively. If per-step action correctness were independent across a forty-step run, even a two-point gap in per-step success would translate into a wide divergence in end-to-end task completion, because trajectory success approximates the product of step-level success rates. Macro-F1 is not itself a per-step success probability, so the size of that divergence cannot be read off the paper’s numbers. The per-decision near-tie therefore cannot be read as a trajectory-level guarantee, and task completion is the metric that would expose the gap.
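A quick calculation shows how fast a small per-step gap widens. The per-step success rates of 0.95 and 0.93 below are hypothetical, chosen only to be two points apart; they are not the paper's macro-F1 values, and the model assumes every step succeeds independently with the same probability.

```python
def trajectory_success(per_step_success, steps):
    """Probability that all steps succeed, assuming independent, equal steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be a probability")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


STEPS = 40
better = trajectory_success(0.95, STEPS)  # hypothetical stronger pipeline
worse = trajectory_success(0.93, STEPS)   # hypothetical pipeline two points lower

print(f"0.95 per step -> {better:.3f} over {STEPS} steps")
print(f"0.93 per step -> {worse:.3f} over {STEPS} steps")
print(f"ratio: {better / worse:.2f}x")
```

Under these assumptions a two-point per-step gap turns into roughly 13% versus 5% trajectory completion, a more than twofold difference. Real agent runs violate the independence assumption in both directions (errors can cascade or be recovered), which is one more reason to measure task completion directly.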