Every tool call an agent makes costs latency, tokens, and a bet that the external API is up and returns what you expect. Most agent frameworks handle this decision with a prompt heuristic: “If you’re not sure, use the tool.” Two papers from June 2026 argue that learned policies do better, and that the hand-written rules most frameworks rely on are barely better than coin flips for the cases that matter most.
The cost of calling a tool you didn’t need
An agent that calls a search API on every turn is safe but slow. An agent that skips the call when it “knows” the answer is fast but wrong in the cases where confidence and correctness diverge. The tradeoff is not new, but it has become more visible now that most open-source agent frameworks gate tool calls with a prompt heuristic. That heuristic is usually a one-liner in the system message, tuned by hand and never revisited.
The real costs compound. A flaky API adds latency spikes. Token spend on tool-call scaffolding (formatting the request, parsing the response) can exceed the token cost of the agent’s reasoning. And in multi-step workflows, an unnecessary tool call early in the chain produces a context window full of irrelevant output that degrades later turns.
Why hand-tuned rules fail: AgentTrust’s evidence
AgentTrust, released on arXiv in June 2026, was not designed to study tool-gating specifically. It is a self-improving trust layer that decides per action (shell commands, cloud operations, tool calls) whether to allow, warn, block, or escalate. But its findings about deterministic rules are directly relevant to anyone relying on prompt heuristics to gate tool calls.
AgentTrust distinguishes between two threat categories. Lexical threats are fixed-signature, rule-decidable patterns: a shell command containing rm -rf /, a URL pointing to a known malicious domain. Semantic threats are intent-dependent: the same AWS CLI command that is benign in one context (aws s3 ls during a deployment pipeline) is malicious in another (during a data exfiltration attempt).
The paper’s hand-authored cloud rule pack, designed to catch threats across cloud operations, database access, observability, and supply chain actions, raised held-out accuracy from 48% to 56% overall, a gain of only 8 percentage points. On the semantic threat categories it produced no improvement at all: data_db stayed at 29%, observability at 59%, and supply_chain at 50% (arXiv:2606.08539). Deterministic rules cannot gate intent-dependent actions, because the same surface form is benign or malicious depending on context.
The implication for tool-gating: any rule-based policy that decides “call this tool when X” will handle the lexical cases (X is a straightforward condition) and miss the semantic ones (X depends on what the agent is trying to do and whether it actually needs help).
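The lexical/semantic split can be made concrete with a minimal sketch. The rule list, the `Action` type, and the `known-malicious.example` domain below are hypothetical illustrations, not AgentTrust’s actual rule pack. The point is only that a fixed-signature rule reads the command text and nothing else, so two actions with the same text get the same verdict regardless of intent.

```python
import re
from dataclasses import dataclass

# Hypothetical fixed-signature rules (assumption: not AgentTrust's rule pack).
LEXICAL_RULES = [
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),
    re.compile(r"https?://known-malicious\.example\b"),
]


@dataclass
class Action:
    command: str
    context: str  # what the agent is doing; the rules never read this field


def rule_verdict(action: Action) -> str:
    """Return 'block' if any fixed signature matches the command text, else 'allow'."""
    if any(rule.search(action.command) for rule in LEXICAL_RULES):
        return "block"
    return "allow"


wipe = Action("rm -rf /", context="any")
deploy = Action("aws s3 ls", context="deployment pipeline")
exfil = Action("aws s3 ls", context="data exfiltration attempt")

print(rule_verdict(wipe))    # block: a lexical threat with a fixed signature
print(rule_verdict(deploy))  # allow
print(rule_verdict(exfil))   # allow: same text as above, so same verdict
```

Adding more patterns does not help with the last case, because the distinguishing information lives in `context`, which a signature match never sees.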
PROVE: a learned policy with an efficiency penalty
PROVE (Synthesize and Reward), submitted to arXiv on June 2, 2026 by IBM Research, approaches the problem from the other direction. Instead of writing rules about when to call tools, it trains the model to learn the policy through reinforcement learning.
The setup: four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) trained with GRPO on roughly 13,000 examples against 20 live stateful MCP servers exposing 343 tools (arXiv:2606.03892). The environments are stateful: account balances change, calendar events persist, shopping carts accumulate items. This captures failure modes that static cached APIs miss, because the model has to track what changed since the last call.
PROVE’s multi-component programmatic reward is the design choice worth paying attention to. One component is an adaptive efficiency penalty that explicitly counters the verbosity incentive built into recall-based RL rewards. The problem is straightforward: if you reward a model for getting the right answer, and the model can hedge by calling more tools to maximize ground-truth coverage, it will emit excess tool calls. The efficiency penalty directly counters that incentive, trading coverage for parsimony.
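To see why a recall-only reward invites extra calls, consider a toy reward function. This is an illustrative sketch and not PROVE’s reward: the function name, the fixed `penalty_weight`, and the tool names are assumptions, and PROVE’s penalty is described as adaptive rather than a constant.

```python
def gated_reward(called: list[str], required: set[str], penalty_weight: float = 0.1) -> float:
    """Recall over the required tool calls, minus a penalty per call beyond
    the number required. Toy formulation (assumption), not PROVE's reward."""
    if required:
        recall = len(required & set(called)) / len(required)
    else:
        recall = 1.0  # nothing was required, so nothing was missed
    excess = max(0, len(called) - len(required))
    return recall - penalty_weight * excess


required = {"get_balance"}
parsimonious = ["get_balance"]
hedging = ["get_balance", "web_search", "get_calendar"]

# With no penalty, both trajectories earn the same reward,
# so nothing discourages the two extra calls.
print(gated_reward(parsimonious, required, penalty_weight=0.0))
print(gated_reward(hedging, required, penalty_weight=0.0))

# With a penalty, the hedging trajectory scores lower.
print(gated_reward(parsimonious, required))
print(gated_reward(hedging, required))
```

The weight controls the trade: too low and hedging stays free, too high and the model learns to skip calls it needs.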
The benchmark results, as of June 2026, show gains of up to +10.2 on BFCL Multi-Turn, +6.8 on τ²-bench, and +6.5 on T-Eval, per arXiv:2606.03892.
The calibration trap
Here is the structural limitation that neither paper fully addresses: PROVE’s efficiency penalty reduces unnecessary tool calls, but it does not explicitly measure what happens when a model is confident and wrong.
A model that skips a tool call because its learned policy says “I know this one” is efficient. If the model is right, everyone wins. If the model is wrong, the tool call that would have caught the error never happens, and there is no fallback. The policy has no uncertainty signal to act on; it has a learned behavioral pattern that correlates with correctness during training but does not measure its own confidence at inference time.
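If a gate did expose a confidence score for each skipped call, one standard way to check whether that score means anything is expected calibration error (ECE): bin the decisions by stated confidence and compare each bin’s average confidence with its observed accuracy. The sketch below assumes such scores and correctness labels are available from an audit set; neither paper provides them.

```python
def expected_calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and observed accuracy,
    computed over equal-width confidence bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical audit set: four skipped calls, each with 0.95 stated confidence,
# of which only two turned out to be correct without the tool.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
well_calibrated = expected_calibration_error([1.0, 1.0], [True, True])
print(overconfident, well_calibrated)
```

A large gap in the high-confidence bins is the confident-and-wrong case described above: the gate skips the call exactly where the tool would have helped.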
CalBench identifies a parallel problem in multi-agent coordination: agents that withhold information to preserve privacy can deprive teammates of exactly the data they need. The analogy to tool-gating is direct. An agent that withholds a tool call to save tokens is an agent that keeps its uncertainty private. If the uncertainty is well-calibrated (the agent is right to be confident), the silence is efficient. If it is not, the silence is a bug.
A framework for uncertainty-aware model-based RL, submitted in late May 2026, demonstrates that explicitly handling model uncertainty mitigates exploitation of inaccurate dynamics models, with successes in hardware learning and safe exploration. This is the direction PROVE would need to move toward: not just learning when tools are empirically unnecessary, but estimating the confidence of that prediction and escalating to a tool call when confidence drops below a threshold.
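A threshold-based escalation of this kind could look like the following sketch. It is hypothetical: `skip_confidence` is assumed to come from some calibrated estimator that neither PROVE nor AgentTrust supplies, and `fake_search` stands in for a real tool.

```python
from typing import Callable


def gated_answer(
    question: str,
    model_answer: str,
    skip_confidence: float,
    call_tool: Callable[[str], str],
    threshold: float = 0.9,
) -> tuple[str, bool]:
    """Return (answer, tool_was_called). The tool is skipped only when the
    estimated confidence in skipping meets the threshold."""
    if skip_confidence >= threshold:
        return model_answer, False
    return call_tool(question), True


def fake_search(query: str) -> str:
    # Stand-in for a real tool call (assumption for illustration).
    return f"tool result for: {query}"


confident = gated_answer("capital of France?", "Paris", 0.97, fake_search)
unsure = gated_answer("current EUR/USD rate?", "1.08", 0.55, fake_search)
print(confident)  # ('Paris', False)
print(unsure)     # ('tool result for: current EUR/USD rate?', True)
```

The gate is only as good as the confidence estimate feeding it. With a miscalibrated estimator, the threshold merely moves the failure rather than removing it.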
What this means for agent framework builders
For anyone building with current agent frameworks, the practical takeaway is that tool-gating is migrating from prompt engineering to a learned policy. The migration has two stages:
Replace the heuristic. Instead of "use the search tool when you need more information" in the system prompt, train or fine-tune a policy that learns when tool calls actually improve outcomes. PROVE shows this is feasible at the 4B to 8B parameter scale with moderate data requirements.
Audit the calibration. A learned policy is only as reliable as the model’s ability to distinguish genuine certainty from misplaced confidence. Without a calibrated confidence estimate, the policy will systematically skip tool calls on the exact errors that tools would have caught.
The ICML 2026 position paper on continual RL identifies four sources of non-stationarity after deployment that cause learned policies to drift. A tool-gating policy trained on June’s data will face different tool APIs, different failure modes, and different user behavior by September. If the policy is not updated, its learned efficiency gains become learned blind spots.
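One way a team might notice this drift is to audit a sample of skipped calls after deployment: run the tool anyway, compare its result with the model’s answer, and track the disagreement rate over a rolling window. The class below is a hypothetical sketch of that idea. The class name, window size, and error-rate threshold are assumptions and do not come from any of the cited papers.

```python
from collections import deque


class SkipAuditMonitor:
    """Rolling record of audited skip decisions. Flags retraining when the
    share of incorrect skips in a full window exceeds a threshold."""

    def __init__(self, window: int = 200, max_error_rate: float = 0.05):
        self.outcomes: deque[bool] = deque(maxlen=window)
        self.max_error_rate = max_error_rate

    def record(self, skip_was_correct: bool) -> None:
        self.outcomes.append(skip_was_correct)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def needs_retraining(self) -> bool:
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.max_error_rate


monitor = SkipAuditMonitor(window=10, max_error_rate=0.2)
for _ in range(10):
    monitor.record(True)           # the gate's skips all check out at first
before = monitor.needs_retraining()
for _ in range(3):
    monitor.record(False)          # after an API change, audited skips start failing
after = monitor.needs_retraining()
print(before, after)               # False True
```

Auditing costs the very tool calls the gate was trained to avoid, so the sampling rate is itself a budget decision.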
What is still unsolved
Three gaps are visible in the current work:
Confident-wrong detection. Neither PROVE nor AgentTrust provides a mechanism for catching the case where an agent’s high-confidence answer is wrong and the skipped tool would have corrected it. This is the failure mode that matters most in production, because it is silent: no error signal, no retry, no fallback.
Semantic threat coverage. AgentTrust demonstrates that rules-only gating yields no improvement on intent-dependent threats. The paper reports that its learned judge reaches 83.6% to 85.2% accuracy, but the gap between 85% and a reliable production gate is the gap between “works in the paper” and “works when the action is aws s3 cp s3://prod-db-backup ./exfil at 3 AM.”
Continual adaptation. The ICML position paper frames this generally for deployed RL. For tool-gating specifically: as APIs change, as new tools are added, and as the agent’s task distribution shifts, a static policy degrades. No current framework ships a mechanism for re-training or calibrating the gate post-deployment.
The migration from prompt heuristics to learned policies is real and the early results are credible. But the reliability of the gate depends on calibration quality, and calibration quality is the thing nobody is measuring yet.
Frequently Asked Questions
Can a team fine-tune tool-gating without a separate judge model?
PROVE eliminates the external judge model entirely, training with programmatic rewards on roughly 13K examples. That is roughly 8× fewer samples than AgenticQwen’s 100K-example pipeline, which does require a judge. PROVE’s training environments are limited to its 20 MCP servers, so domains outside that coverage (medical, legal, specialized industrial tooling) still need separate validation.
What tool categories does PROVE’s training environment actually cover?
The 20 stateful MCP servers span finance, productivity, commerce, travel, social media, IoT, developer tools, and knowledge management. Each environment persists state between turns, so the model must track changes like updated account balances and accumulated cart items. This gives the learned policy breadth across common tool types, but specialized domains like medical diagnostics or legal research are not represented.
What happens when the agent is confident but wrong about not needing a tool?
Neither PROVE nor AgentTrust detects this case. The uncertainty-aware RL framework from late May 2026 addresses a structurally similar problem (preventing unsafe actions when the dynamics model is wrong) and has been validated on physical hardware. But it has not been applied to tool-gating specifically. The missing component is a runtime confidence estimator that forces a tool call whenever the policy’s skip-confidence falls below a threshold, giving the gate a circuit breaker for its own errors.
How quickly does a deployed tool-gating policy degrade without retraining?
The ICML 2026 continual RL position paper identifies four distinct sources of post-deployment non-stationarity, including shifts in task distribution and environment dynamics. For a tool-gating policy, that means new API versions, deprecated endpoints, and changing user patterns all erode accuracy over time. No current framework ships a retraining or recalibration loop for the gate, so teams need to budget periodic retraining on fresh interaction data.