MCP Tool Description Poisoning: New Benchmark Shows Agents Trust Manuals That Lie

What Tool Description Poisoning Does to an Agent

arXiv:2605.24069, a preprint submitted May 22 by Shi Liu and six co-authors [Updated June 2026: the article previously credited co-author Xikang Yang as sole author; Shi Liu is the lead], introduces Tool Description Poisoning (TDP): an attack that injects malicious instructions into a tool’s descriptive metadata rather than its executable code. The agent reads a tool’s description to decide how and when to call it. If that description lies, the agent plans and executes actions based on falsified intent. The code itself can be benign.

This is a narrow but important attack surface. The MCP ecosystem has seen security scrutiny focused on two layers: remote code execution vulnerabilities in MCP servers, and prompt injection delivered through tool output. TDP operates on a third layer, the description metadata that agents implicitly trust when planning which tool to invoke and with what arguments. A malicious MCP server whose runtime passes every sandbox check and whose output is clean can still compromise an agent if its self-description is crafted to mislead.

The MCP-TDP Benchmark: 32 Test Cases Across 6 Risk Categories

Yang’s paper contributes the MCP-TDP Security Benchmark, the first dedicated benchmark for evaluating tool-description-level attacks against LLM agents. It contains 32 realistic test cases spanning 6 distinct risk categories, run in a high-fidelity sandbox environment designed to approximate the tool-calling loop agents use in production.

The benchmark’s design choice is worth noting. Rather than testing whether a model can be confused by adversarial text in general, it tests whether a model will act on a false claim about what a tool does. The distinction matters: the attack vector is the agent’s planning stage, not its output-parsing stage. The model reads a description that says “this tool sends a notification to the team channel” when the tool actually exfiltrates credentials. The model calls the tool with full confidence because the manual told it to.

The Firewall Fallacy

The paper identifies what it calls the “Firewall Fallacy”: common prompt-guardrail defenses are not merely bypassed by TDP attacks, but can be counterproductive. The mechanism is straightforward. Guardrails that filter or rewrite tool inputs and outputs operate downstream of the planning decision. By the time the guardrail inspects a tool call, the agent has already committed to an action based on the poisoned description. The guardrail sees a plausible-looking call to a plausible-looking tool and has no reason to block it.

This finding needs nuance. The paper’s claim that guardrails are “counterproductive” likely means current guardrails are poorly calibrated for this threat model, not that all input filtering is inherently harmful. The distinction between “this specific defense doesn’t work against this specific attack” and “defense is useless” is one the paper should be read carefully on.

GPT-4o’s Near-Perfect Attack Success Rate

The benchmark evaluated 8 mainstream LLMs. According to the paper, GPT-4o exhibited a nearly 100% Attack Success Rate (ASR) across six high-risk scenarios. The paper does not break out per-model ASR figures for the other seven models in the available abstract, so comparative rankings beyond GPT-4o’s worst-case performance are not reported here.

Corroborating evidence from the same week strengthens the concern. PoisonForge (arXiv:2605.23168), from Luze Sun and collaborators including Alina Oprea, demonstrated that 11 of 12 open-weight models (2B to 32B parameters) exceeded 70% ASR in their most vulnerable configuration, with only 10 poisoned examples embedded in 1,000 fine-tuning samples [Updated June 2026: the 70% figure is the worst-case configuration, not the average across all setups]. PoisonForge’s finding suggests that the susceptibility is driven by poisoning design choices, not model scale. A small, well-crafted perturbation in training data is sufficient to produce reliably exploitable behavior across most tested models.

TDP in Context: How It Stacks Up Against Real-World MCP Poisoning Benchmarks

The “When the Manual Lies” benchmark is synthetic by design: 32 hand-built scenarios in a sandbox. Two benchmarks released earlier test the same threat against tools that actually exist, and their results frame how much to read into a near-perfect ASR. [Updated June 2026: this section is new, adding real-world corroboration.]

MCPTox (arXiv:2508.14925) built 1,312 malicious test cases on top of 45 live production MCP servers exposing 353 real tools, then ran them against 20 LLM agents. Its worst performer, o1-mini, hit a 72.8% attack success rate. The most resistant, Claude-3.7-Sonnet, refused the poisoned calls more than 97% of the time. That spread is the part worth noting. Where the TDP paper reports near-total compromise for GPT-4o, MCPTox shows susceptibility is strongly model-dependent once the attack runs against tools with real schemas and observable side effects. A benchmark that fixes on one model’s worst case tells you the ceiling of the threat, not its distribution.

MCPSecBench (arXiv:2508.13220) takes the taxonomy view, cataloguing 17 distinct attack types across four attack surfaces: the prompt, the tool, the transport, and the client. Tool-description poisoning is one cell in that grid. Read together, the three papers place TDP not as a new category but as the sharpest demonstration of a known one, the planning-stage attack surface that MCPSecBench names and MCPTox measures in the field.

The closest sibling is the measurement work on description-code inconsistency. A study of 2,214 real MCP servers found that 9.93% of tool descriptions already diverge from what the handler code does, with no adversary involved. TDP weaponizes that gap deliberately; the inconsistency study shows the gap is already common in shipped code. The two bracket the problem: a runtime that selects tools by description is exposed whether the mismatch is malicious or merely sloppy.

Why the Planning Stage Trusts the Description

TDP works for a structural reason, not because a model is gullible in some fixable way. When an MCP host assembles a request, every available tool’s name, description, and parameter schema is serialized into the model’s context as part of the system prompt. The model does not see code, capabilities, or a permission manifest. It sees text, and that text arrives with the same epistemic status as the developer’s own instructions. No field in the tool schema marks a claim as unverified.

That is what makes the description a control-plane input rather than documentation. In a conventional API client, the function signature is binding and the doc comment is advisory; the compiler ignores the prose. In an agent, the prose is the dispatch logic. The model reads “sends a notification to the team channel,” matches it against the user’s intent, and emits a call. Nothing downstream re-reads the actual handler to confirm.

This also explains why the Firewall Fallacy holds. Learned tool-gating policies and input filters both operate on the call the model has already decided to make. They can score the arguments, rate-limit the tool, or demand confirmation, but they inherit the same poisoned premise: that the tool is what its description says. A guardrail that trusts the manifest cannot catch a lie told by the manifest.

None of this means a TDP payload is trivial to land. The attacker still needs a poisoned description to reach a production toolkit, which means compromising a registry entry, a supply-chain dependency, or a server the agent already trusts. The benchmark measures what happens after that step, not the odds of clearing it. The after-step result is the uncomfortable one: once a lying description is in context, the model’s competence works against it, because a more capable planner follows the manual more faithfully.

Emerging Defenses: Reactive Self-Correction and Certified Predicates

The paper proposes “Reactive Self-Correction” as a defense: the agent autonomously detects and reverts its own malicious actions after execution. The concept has intuitive appeal, turning the agent’s post-hoc reasoning against the poisoned planning. However, the paper does not quantify Reactive Self-Correction’s effectiveness against TDP specifically in the available abstract, so claims about its practical utility should be treated as preliminary.

These two defenses represent opposite points on a design spectrum. Reactive Self-Correction keeps the agent in the loop and asks it to fix its own mistakes. The certified-predicate approach removes the agent’s discretion for specific safety-critical actions and delegates the decision to a deterministic gate. Neither is a complete answer. Reactive Self-Correction depends on the agent’s ability to recognize its own compromised behavior, which is the same capability TDP exploits in the first place. Certified predicates require enumerating safe behavior in advance, which scales poorly as toolkits grow.

What This Means for MCP Registry Operators and Agent Builders

The structural implication is direct. Hardened MCP runtimes, transport-level encryption, and registry code scanners all verify that a tool’s implementation matches some declared standard. None of them verify that a tool’s description matches what the tool actually does. A malicious server whose code is clean and whose transport is encrypted but whose description misrepresents its behavior will pass every existing checkpoint.

For MCP registry operators, this introduces a new verification requirement: semantic alignment between declared tool behavior and observed tool behavior. That is a harder problem than code signing or dependency scanning. It requires either runtime observation of tool behavior against claimed behavior, or a trusted-attestation model where tool descriptions are signed by a verifier who has tested the tool’s actual effects.

For agent builders, the paper’s results suggest that tool-calling architectures need a distrust layer. Agents that select tools based solely on self-reported descriptions are trusting a trust boundary that has no enforcement mechanism. The options are to add post-call verification (the Reactive Self-Correction approach), pre-call certification (the certified-predicate approach), or both.

The tool-description layer is not a new attack surface in principle. Any system that selects actions based on self-reported metadata has this property. What the benchmark shows is that current LLM agents are acutely vulnerable to it, that the vulnerability is nearly total for at least one widely deployed model, and that the defenses most operators assume they have do not apply here.

Frequently Asked Questions

Can Reactive Self-Correction catch a TDP attack that leaves no visible trace?

Only if the poisoned tool’s effects are observable. A tool that silently exfiltrates data through an encrypted channel or stages its malicious action for a later session produces nothing for the agent to notice post-execution. Reactive Self-Correction assumes the attack leaves detectable side effects, which a carefully crafted TDP payload would avoid by design.

How does the certified-predicate defense handle tools whose safety depends on calling context?

It does not. Certified predicates treat safety as a property of the tool itself, certifying that a file-read tool, for example, only reads files. But the same file-read call is safe in a summarization workflow and dangerous in an agent that also holds credential tokens. As MCP toolkits grow to serve multiple agent roles, static per-tool predicates break when context determines whether a call is benign.

Does fine-tuning on clean data protect a model against TDP?

Not reliably. PoisonForge achieved over 70% ASR across 11 of 12 open-weight models by embedding just 10 poisoned examples among 1,000 fine-tuning samples, a 1% contamination rate. Most organizations fine-tune on heterogeneous datasets they cannot fully audit. The attack’s efficiency at such low contamination rates means data curation alone is a weak defense.

Why does TDP risk grow faster than the MCP ecosystem’s ability to address it?

Every tool added to a production agent’s toolkit introduces a new self-reported description that is implicitly trusted. The attack surface scales linearly with toolkit size, but semantic verification of descriptions against actual tool behavior is a manual or semi-automated process with no established tooling. Adoption of MCP across agent frameworks is accelerating, expanding the trusted-description surface faster than verification capacity.