groundy
security

MCP Tool Description Poisoning: New Benchmark Shows Agents Trust Manuals That Lie

A new MCP benchmark shows GPT-4o susceptible to nearly 100% of attacks where a tool's description lies about its purpose, a gap runtimes and scanners cannot detect.

6 min · · · 3 sources ↓

What Tool Description Poisoning Does to an Agent

arXiv:2605.24069, a preprint submitted May 22 by Xikang Yang, introduces Tool Description Poisoning (TDP): an attack that injects malicious instructions into a tool’s descriptive metadata rather than its executable code. The agent reads a tool’s description to decide how and when to call it. If that description lies, the agent plans and executes actions based on falsified intent. The code itself can be benign.

This is a narrow but important attack surface. The MCP ecosystem has seen security scrutiny focused on two layers: remote code execution vulnerabilities in MCP servers, and prompt injection delivered through tool output. TDP operates on a third layer, the description metadata that agents implicitly trust when planning which tool to invoke and with what arguments. A malicious MCP server whose runtime passes every sandbox check and whose output is clean can still compromise an agent if its self-description is crafted to mislead.

The MCP-TDP Benchmark: 32 Test Cases Across 6 Risk Categories

Yang’s paper contributes the MCP-TDP Security Benchmark, the first dedicated benchmark for evaluating tool-description-level attacks against LLM agents. It contains 32 realistic test cases spanning 6 distinct risk categories, run in a high-fidelity sandbox environment designed to approximate the tool-calling loop agents use in production.

The benchmark’s design choice is worth noting. Rather than testing whether a model can be confused by adversarial text in general, it tests whether a model will act on a false claim about what a tool does. The distinction matters: the attack vector is the agent’s planning stage, not its output-parsing stage. The model reads a description that says “this tool sends a notification to the team channel” when the tool actually exfiltrates credentials. The model calls the tool with full confidence because the manual told it to.

The Firewall Fallacy

The paper identifies what it calls the “Firewall Fallacy”: common prompt-guardrail defenses are not merely bypassed by TDP attacks, but can be counterproductive. The mechanism is straightforward. Guardrails that filter or rewrite tool inputs and outputs operate downstream of the planning decision. By the time the guardrail inspects a tool call, the agent has already committed to an action based on the poisoned description. The guardrail sees a plausible-looking call to a plausible-looking tool and has no reason to block it.

This finding needs nuance. The paper’s claim that guardrails are “counterproductive” likely means current guardrails are poorly calibrated for this threat model, not that all input filtering is inherently harmful. The distinction between “this specific defense doesn’t work against this specific attack” and “defense is useless” is one the paper should be read carefully on.

GPT-4o’s Near-Perfect Attack Success Rate

The benchmark evaluated 8 mainstream LLMs. According to the paper, GPT-4o exhibited a nearly 100% Attack Success Rate (ASR) across six high-risk scenarios. The paper does not break out per-model ASR figures for the other seven models in the available abstract, so comparative rankings beyond GPT-4o’s worst-case performance are not reported here.

Corroborating evidence from the same week strengthens the concern. PoisonForge (arXiv:2605.23168) demonstrated that 11 of 12 open-weight models exceeded 70% ASR with only 10 poisoned examples embedded in 1,000 fine-tuning samples. PoisonForge’s finding suggests that the susceptibility is driven by poisoning design choices, not model scale. A small, well-crafted perturbation in training data is sufficient to produce reliably exploitable behavior across most tested models.

Emerging Defenses: Reactive Self-Correction and Certified Predicates

The paper proposes “Reactive Self-Correction” as a defense: the agent autonomously detects and reverts its own malicious actions after execution. The concept has intuitive appeal, turning the agent’s post-hoc reasoning against the poisoned planning. However, the paper does not quantify Reactive Self-Correction’s effectiveness against TDP specifically in the available abstract, so claims about its practical utility should be treated as preliminary.

These two defenses represent opposite points on a design spectrum. Reactive Self-Correction keeps the agent in the loop and asks it to fix its own mistakes. The certified-predicate approach removes the agent’s discretion for specific safety-critical actions and delegates the decision to a deterministic gate. Neither is a complete answer. Reactive Self-Correction depends on the agent’s ability to recognize its own compromised behavior, which is the same capability TDP exploits in the first place. Certified predicates require enumerating safe behavior in advance, which scales poorly as toolkits grow.

What This Means for MCP Registry Operators and Agent Builders

The structural implication is direct. Hardened MCP runtimes, transport-level encryption, and registry code scanners all verify that a tool’s implementation matches some declared standard. None of them verify that a tool’s description matches what the tool actually does. A malicious server whose code is clean and whose transport is encrypted but whose description misrepresents its behavior will pass every existing checkpoint.

For MCP registry operators, this introduces a new verification requirement: semantic alignment between declared tool behavior and observed tool behavior. That is a harder problem than code signing or dependency scanning. It requires either runtime observation of tool behavior against claimed behavior, or a trusted-attestation model where tool descriptions are signed by a verifier who has tested the tool’s actual effects.

For agent builders, the paper’s results suggest that tool-calling architectures need a distrust layer. Agents that select tools based solely on self-reported descriptions are trusting a trust boundary that has no enforcement mechanism. The options are to add post-call verification (the Reactive Self-Correction approach), pre-call certification (the certified-predicate approach), or both.

The tool-description layer is not a new attack surface in principle. Any system that selects actions based on self-reported metadata has this property. What the benchmark shows is that current LLM agents are acutely vulnerable to it, that the vulnerability is nearly total for at least one widely deployed model, and that the defenses most operators assume they have do not apply here.

Frequently Asked Questions

Can Reactive Self-Correction catch a TDP attack that leaves no visible trace?

Only if the poisoned tool’s effects are observable. A tool that silently exfiltrates data through an encrypted channel or stages its malicious action for a later session produces nothing for the agent to notice post-execution. Reactive Self-Correction assumes the attack leaves detectable side effects, which a carefully crafted TDP payload would avoid by design.

How does the certified-predicate defense handle tools whose safety depends on calling context?

It does not. Certified predicates treat safety as a property of the tool itself, certifying that a file-read tool, for example, only reads files. But the same file-read call is safe in a summarization workflow and dangerous in an agent that also holds credential tokens. As MCP toolkits grow to serve multiple agent roles, static per-tool predicates break when context determines whether a call is benign.

Does fine-tuning on clean data protect a model against TDP?

Not reliably. PoisonForge achieved over 70% ASR across 11 of 12 open-weight models by embedding just 10 poisoned examples among 1,000 fine-tuning samples, a 1% contamination rate. Most organizations fine-tune on heterogeneous datasets they cannot fully audit. The attack’s efficiency at such low contamination rates means data curation alone is a weak defense.

Why does TDP risk grow faster than the MCP ecosystem’s ability to address it?

Every tool added to a production agent’s toolkit introduces a new self-reported description that is implicitly trusted. The attack surface scales linearly with toolkit size, but semantic verification of descriptions against actual tool behavior is a manual or semi-automated process with no established tooling. Adoption of MCP across agent frameworks is accelerating, expanding the trusted-description surface faster than verification capacity.

sources · 3 cited

  1. When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents primary accessed 2026-05-27
  2. PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs primary accessed 2026-05-27
  3. Hallucination as Exploit: Evidence-Carrying Multimodal Agents primary accessed 2026-05-27