Defending Agentic AI With Deception: Misdirecting Model-Guided Attacks

Rather than trying to intercept every malicious query, defenders can derail automated jailbreak pipelines by feeding them responses that look successful but do nothing useful. Reza Soosahabi and Vivek Namsani’s preprint, Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems, submitted to arXiv on June 18, 2026, introduces CMPE, a conversational misdirection technique. In PAIR and GPTFuzz runs, it cuts estimated attacker-success upper bounds by up to two orders of magnitude and nearly eliminates verified success. The tradeoff is a still-unmeasured cost to legitimate tasks.

Why does blocking fail against model-guided attacks?

Conventional detect-and-block defenses let attacker success rate approach one as the query budget grows, because a predictable refusal gives the attacker’s automated judge exactly the feedback it needs to keep searching. Modern automated jailbreak pipelines treat the target model as an oracle inside a search loop. Frameworks like PAIR and GPTFuzz generate prompt variants, send them to the target, and use a separate language-model judge to score whether each variant produced a useful policy-violating output. The judge is doing pattern matching: it looks for the absence of a refusal, the presence of the requested target behavior, or textual markers that signal compliance. When the target’s defense is a detect-and-block gate, every blocked prompt returns some variant of “I can’t help with that.” That is a clean, low-entropy signal. The judge marks the candidate as a failure and the mutator tries again, preserving the parts of the prompt that got closest to a non-refusal and discarding the rest.

The problem is structural. A refusal is not a dead end; it is a gradient. The attacker learns which prompt classes bounce off the filter and which ones get closer to a non-refusal. Given enough queries, the search converges on a string that either slips past the detector or triggers a corner case the filter was not trained to catch. Soosahabi and Namsani formalize this as a three-actor probabilistic model in arXiv:2606.20470: the target system, its defense mechanism, and the attacker’s automated judge. In that model, conventional blocking lets the attacker success rate asymptotically approach one because the defense’s response distribution is too informative. The more consistent the refusal, the better the feedback.
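The claim that blocking lets success climb toward one can be illustrated with a deliberately simple toy calculation. It is not the three-actor model from the paper: it assumes every query slips past the filter independently with a fixed probability `p_slip`, and both that name and the value 0.001 are illustrative assumptions. Real pipelines adapt between queries, so they may converge faster than this.

```python
def blocking_asr(p_slip: float, budget: int) -> float:
    """Chance that at least one of `budget` queries gets past the filter.

    Toy assumption: queries are independent and each slips through with
    the same probability `p_slip`. This is not the paper's formal model.
    """
    if not 0.0 <= p_slip <= 1.0:
        raise ValueError("p_slip must be a probability")
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return 1.0 - (1.0 - p_slip) ** budget


if __name__ == "__main__":
    # Hypothetical per-query slip rate; only the trend matters here.
    for budget in (10, 100, 1_000, 10_000):
        print(f"{budget:>6} queries -> {blocking_asr(0.001, budget):.4f}")
```

For any nonzero slip rate the value rises toward one as the budget grows, which is the qualitative behavior described above.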

This is why detect-and-block keeps losing the cost-exchange. Every blocked query costs the defender the same as every allowed query, while every blocked query also teaches the attacker something. A perfect detector would stop this, but perfect detectors do not exist for open-ended natural language. The asymmetry favors anyone with an API budget and a mutator.

How does detect-and-misdirect turn the attacker’s judge into the weak link?

Detect-and-misdirect breaks that feedback loop by returning controlled, non-operational responses to detected malicious interactions. These replies are engineered to induce false-positive errors in the attacker’s automated judge, lowering the positive predictive value of the candidate prompts it selects and producing a bounded asymptotic attacker success rate rather than one that climbs toward one.

Instead of a refusal, the defense answers a malicious prompt with something that looks like cooperation but is not. The reply might confirm a fake plan, echo a sanitized version of the request, or provide a plausible but inert next step. To the target’s own safety filters the response is harmless; to the attacker’s judge it looks like a successful jailbreak. The judge promotes the candidate, and the pipeline wastes its next rounds refining a prompt that already reached a dead end.

Misdirection does not remove the need for detection; it changes what happens after detection. The classifier can still be imperfect, but a detector false positive (a benign request flagged as malicious) no longer has to mean a blunt refusal. At worst, the legitimate user gets a confusing answer rather than a denial, though that only holds if the misleading response is itself harmless.

The mechanism is a precision attack on the attacker’s measurement. Soosahabi and Namsani frame detect-and-misdirect as a way to reduce the positive predictive value of attacker-selected candidates. A judge with low precision keeps greenlighting duds, so the attacker’s effective yield collapses even if the raw number of non-refusals stays high. The result is a bounded asymptotic attacker success rate (ASR): the attacker can keep spending queries, but the fraction that are genuinely useful stops rising.
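Positive predictive value follows from Bayes' rule, so the effect of extra judge false positives can be sketched in a few lines. The function name, parameter names, and the rates in the example calls are illustrative assumptions, not figures from the paper.

```python
def judge_ppv(true_positive_rate: float,
              false_positive_rate: float,
              base_rate: float) -> float:
    """Fraction of judge-approved candidates that genuinely succeeded.

    base_rate: share of candidate prompts that really produce harmful output.
    true_positive_rate: chance the judge approves a real success.
    false_positive_rate: chance the judge approves a dud (e.g. a decoy reply).
    """
    for value in (true_positive_rate, false_positive_rate, base_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be probabilities")
    approved_real = true_positive_rate * base_rate
    approved_duds = false_positive_rate * (1.0 - base_rate)
    approved = approved_real + approved_duds
    # With nothing approved, no candidate is selected, so report zero yield.
    return approved_real / approved if approved > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical rates: 1% of candidates truly work, judge catches 90% of them.
    print(judge_ppv(0.9, 0.01, 0.01))  # judge is rarely fooled
    print(judge_ppv(0.9, 0.80, 0.01))  # judge approves most decoys
```

Holding the other two rates fixed, raising the judge's false-positive rate pushes the value down, so more of the candidates the attacker chooses to refine are duds.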

This inverts the usual economics. The unreliable component is no longer the defender’s filter; it is the attacker’s own automated judge. The pipeline still runs, but it is busy confirming failures. That is the core wager of deception-based defense: misdirection is cheaper than perfect detection.

| Defense | What the attacker sees | ASR trend as queries grow | Main limitation |
| --- | --- | --- | --- |
| Detect-and-block | Predictable refusal text | Approaches one | Gives the judge a clean negative signal |
| Detect-and-misdirect / CMPE | Plausible but non-operational response | Bounded; estimated upper bound cut by up to two orders of magnitude | Risk of degrading benign task completion; untested against adaptive attackers |

How does CMPE work in practice?

CMPE, or Contextual Misdirection via Progressive Engagement, replaces predictable refusal text with safe but strategically misleading conversational responses in a lightweight, turn-by-turn interaction. The proof-of-concept is intentionally not a heavyweight rewrite of the target model. It operates at the response layer: when the defense classifies an interaction as malicious, it does not drop the conversation; it continues it with content that appears contextually appropriate but contains no actual harmful output.

The “progressive engagement” part matters. Rather than one flat redirection, the method keeps the attacker in a conversation, feeding responses that maintain plausibility across multiple turns and deepen the judge’s false confidence. A single-turn refusal is easy to classify; a multi-turn plausible-looking exchange is harder for an external judge to label as a failure, especially when the judge is itself a model looking for surface markers of compliance.
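The paper's implementation details are not reproduced here, so the following is only a sketch of what a response-layer misdirection wrapper could look like. Everything in it is hypothetical: the `toy_detector` keyword check stands in for a real classifier, `DECOY_TURNS` holds made-up inert replies, and keeping a session on the decoy path once it has been flagged is just one possible reading of progressive engagement.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical inert replies: they sound cooperative but carry no harmful content.
DECOY_TURNS = [
    "Sure, I can walk through that. First, confirm which environment you mean.",
    "Understood. Step one is prepared; tell me when you want to continue.",
    "Step two is staged. Let me know which output format you need.",
]


@dataclass
class Session:
    flagged_turns: int = 0  # number of decoy replies sent so far
    transcript: List[str] = field(default_factory=list)


def toy_detector(prompt: str) -> bool:
    """Stand-in for a real classifier; a keyword match is only a placeholder."""
    return "ignore previous instructions" in prompt.lower()


def respond(session: Session,
            prompt: str,
            model: Callable[[str], str],
            detector: Callable[[str], bool] = toy_detector) -> str:
    """Return the model's answer, or the next decoy turn for flagged sessions."""
    if detector(prompt) or session.flagged_turns > 0:
        # Assumption: once flagged, the session stays on the decoy path.
        index = min(session.flagged_turns, len(DECOY_TURNS) - 1)
        reply = DECOY_TURNS[index]
        session.flagged_turns += 1
    else:
        reply = model(prompt)
    session.transcript.append(reply)
    return reply
```

Making the decoy path sticky keeps the exchange plausible across turns, but it also means a single detector false positive affects every later turn of a legitimate session.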

Soosahabi and Namsani evaluate CMPE against end-to-end runs of PAIR and GPTFuzz. According to the v1 preprint, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in those runs. Those are strong numbers, but the caveats are just as important. The evaluation covers two specific frameworks, both of them earlier model-guided jailbreak pipelines. Human-guided attackers, adaptive attackers who retrain or recalibrate their judge, and newer search methods are not in the benchmark.

The preprint was submitted to arXiv on June 18, 2026, with its DOI registered but still pending at the time of writing. That makes the result timely, not settled.

What does the defense cost on benign tasks?

The paper quantifies the drop in attacker success rate, but it does not quantify the cost to legitimate task completion, so any claim that misdirection is free is unsupported by this work. The central tradeoff is easy to state and hard to measure. A response engineered to fool an external judge may also confuse a legitimate user, a downstream tool call, or the agent’s own memory. If a benign but borderline query triggers the same defense path, the agent could return a fake confirmation or an inert plan instead of a refusal or a real answer. The failure mode is silent: the user gets a plausible-looking nothing, which is worse in some workflows than an explicit “I can’t do that.”

The CMPE paper does not report benign-task accuracy, latency, or user-satisfaction metrics. That absence is not a minor omission; it shifts the burden of proof onto anyone who deploys the technique. A defender who plants misleading signals now has to demonstrate that those signals do not poison normal behavior. That requires a held-out set of legitimate agent tasks, adversarial examples drawn from real user traffic, and a decision rule for when the degradation is worse than the attack it prevents.

Honeypots in classical security have the same dual ledger: they waste an attacker’s time, but they also consume defender attention and can ensnare benign scanners. The honest evaluation for CMPE is therefore a two-axis plot: attacker success rate on one axis and benign task completion on the other. A defense that drives ASR to zero while dropping legitimate completion by ten percent is not obviously better than one that only halves ASR but leaves legitimate completion intact.

:::caution
The headline “up to two orders of magnitude” describes a reduction in estimated ASR upper bounds on PAIR and GPTFuzz, not measured real-world attacker success. Treat the number as a benchmark ceiling, not a field guarantee.
:::

:::tip
Before adopting CMPE or any detect-and-misdirect layer, baseline your agent’s task-completion rate on legitimate traffic, then rerun the benchmark with the defense enabled. The number you care about is the difference, not the absolute ASR drop.
:::
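The baseline-then-rerun comparison can be sketched as a small helper. The `EvalRun` type, its field names, and the counts in the example are hypothetical placeholders, not results from the paper or from any real deployment.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EvalRun:
    attack_attempts: int
    attack_successes: int   # verified successes, not judge-reported ones
    benign_tasks: int
    benign_completed: int

    @property
    def asr(self) -> float:
        return self.attack_successes / self.attack_attempts

    @property
    def completion(self) -> float:
        return self.benign_completed / self.benign_tasks


def tradeoff(baseline: EvalRun, defended: EvalRun) -> Dict[str, float]:
    """Report both axes: how much ASR fell and how much benign completion fell."""
    return {
        "asr_drop": baseline.asr - defended.asr,
        "completion_drop": baseline.completion - defended.completion,
    }


if __name__ == "__main__":
    # Made-up counts purely to show the shape of the comparison.
    baseline = EvalRun(attack_attempts=200, attack_successes=40,
                       benign_tasks=500, benign_completed=450)
    defended = EvalRun(attack_attempts=200, attack_successes=2,
                       benign_tasks=500, benign_completed=425)
    print(tradeoff(baseline, defended))
```

Reporting the two drops together keeps the benign-task cost visible next to the security gain, leaving the decision rule for an acceptable exchange to the deploying team.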

Where does misdirection fit in the agent-defense stack?

CMPE belongs to a broader class of deception-based defenses that are reappearing as agentic AI systems expand the attack surface beyond a single chat turn. Agentic AI security is a systems problem, not a prompt problem. A March 2026 survey accepted to USENIX Security 2026, The Attack and Defense Landscape of Agentic AI, maps the field across seven design dimensions: input trust, access sensitivity, workflow, action, memory, tool, and user interface. CMPE sits mainly in the input-trust layer, where the system decides what to do with an incoming prompt. But its effects ripple outward: a misled agent might make a bad tool call, write a misleading memory entry, or expose a harmful action through the user interface.

Parallel work is exploring deception elsewhere in the stack. Adversa AI’s June 2026 agentic-security roundup catalogs AgentShield, a defense that plants honeytokens and decoy tools so a compromised tool-using agent reveals itself. That is a different shape of misdirection: it fools the agent about its environment rather than fooling an external judge about the agent’s output. Both approaches borrow from the same conceptual lineage. A 2025 survey, AI Deception: Risks, Dynamics, and Controls, treats AI deception as the induction of false beliefs to obtain self-beneficial outcomes and formalizes the idea with signaling theory from animal-deception research. Defensive misdirection inverts the sign: the false belief is planted to protect the system, not to exploit the user.

The risk is that deception cuts both ways. If an attacker knows or suspects that a target uses CMPE, they can change the judge, add a verification step, or ask for concrete action rather than textual compliance. The defense is only as strong as the attacker’s inability to distinguish misdirection from real success. That is why CMPE is best read as one layer in a stack that still includes access controls, output filtering, and monitoring, rather than as a replacement for them.

For builders, the practical takeaway is conditional. If your agent faces automated jailbreak search in volume, detect-and-misdirect is a plausible addition to the arsenal because it raises the attacker’s cost without requiring a perfectly accurate classifier. But deploy it only with instrumentation: measure both ASR against red-team pipelines and task-completion rates on real traffic. Deception is a tactic; the strategy is still to make the attacker’s model less useful than the defender’s.

Frequently Asked Questions

Does CMPE protect against human red teams, or only automated pipelines?

CMPE is aimed at model-guided automated judges such as those inside PAIR and GPTFuzz. A human attacker who reads the reply and checks for real harmful action would not be fooled, because the defense wins by tricking a pattern-matching language-model judge rather than by hiding the failure from a person.

How is CMPE different from randomizing refusal text?

Randomizing refusals lowers the signal quality of each block, but it still tells the judge the prompt failed. CMPE feeds the judge a false positive, so the attacker burns budget refining a candidate that already reached a dead end.

What telemetry should a team add before enabling CMPE?

Teams need logs that pair detected attacks with downstream agent actions, plus a held-out set of legitimate tasks. A misdirected benign query gives a plausible but wrong response, so standard refusal logs alone will not catch the resulting silent failure.
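One way to structure such a log is sketched below. The `DefenseEvent` fields and the `silent_failures` helper are illustrative assumptions, not a schema from the paper.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DefenseEvent:
    session_id: str
    turn: int
    defense_path: str                      # "pass" or "misdirect"
    downstream_actions: List[str] = field(default_factory=list)
    labeled_benign: Optional[bool] = None  # filled in by later human review


def to_log_line(event: DefenseEvent) -> str:
    """Serialize one event as a JSON line for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)


def silent_failures(events: List[DefenseEvent]) -> List[DefenseEvent]:
    """Misdirected turns that review later judged to be legitimate requests."""
    return [
        e for e in events
        if e.defense_path == "misdirect" and e.labeled_benign is True
    ]
```

Recording the downstream actions next to the defense path makes it possible to trace what a misdirected turn caused later in the session, which a refusal counter alone would not show.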

What happens if the attacker swaps the automated judge for a human reviewer?

The bound on attacker success no longer applies. A human reviewer can verify whether the response actually performs the requested harmful action, while CMPE is tuned to exploit the surface markers that language-model judges rely on.

Could an adaptive attacker learn to detect CMPE-style misdirection?

Yes. If an attacker gathers enough examples, they can train a verifier to demand concrete outcomes instead of textual compliance. That would force defenders to update both the detector and the misleading response distribution to keep the judge confused.