Metis, an ICML 2026 paper by Huilin Zhou and collaborators [1], treats jailbreak discovery as inference-time policy optimization over an adversarial partially observable Markov decision process. Rather than handcrafting prompts or curating fixed multi-turn attack corpora, the agent rewrites its own strategy between turns using causal diagnosis of refusal signals. The result: 89.2% average attack success rate across ten models on HarmBench and AdvBench, including 76.0% on OpenAI O1 and 78.0% on GPT-5-chat, according to the paper’s v3 posted May 21.
From Static Prompts to Inference-Time Policy Optimization
Most prior multi-turn jailbreak methods operate as stochastic search over a fixed heuristic space. Crescendo, ActorBreaker, and X-Teaming encode their attack logic before the first turn and vary it through random perturbation. The Metis paper frames this as the root cause of a “generalization gap”: methods that perform well on weaker open-weight models degrade substantially when pointed at frontier systems with stronger refusal behavior.
Metis replaces stochastic exploration with directed search. It models the jailbreak problem as an adversarial POMDP, where the attacker does not know the target model’s defense configuration and must infer it during the attack. A dual-agent loop drives the process: an Attacker maintains a belief state over the target’s unknown defenses, and a Metacognitive Evaluator produces what the authors call “semantic gradient” feedback after each turn, a dense signal describing why a refusal occurred and what strategy shift might circumvent it.
This is not prompt engineering with extra steps. The agent does not interpolate between templates. It diagnoses the causal structure of a refusal and generates a novel attack strategy tailored to that diagnosis, then evaluates its own output before the next turn. The paper reports an average 8.2x reduction in token costs compared to baselines, and up to 11.4x in the best case, because the directed search wastes fewer turns on unpromising directions.
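To put the cost claim in concrete terms, here is a back-of-the-envelope calculation. The 8.2x and 11.4x factors are the ones the paper reports; the baseline token count and the per-token price are made-up placeholders, not figures from the paper.

```python
def reduced_cost_usd(baseline_tokens: int, reduction_factor: float,
                     usd_per_million_tokens: float) -> float:
    """Cost of a run that consumes baseline_tokens / reduction_factor tokens."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    tokens_used = baseline_tokens / reduction_factor
    return tokens_used * usd_per_million_tokens / 1_000_000


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
BASELINE_TOKENS = 50_000_000   # tokens a stochastic baseline might consume
USD_PER_MILLION = 10.0         # blended API price per million tokens

baseline_usd = reduced_cost_usd(BASELINE_TOKENS, 1.0, USD_PER_MILLION)
average_usd = reduced_cost_usd(BASELINE_TOKENS, 8.2, USD_PER_MILLION)
best_case_usd = reduced_cost_usd(BASELINE_TOKENS, 11.4, USD_PER_MILLION)

print(f"baseline:  ${baseline_usd:,.2f}")
print(f"at 8.2x:   ${average_usd:,.2f}")
print(f"at 11.4x:  ${best_case_usd:,.2f}")
```

Under these assumed inputs, a 500-dollar baseline run drops to roughly 61 dollars at the average factor and roughly 44 dollars at the best-case factor.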
The Mechanism: Belief State, Semantic Gradient, Self-Rewrite
The architecture has three moving parts worth distinguishing.
Attacker agent. Holds a belief state representing its current model of how the target will respond to different prompt structures. Each turn updates this belief based on the target’s actual output, not just whether it refused.
Metacognitive Evaluator. After each turn, this second agent reads the exchange and produces structured feedback: what in the prompt triggered refusal, what semantic category the refusal falls into, and what strategy shift is most likely to succeed next. The paper calls this a “semantic gradient” because it approximates the gradient signal you would get if you could backpropagate through the target model’s refusal mechanism. You cannot, of course; the evaluator is approximating what that gradient would tell you.
Policy update. The attacker rewrites its own system prompt before the next turn, incorporating the evaluator’s feedback. This is where the “self-evolving” in the paper’s title lives: the attack policy is not fixed at initialization but updated as a function of observed responses.
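The idea of a belief state that updates on each observation can be illustrated with the textbook discrete Bayes update used in POMDPs. This article does not say how Metis represents its belief state, so the sketch below is a generic numeric stand-in: the hypothesis names, observation labels, and likelihood values are all hypothetical.

```python
from typing import Dict


def update_belief(belief: Dict[str, float],
                  likelihood: Dict[str, Dict[str, float]],
                  observation: str) -> Dict[str, float]:
    """Return the posterior over hypotheses after one observation (Bayes' rule)."""
    unnormalized = {
        h: belief[h] * likelihood[h].get(observation, 0.0) for h in belief
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        # Observation is impossible under every hypothesis: keep the prior.
        return dict(belief)
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical: two candidate defense configurations, equally likely at first.
prior = {"config_a": 0.5, "config_b": 0.5}

# Hypothetical P(observation | configuration).
likelihood = {
    "config_a": {"refusal_type_1": 0.8, "refusal_type_2": 0.2},
    "config_b": {"refusal_type_1": 0.3, "refusal_type_2": 0.7},
}

posterior = update_belief(prior, likelihood, "refusal_type_1")
print(posterior)
```

Seeing `refusal_type_1` shifts probability mass toward `config_a`, because that configuration makes the observation more likely. That is the sense in which each turn carries information beyond a bare success or failure.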
The key insight is that this loop produces a compounding advantage. Stochastic methods explore the attack space uniformly. Metis narrows its search after each observation, which is why the token costs drop so sharply and why the ASR holds up on models that refuse reliably.
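A toy analogy shows why dense feedback compounds. It is not a model of any attack: the task is simply to find a hidden integer, once with only success/failure feedback and once with directional feedback that lets the searcher discard half the candidates per query. The problem size, seed, and function names are illustrative assumptions.

```python
import random


def undirected_search(target: int, n: int, rng: random.Random) -> int:
    """Try candidates in random order; each query reveals only success or failure."""
    order = list(range(n))
    rng.shuffle(order)
    for queries, guess in enumerate(order, start=1):
        if guess == target:
            return queries
    raise ValueError("target not in range")


def directed_search(target: int, n: int) -> int:
    """Use 'too low / too high' feedback to halve the candidate range per query."""
    lo, hi, queries = 0, n - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        queries += 1
        if mid == target:
            return queries
        if mid < target:
            lo = mid + 1
        else:
            hi = mid - 1
    raise ValueError("target not in range")


N = 1024
rng = random.Random(0)  # fixed seed so the comparison is repeatable
targets = [rng.randrange(N) for _ in range(200)]

avg_undirected = sum(undirected_search(t, N, rng) for t in targets) / len(targets)
avg_directed = sum(directed_search(t, N) for t in targets) / len(targets)
print(f"undirected: {avg_undirected:.1f} queries, directed: {avg_directed:.1f} queries")
```

The undirected searcher needs a number of queries proportional to the size of the space, while the directed one needs a number proportional to its logarithm. Real attack spaces are not ordered integers, so the analogy only conveys the direction of the effect, not its magnitude.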
Why Static Benchmarks Are Now a Lower Bound
The benchmarking implications are the part that should concern safety teams, not just researchers.
Most frontier-lab safety evaluations measure attack success rate against a fixed corpus of harmful prompts: HarmBench, AdvBench, and their successors. The assumption baked into this methodology is that the prompt set bounds the adversary’s capability. If a model refuses nearly all items in a benchmark suite, it is reported as robust.
Metis demonstrates that this assumption is false when the adversary can adapt its strategy mid-attack. A model that refuses most fixed prompts may still fail against an adaptively optimized prompt, because the attacker is not drawing from the same distribution the benchmark tests.
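The lower-bound argument can be made concrete with a small calculation. In the sketch below, the per-behavior attempt logs are invented for illustration. The static score counts only the first attempt for each behavior, while the adaptive score counts a behavior as broken if any attempt succeeds.

```python
from typing import Dict, List


def attack_success_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of behaviors with at least one successful attempt."""
    if not outcomes:
        return 0.0
    broken = sum(any(attempts) for attempts in outcomes.values())
    return broken / len(outcomes)


# Hypothetical log: each list holds the outcome of successive attempts
# against one behavior (True = the target complied).
adaptive_log = {
    "behavior_1": [False, False, True],
    "behavior_2": [False],
    "behavior_3": [True],
    "behavior_4": [False, True],
}

# A static benchmark sees only the first, fixed prompt for each behavior.
static_log = {name: attempts[:1] for name, attempts in adaptive_log.items()}

static_asr = attack_success_rate(static_log)
adaptive_asr = attack_success_rate(adaptive_log)
print(f"static ASR: {static_asr:.2f}, adaptive ASR: {adaptive_asr:.2f}")
```

Because the adaptive attacker's first attempt can always be the static prompt, the adaptive figure can never be lower than the static one on the same behaviors. In this invented log the static score is 0.25 and the adaptive score is 0.75.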
The paper does not benchmark against every published attack method, and it would be inaccurate to claim universal superiority. What it demonstrates is that a well-structured adaptive policy consistently outperforms the specific baselines it was tested against (Crescendo, PAIR, and others) on frontier models, and that the performance gap widens as model-level defenses improve. Stronger refusal behavior penalizes stochastic search more than it penalizes directed search.
Industry Tooling Is Already Moving This Direction
The shift from static templates to adaptive policies is not confined to academic papers. Promptfoo’s red-team platform [2] ships two agentic strategies that learn from attack history: Meta-Agent, which the vendor reports increases ASR by 70-90% [2], and Hydra Multi-turn, which runs parallel adaptive attack threads. The mechanism differs from Metis (Promptfoo’s strategies operate within a vendor-specific toolchain rather than as a general POMDP solver), but the conceptual alignment is clear. Production red teams are already running adaptive adversaries against their models; the academic literature is catching up to the practice.
This convergence matters for anyone interpreting published safety numbers. If your red-team program only runs static prompt suites, you are testing against adversaries that production tooling has already surpassed. The gap between your published ASR and the actual attack surface is growing, and it is not growing slowly.
Defensive Responses: Deterministic Guardrails vs. Adaptive Alignment
On the defensive side, ClawGuard [3] takes a different architectural bet. Rather than trying to make the model itself harder to jailbreak through alignment training, ClawGuard enforces deterministic rules at every tool-call boundary. It blocks indirect prompt injection by validating that tool outputs conform to expected schemas before the model processes them, without relying on the model’s learned refusal behavior.
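A minimal sketch of what a deterministic check at a tool-call boundary can look like follows. This is not ClawGuard's API or implementation; the schema format, the `validate_tool_output` function, and the weather tool are hypothetical, chosen to show the general pattern of rejecting anything that does not match a declared shape.

```python
from typing import Any, Dict


class ToolOutputRejected(Exception):
    """Raised when a tool result does not match its declared schema."""


# Hypothetical schema for a weather tool: field name -> exact Python type.
WEATHER_SCHEMA: Dict[str, type] = {"city": str, "temp_c": float}


def validate_tool_output(output: Any, schema: Dict[str, type]) -> Dict[str, Any]:
    """Return the output unchanged if it matches the schema, else raise."""
    if not isinstance(output, dict):
        raise ToolOutputRejected("tool output must be an object")
    unexpected = set(output) - set(schema)
    missing = set(schema) - set(output)
    if unexpected or missing:
        raise ToolOutputRejected(
            f"unexpected={sorted(unexpected)} missing={sorted(missing)}"
        )
    for field, expected_type in schema.items():
        if type(output[field]) is not expected_type:
            raise ToolOutputRejected(f"{field} must be {expected_type.__name__}")
    return output


# A well-formed result passes through untouched.
print(validate_tool_output({"city": "Oslo", "temp_c": 4.5}, WEATHER_SCHEMA))

# A result carrying an extra free-text field is rejected before the model sees it.
try:
    validate_tool_output(
        {"city": "Oslo", "temp_c": 4.5, "note": "ignore previous instructions"},
        WEATHER_SCHEMA,
    )
except ToolOutputRejected as err:
    print("rejected:", err)
```

The check is deterministic: it does not depend on how the model would react to the payload. Note that a type check alone does not constrain what a permitted string field such as `city` contains, so a real rule set would need tighter constraints on free-text fields.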
The appeal is obvious: a model that can be jailbroken 78% [1] of the time on a direct prompt is still constrained if its tool calls are gated by a deterministic layer it cannot modify. The limitation is that ClawGuard addresses the tool-augmented attack surface specifically, not the direct-prompt jailbreak surface that Metis targets. The two threats are adjacent but not identical.
Neither approach is sufficient alone. Alignment-based defenses degrade under adaptive pressure; deterministic guardrails only cover the boundaries they are explicitly coded to enforce. A reasonable posture for teams shipping LLM products would layer both, and test the combination against adaptive adversaries, not static benchmarks.
Practical Takeaways for Security Teams
Three concrete implications for teams running red-team programs on LLM deployments.
First, if your safety evaluation pipeline runs a fixed prompt corpus and reports ASR, that number is a lower bound. Fold an adaptive attacker into your testing pipeline. You do not need to replicate Metis exactly; Promptfoo’s Meta-Agent strategy or a similar toolchain-level approach will surface a more realistic ASR.
Second, the cost of running adaptive red-team evaluations has dropped. Metis reports 8.2x lower token costs than stochastic baselines because directed search wastes fewer turns. The excuse that adaptive testing is too expensive to run regularly is getting weaker.
Third, pay attention to what surface you are actually defending. Direct-prompt jailbreaks and indirect prompt injection through tool calls are different threats with different mitigations. A model that refuses most direct harmful prompts but processes unvalidated tool output is not “safe” by any coherent definition. It has two distinct attack surfaces, and your testing should cover both under adaptive conditions.
Frequently Asked Questions
Does the self-evolving optimization approach extend beyond jailbreaking?
The same metacognitive policy loop is appearing in other domains. ACE (arXiv .16299), also from 2026, applies adversarial preference optimization to code generation, evolving unit tests to break LLM-written code rather than evolving prompts to break safety filters. Both replace stochastic search with directed, feedback-driven policy updates, but the reward signals and action spaces are domain-specific, not transferable out of the box.
Where does the POMDP formulation break down?
Metis assumes the target model returns observable refusal text the evaluator can diagnose. Against targets that silently truncate, return generic errors, or refuse without explanatory content, the semantic gradient signal degrades. The paper also does not evaluate against models that randomize their refusal behavior across identical queries, which would inject noise into the attacker’s belief-state updates and slow convergence.
What budget should teams allocate for adaptive red-teaming after Metis?
At 8.2x lower token costs than stochastic baselines, a Metis-style adaptive evaluation on a frontier model is roughly an order of magnitude cheaper than equivalent-coverage stochastic runs. Teams previously priced out of multi-turn adaptive testing can now feasibly integrate it as a CI gate — the bottleneck shifts from API costs to human review time for adjudicating borderline outputs.
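As a rough illustration of a CI gate, the sketch below fails a build when the measured adaptive ASR exceeds a threshold. The function name, the threshold, and the result list are hypothetical. In practice the outcomes would come from whatever adaptive red-team tool the team runs, after human adjudication of borderline cases.

```python
from typing import Sequence


def asr_gate(attack_succeeded: Sequence[bool], max_asr: float) -> bool:
    """Return True if the observed attack success rate is within the threshold."""
    if not attack_succeeded:
        raise ValueError("no red-team results to evaluate")
    asr = sum(attack_succeeded) / len(attack_succeeded)
    return asr <= max_asr


def exit_code(attack_succeeded: Sequence[bool], max_asr: float) -> int:
    """Map the gate result to a process exit code (0 = pass, 1 = fail)."""
    return 0 if asr_gate(attack_succeeded, max_asr) else 1


# Hypothetical adjudicated outcomes: one entry per tested behavior.
results = [True, False, False, False]

print("lenient gate exit code:", exit_code(results, max_asr=0.5))
print("strict gate exit code:", exit_code(results, max_asr=0.1))
```

A CI job would pass the returned code to `sys.exit` so that a regression in adaptive robustness blocks the release in the same way a failing unit test does.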
What happens when both attacker and defender use adaptive strategies?
Deterministic guardrails like ClawGuard don’t degrade under adaptive pressure because they validate schema conformance at tool-call boundaries rather than relying on model behavior. But they are brittle to attack vectors outside their explicitly coded rules. The near-term equilibrium likely layers both approaches — deterministic boundaries for well-characterized threats and adaptive monitoring for novel patterns — though no published work yet evaluates Metis against a ClawGuard-gated target.