Stopping Multi-Turn LLM Jailbreaks Without Retraining the Model

Multi-turn jailbreaks, where an attack is distributed across several conversational turns rather than stuffed into a single prompt, achieve success rates above 70% against models hardened for single-turn safety, according to a study published in 2025. THRD (arXiv:2606.01738), released June 1, 2026, proposes a defense that sits entirely outside the model weights. It works at inference time, which means it can be deployed against any LLM accessible via API, including those whose weights you cannot touch.

The problem: safety breaks across turns

Most deployed LLM safety mechanisms evaluate one prompt at a time. That works when the harmful request is self-contained. It fails when the attacker builds context over multiple exchanges, each turn pushing the model’s conditioning slightly closer to compliance. THRD’s authors report that over 70% of multi-turn attacks require Turn 2 or later to detect, based on their analysis in the THRD paper’s evaluation section. The implication is straightforward: any defense that evaluates each turn in isolation will miss the majority of multi-turn attacks.

The attack-side data reinforces this. The Crescendo method achieves up to 98% success against GPT-4 by gradually escalating conversational context. The same study argues that multi-turn jailbreaking is roughly equivalent to resampling single-turn attacks multiple times, which raises a question about whether specialized multi-turn defenses are addressing a genuinely new threat vector or simply a higher-volume version of an old one. More on that tension below.

THRD’s architecture: four modules, one temporal score

THRD decomposes multi-turn defense into four modules that run on each conversational turn:

Turn-level Risk Assessor (TRA). Scores the current user turn for harmful intent using a prompted LLM. The paper experiments with three prompt calibration variants: Variant A provides explicit guidance on high-risk structural patterns, Variant B requires the scorer to generate both benign and harmful interpretations before scoring, and Variant C focuses on request structure rather than keyword matching (THRD paper, Section 3). Calibration matters here because an overly sensitive TRA propagates false alarms downstream, while an insensitive one lets early attack signals through.
Historical Context Analyzer (HCA). Aggregates the risk trajectory across all prior turns. This is where THRD’s core theoretical contribution lives: the authors argue that safety behavior in multi-turn interaction is trajectory-dependent, meaning earlier turns, including the model’s own prior responses, establish contextual momentum that progressively lowers the barrier to harmful compliance (THRD paper, Section 3).
Response Evaluator (RE). Examines the model’s actual response for harmful content, providing a third signal that can catch attacks that slip through intent analysis alone.
Decision Module. Combines the three signals using time-evolving scoring with attenuation-based modulation and trend-aware adjustment. The attenuation mechanism presumably downweights older turns, while trend-aware adjustment detects accelerating risk trajectories, though the paper does not publish the exact weighting formula.

Why training-free matters

Weight-level safety approaches, including RLHF and safety fine-tuning, require access to model internals. That rules out their use on closed, API-only models: Claude, GPT-4o, Gemini, and every other model where the weights live on someone else’s infrastructure. THRD’s training-free design means the defense wraps around the model at the application layer, intercepting inputs and outputs without modifying the model itself.

D-Judge (arXiv:2606.02640), accepted at ICML 2026, addresses the same multi-turn jailbreak problem but takes a weight-dependent approach: it rewrites the victim LLM’s responses before an attacker’s automated judge evaluates them, misaligning the feedback signal that multi-turn attacks rely on. The tradeoff is that D-Judge requires supervised fine-tuning and direct preference optimization, which confines it to models you can actually fine-tune. No head-to-head comparison between THRD and D-Judge appears in either paper as of June 2026.

The numbers: strong on paper, awaiting reproduction

According to the THRD paper’s results section, the framework reduces attack success rate (ASR) to 0.2, 4.0% against state-of-the-art multi-turn attacks, including tree-search-based and multi-agent collaborative methods, evaluated across two target models. Model utility, measured on MMLU and GSM8K, degrades by no more than 1.5%.

The three TRA calibration variants show clear performance differences in the paper’s ablation tables, suggesting that the quality of the initial risk assessment propagates significantly through the rest of the pipeline. The paper does not appear to break down ASR by attack method in a way that would reveal whether certain attack categories (e.g., multi-agent collaborative) are harder to defend against than others.

A wrinkle: maybe multi-turn attacks aren’t special

The concurrent empirical study on multi-turn jailbreaks complicates the framing. Its authors argue that multi-turn attack success is largely explained by repeated sampling: if you get enough attempts at a single-turn jailbreak, you eventually succeed, and multi-turn conversation simply provides those attempts within a single session. They also found that attack success is correlated among similar models, meaning a jailbreak that works on one model in a family is likely to work on its siblings, and that reasoning models with higher compute effort often show higher attack success rates.

If multi-turn jailbreaking is statistically equivalent to resampling, then a defense optimized for temporal risk accumulation may be solving a problem that rate-limiting and retry budgets could address more cheaply. THRD’s authors would presumably counter that the trajectory-dependent conditioning effect is real and not reducible to resampling statistics, but the paper does not include a controlled experiment isolating that variable. The debate is not settled.

What practitioners should take away

For teams deploying LLMs behind APIs they do not control, inference-time defense is the only option. THRD offers a structured approach: per-turn risk scoring, historical aggregation, and response evaluation feeding into a composite decision. The architecture is sound in principle, the reported numbers are competitive, and the training-free constraint is a genuine practical advantage.

The unresolved questions are the ones that always matter at deployment time. What does a three-LLM-call-per-turn overhead look like in latency and cost at production scale? How many legitimate multi-turn conversations get flagged as attacks? And does the defense hold up against attack methods not included in the authors’ benchmark suite? None of these have answers yet.

Frequently Asked Questions

What does running three extra LLM calls per user turn cost in inference dollars?

If the TRA, HCA, and RE modules each call an LLM comparable to the target model, a single defended turn consumes roughly four times the inference compute of an undefended one. For a fleet processing 10,000 turns per minute on a model charging $3 per million input tokens, the defense layer alone adds approximately $9 per million tokens of conversational input, before accounting for the Decision Module’s non-LLM overhead. The paper does not evaluate whether smaller, cheaper models can power the scoring modules without degrading detection rates.

Can an attacker adapt to THRD’s scoring over repeated sessions?

THRD’s risk assessor is itself a prompted LLM, so it inherits the same vulnerability class it is designed to defend against. An adversary who knows the defense is in place could craft turns that score low on the TRA’s calibration variant while still advancing a cumulative attack trajectory. The paper evaluates fixed attack suites but does not test adversarial adaptivity, where an attacker iteratively probes the defense and adjusts strategy across sessions. Prompt-based guardrails provide no formal robustness guarantee, unlike verified safety classifiers with published error bounds.

Could per-session retry caps achieve similar protection without the scoring layer?

The concurrent empirical study argues that multi-turn attack success is statistically similar to repeated single-turn sampling. If that model is accurate, capping turns per session or throttling flagged users could suppress attack success with zero additional inference cost. The unresolved question is whether THRD’s trajectory-dependent conditioning effect, where the model’s own prior responses erode the compliance barrier, operates through a mechanism that retry counting cannot capture. No controlled experiment in either paper isolates that variable.

Why are reasoning models more susceptible to multi-turn jailbreaks?

The concurrent study found that models allocating more compute to chain-of-thought or planning steps show higher jailbreak success rates under multi-turn attack. One hypothesis: extended reasoning gives the attacker’s contextual framing more intermediate tokens to influence, increasing the surface for gradual compliance drift. If confirmed, this creates a direct cost tension for teams deploying reasoning-enhanced models: stronger reasoning may demand stronger (and more expensive) inference-time defense, compounding the already high per-query price of extended thinking.