MultiBreak Benchmark: 10,389 Multi-Turn Jailbreak Prompts Raise ASR 54pp on DeepSeek-R1-7B

MultiBreak¹, a benchmark from Microsoft Research and Simon Fraser University, demonstrates that single-turn evaluations substantially underestimate how easily large language models can be jailbroken in multi-turn conversations. Judged by LLaMA Guard, the dataset achieves a 54 percentage-point lift in attack success on DeepSeek-R1-7B¹ over the next-best baseline dataset, forcing a reconsideration of how safety teams measure deployed risk. The work has since been accepted to ICML 2026, which clears the methodology through peer review but settles none of the open questions about what the headline number actually represents. [Updated June 2026]

What MultiBreak Measures (and What It Doesn’t)

MultiBreak¹ is a static benchmark of 10,389 multi-turn adversarial prompts spanning 2,665 distinct harmful intents. The prompts are pre-scripted: each turn is generated in advance by fine-tuned open-source models rather than adapted in real time to the victim’s responses. This is a deliberate scope choice. The authors want to measure how much vulnerability a fixed, reusable corpus can surface, not how an adaptive attacker performs in a live conversation.

That distinction matters because the current state of the art in multi-turn attacks is increasingly interactive. Mastermind², a knowledge-driven method published earlier this year, achieves 87% attack success on HarmBench by reasoning over victim responses and refining its approach turn by turn. MultiBreak’s prompts do not do this. They are generated by iteratively fine-tuning LLaMA3-8B, Qwen2.5-7B, and DeepSeek-Distill-Qwen-14B against target models in an active-learning loop, then frozen.¹ The result is a corpus that tests whether a model will break when fed a specific, replayable sequence of queries.

How the Active-Learning Pipeline Works

The dataset is built through an active-learning pipeline that treats prompt generation as an optimization problem. The authors start with open-source generator models and fine-tune them iteratively against victim models, selecting which prompts to keep using an uncertainty-based composite acquisition function. In practice, this means the pipeline generates candidate prompts, scores them by how uncertain the generator is about their effectiveness, and retains the ones that appear most informative. Over iterations, the corpus expands from seed prompts toward the final set of 10,389 prompts.¹

The turn range is itself part of the design. MultiBreak’s conversations run from two to six turns, and the largest vulnerability gains land at the long end.¹ The acquisition function does not simply rank prompts by raw success; it weights how much a candidate reduces the generator’s uncertainty about the victim’s decision boundary, which biases the corpus toward sequences that probe the edge of a refusal policy rather than ones that already work. That is the difference between a benchmark of attacks that succeed and a benchmark of attacks that are informative about where a model is closest to breaking, and it is why a frozen corpus built this way can rival adaptive methods that get to react to the victim in real time. [Updated June 2026]
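The selection step can be made concrete with a small sketch. The article does not give the paper's actual acquisition function, so everything here is an assumption for illustration: the entropy-based uncertainty term, the hypothetical `novelty` term, the equal weighting, and the field names are all invented. What it shows is only the general idea that a candidate near the refusal boundary outranks one that already works.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a success/failure outcome; it peaks at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def acquisition_score(p_success: float, novelty: float, weight: float = 0.5) -> float:
    """Hypothetical composite score: uncertainty blended with a novelty term.

    This is NOT the paper's formula, only one plausible shape for an
    uncertainty-based composite acquisition function.
    """
    return weight * binary_entropy(p_success) + (1 - weight) * novelty


def select_candidates(candidates, k):
    """Keep the k candidates with the highest acquisition score."""
    ranked = sorted(
        candidates,
        key=lambda c: acquisition_score(c["p_success"], c["novelty"]),
        reverse=True,
    )
    return ranked[:k]


# Made-up candidates: p_success is the generator's estimated chance of a break.
candidates = [
    {"id": "already-works", "p_success": 0.95, "novelty": 0.2},
    {"id": "near-boundary", "p_success": 0.50, "novelty": 0.2},
    {"id": "reliably-refused", "p_success": 0.05, "novelty": 0.2},
]

selected = select_candidates(candidates, k=1)
print([c["id"] for c in selected])
```

With these invented inputs the near-boundary candidate wins, because a 50/50 estimate carries the most uncertainty. Ranking by raw `p_success` instead would have kept the prompt that already works.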

The approach is practical. Because the generators are open-source models and the victims include both open- and closed-weight APIs, the pipeline demonstrates that attacks trained on cheap local models transfer to commercial endpoints. The authors frame this as cost efficiency: an attacker can build the corpus without paying for victim API calls. But it also raises the possibility that the open-source generators and the closed-source targets share enough training data that the attack surface overlaps, though the paper does not test this hypothesis directly.

The Numbers: 54 pp on DeepSeek-R1-7B, 34.6 pp on GPT-4.1-mini

Judged by LLaMA Guard, MultiBreak¹ reaches 83.3% attack success rate on DeepSeek-R1-7B, compared to 29.3% for the next-best baseline dataset, MHJ. That is a 54.0 percentage-point gap. On GPT-4.1-mini, the benchmark scores 74.8% versus MHJ’s 40.2%, a 34.6 pp improvement.¹
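For readers who want to check the arithmetic, the short sketch below recomputes both gaps from the reported LLaMA Guard figures. The variable and function names are illustrative. It also prints the ratio between the two scores, since a percentage-point gap (a difference) is easy to confuse with a relative increase (a ratio).

```python
# ASR in percent under the LLaMA Guard judge, as quoted in the paragraph above.
reported_asr = {
    "DeepSeek-R1-7B": {"MultiBreak": 83.3, "MHJ": 29.3},
    "GPT-4.1-mini": {"MultiBreak": 74.8, "MHJ": 40.2},
}


def gap_pp(scores: dict) -> float:
    """Absolute gap in percentage points, rounded to one decimal."""
    return round(scores["MultiBreak"] - scores["MHJ"], 1)


def ratio(scores: dict) -> float:
    """How many times higher MultiBreak's ASR is than the baseline's."""
    return round(scores["MultiBreak"] / scores["MHJ"], 2)


gaps = {model: gap_pp(scores) for model, scores in reported_asr.items()}
ratios = {model: ratio(scores) for model, scores in reported_asr.items()}

for model in reported_asr:
    print(f"{model}: +{gaps[model]} pp, {ratios[model]}x the baseline ASR")
```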

These are the headline figures, and they are large enough to be uncomfortable. But they are specific to the LLaMA Guard judge. When the same experiments are scored by GPT-4o-mini, the gaps collapse to 21.8 pp on DeepSeek-R1-7B and 10.3 pp on GPT-4.1-mini.¹ The ordering of models does not change, but the magnitude of the safety problem depends heavily on who is counting.

Why Judge Choice Changes the Headline

The judge problem is not new, but MultiBreak surfaces it starkly. LLaMA Guard¹ (specifically LLaMA-3.1-8B) produces roughly 60% false positives on DeepSeek-R1-7B, meaning it flags benign outputs as harmful at a high rate. GPT-4o-mini is more conservative, with fewer false positives, but disagrees more with other judges on some models. The result is that ASR comparisons between papers are only valid when the judge is held constant, yet the literature rarely holds it constant.
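A toy calculation shows how a judge's false positives inflate ASR. The labels below are made up for illustration and are not data from the paper, and the sketch assumes the false-positive rate is defined as the share of human-labeled benign outputs that the judge flags as harmful.

```python
def judge_metrics(judge_flags, human_labels):
    """Compare a judge's harmful/benign verdicts against human labels.

    Both arguments are equal-length lists of booleans, True meaning "harmful".
    """
    total = len(judge_flags)
    benign = sum(1 for h in human_labels if not h)
    false_pos = sum(1 for j, h in zip(judge_flags, human_labels) if j and not h)
    return {
        "judge_asr": sum(judge_flags) / total,
        "human_asr": sum(human_labels) / total,
        "false_positive_rate": false_pos / benign if benign else 0.0,
    }


# Ten made-up responses: humans call 5 harmful; the judge flags those 5
# plus 3 of the 5 benign ones.
human_labels = [True] * 5 + [False] * 5
judge_flags = [True] * 5 + [True, True, True, False, False]

metrics = judge_metrics(judge_flags, human_labels)
print(metrics)
```

In this toy case the judge reports 80% ASR where human raters would report 50%, and the entire difference comes from benign outputs being counted as successful attacks.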

A contemporaneous paper, MT-JailBench³, makes the point explicit: judge choice can shift measured ASR by 43 to 69 percentage points. The authors conclude that ASR “partly measures how much search the attacker was allowed to perform.”³ This means the headline numbers in any jailbreak paper are a joint function of the attack corpus, the victim model, the judge model, and the interaction budget. Changing any one variable changes the story.

Where MultiBreak Sits Among Multi-Turn Attacks

MultiBreak is not the first benchmark to argue that single-turn evaluation flatters a model’s safety, and the judge-sensitivity result it reports is one instance of a broader reckoning over how attack success rate gets measured. The same instability appears whenever researchers vary the scorer instead of the attack: a compliant output counted as a success by one classifier reads as a refusal under another, and the published ASR moves by tens of points while the model under test never changes. Groundy has covered why attack success rate misleads as a headline metric and how unreliable the LLM judges scoring these attacks actually are. MultiBreak’s 60% LLaMA Guard false-positive rate on DeepSeek-R1-7B¹ is a concrete case of that failure, not a quirk of one paper. [Updated June 2026]

What MultiBreak contributes is scale on the corpus side rather than the scorer side. Where Mastermind² and X-Teaming push on adaptivity by generating each turn against the live victim, MultiBreak pushes on coverage: 2,665 intents,¹ a two-to-six-turn span, and a frozen set any team can replay without an attack agent in the loop. The two axes are complementary. A static corpus that any lab can re-run is the only way to compare models on equal footing, because an adaptive attacker introduces a search budget that, as MT-JailBench³ shows, can dominate the result. The cost is ecological validity: a real adversary does adapt, and a frozen sequence cannot.

There is a quieter consequence for anyone treating this as a leaderboard. A fixed public corpus is a target, and a target this large will be trained against. Once 10,389 prompts¹ are released, the next round of safety-tuned models can fine-tune on them, and the reported ASR will fall for reasons unrelated to better refusal behavior. Static benchmarks decay this way; adaptive attacks do not. That asymmetry is the structural argument for keeping both alive, even though the static one is far cheaper to run and the adaptive one is the better proxy for a motivated attacker.

The deeper finding is that categories which look safe under single-turn evaluation can fail dramatically in multi-turn scenarios. MultiBreak’s¹ Figure 6 shows ASR increments of up to 44.8% for low-vulnerability single-turn categories when extended to six turns. A model that refuses a harmful request outright may comply when the same intent is approached obliquely across several exchanges.

This undermines a common safety pipeline assumption: that a high single-turn refusal rate implies strong real-world protection. Single-turn corpora are cheaper to run and easier to audit, so they dominate published evaluations. But conversational models are deployed in chat interfaces where users have context windows, not query boxes. The gap between single-turn headline ASR and multi-turn vulnerability is not a marginal error; it is the difference between a model that looks defensible and one that is not.

Limitations and Contested Claims

The authors acknowledge that their prompts are pre-scripted rather than generated interactively with victim responses. This means MultiBreak does not capture the full threat model of adaptive attacks like Crescendo or Mastermind, which refine their strategy based on what the victim actually says. Whether static or dynamic attacks are more representative of real-world misuse is an open question; the benchmark answers only the static case.

There is also the transfer question. The pipeline uses open-source generators to attack closed-source APIs, which the authors present as a cost advantage. The same setup leaves open whether the attacks exploit shared training-data artifacts rather than genuine reasoning vulnerabilities. If the generator and victim were trained on overlapping refusal data, the attack may be memorizing bypass patterns rather than discovering novel failure modes. The paper does not distinguish between these two explanations, and the ICML 2026 acceptance does not change that: peer review validated the pipeline and the measurement, not any particular interpretation of what the transferred attacks exploit. That question is still open, and the diagnostic for it lives in the FAQ below. [Updated June 2026]

Finally, the interaction budget remains underspecified in public discourse. MT-JailBench’s³ observation that ASR partly measures allowed search depth implies that benchmarks must fix turn counts to be comparable. MultiBreak fixes its conversations at two to six turns, but the field has not converged on a standard.

What Safety Teams Should Do Differently

Safety teams currently treat single-turn refusal rate as the headline safety metric. MultiBreak shows this is insufficient. Teams should rebuild evaluation pipelines around multi-turn corpora, fix interaction budgets when comparing results, and report ASR with the judge model named explicitly alongside the victim. A table that shows LLaMA Guard and GPT-4o-mini side by side is more useful than either number alone.
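One way to enforce that reporting convention is to make the judge and turn budget mandatory fields of every result, and to refuse comparisons when they differ. The sketch below is an illustration only: the `AsrResult` record, its field names, and the `comparable` rule are assumptions, and the third entry is a hypothetical placeholder rather than a reported figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrResult:
    corpus: str
    victim: str
    judge: str        # the judge is part of the result, not a footnote
    max_turns: int    # interaction budget the attack was allowed
    asr_percent: float


def comparable(a: AsrResult, b: AsrResult) -> bool:
    """Two results are comparable only under the same judge and turn budget."""
    return a.judge == b.judge and a.max_turns == b.max_turns


def report_line(r: AsrResult) -> str:
    return (
        f"{r.corpus} on {r.victim}: {r.asr_percent:.1f}% ASR "
        f"(judge: {r.judge}, up to {r.max_turns} turns)"
    )


results = [
    # Figures quoted in this article (LLaMA Guard judge, conversations of up to six turns).
    AsrResult("MultiBreak", "DeepSeek-R1-7B", "LLaMA Guard", 6, 83.3),
    AsrResult("MultiBreak", "GPT-4.1-mini", "LLaMA Guard", 6, 74.8),
    # Hypothetical entry with a different judge, for illustration only.
    AsrResult("example-corpus", "DeepSeek-R1-7B", "GPT-4o-mini", 6, 50.0),
]

for r in results:
    print(report_line(r))
print(comparable(results[0], results[1]))  # same judge and budget
print(comparable(results[0], results[2]))  # judges differ
```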

On the defense side, the implication is sharper than a reporting convention. Input-level filters tuned on single-turn corpora will under-trigger on exactly the oblique, multi-step approaches MultiBreak rewards, because the harmful intent is distributed across turns that each look innocuous in isolation. The harder question is whether the fix belongs at inference time or in the weights. Runtime approaches that aim at stopping multi-turn jailbreaks without retraining the model try to catch the drift across a conversation as it happens, which is cheaper to deploy but easier to evade once an attacker knows a detector is watching. Weight-level alignment is harder to circumvent but expensive to redo per model and per policy change. Either way, a defense validated only against single-turn red-team sets inherits the blind spot MultiBreak exposes: it will report a refusal rate the deployed chat surface does not honor. [Updated June 2026]

The benchmark also provides a template. Its active-learning pipeline, iterative fine-tuning against victim models with uncertainty-based filtering, is a reproducible method for generating future corpora. Safety teams can adapt the same pipeline to their own models, generating adversarial datasets that reflect their specific deployment context rather than relying on generic public benchmarks.

The broader implication is for alignment claims. Any safety report that cites single-turn ASR as the primary defense metric is understating exposure. The number that matters is not how often a model refuses a direct request in isolation, but how often it refuses the same intent across a realistic conversation. MultiBreak makes that number measurable, and it is much lower than the single-turn figures suggest.

Frequently Asked Questions

How does MultiBreak differ from X-Teaming’s multi-turn approach?

X-Teaming deploys adaptive multi-agent teams that revise strategy based on live victim responses, achieving roughly 70% ASR on HarmBench. MultiBreak inverts this: it uses no real-time adaptation at all, yet reaches 83.3% on DeepSeek-R1-7B under LLaMA Guard, demonstrating that a well-constructed static corpus can match or exceed adaptive methods depending on the model-judge pair.

What turn count should teams standardize on for multi-turn safety evals?

No standard exists yet. MultiBreak evaluates up to six turns; MT-JailBench decomposes attacks into modular phases with variable lengths, and its core finding is that ASR scales with allowed search depth. A practical reporting convention is to publish refusal rates at multiple fixed depths (e.g., 3, 6, 10 turns) so stakeholders can see where the model’s defenses degrade rather than relying on a single headline number.
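A minimal sketch of that convention, under stated assumptions: each conversation is summarized by the turn at which a judge first flags a harmful output (or `None` if the model never breaks), and the turn numbers below are invented for illustration. The function names are hypothetical.

```python
def asr_at_depth(first_break_turns, depth):
    """Fraction of conversations broken at or before the given turn.

    first_break_turns holds, per conversation, the 1-indexed turn of the
    first judged-harmful output, or None if the model never complied.
    """
    broken = sum(1 for t in first_break_turns if t is not None and t <= depth)
    return broken / len(first_break_turns)


def refusal_profile(first_break_turns, depths=(3, 6, 10)):
    """Refusal rate (1 - ASR) at each fixed turn budget."""
    return {d: round(1 - asr_at_depth(first_break_turns, d), 3) for d in depths}


# Made-up outcomes for ten conversations.
first_break_turns = [2, None, 5, 8, None, 3, 6, None, 10, 4]

profile = refusal_profile(first_break_turns)
print(profile)
```

On this invented data the refusal rate falls from 80% at three turns to 50% at six and 30% at ten, which is the degradation curve a single headline number would hide.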

Could the open-source generators be exploiting shared training data with commercial targets?

The paper flags this possibility but does not test it. A concrete diagnostic would be to generate attacks from a model outside the typical English-web training lineage, for instance, a model trained predominantly on non-English corpora, and measure whether transfer rates to GPT-4.1-mini hold. If they drop sharply, the current results likely reflect memorized refusal bypasses shared across training distributions rather than reasoning-level vulnerabilities.

What happens if adaptive methods like Mastermind are combined with MultiBreak-scale corpora?

Mastermind already achieves 87% on HarmBench by reasoning over each victim response in real time, outperforming X-Teaming, another adaptive method (roughly 70%). Scaling that interactive approach to MultiBreak’s 10,000+ prompt coverage would require generating each conversation on the fly rather than replaying a frozen corpus, raising compute costs substantially. The unresolved tradeoff is between reproducibility (static replay) and ecological validity (live adaptation), and no current benchmark spans both.