MultiBreak[^1], a new benchmark from Microsoft Research and Simon Fraser University, demonstrates that single-turn refusal rates substantially underestimate how easily large language models can be jailbroken in multi-turn conversations. Scored by LLaMA Guard, the dataset achieves a 54 percentage-point higher attack success rate on DeepSeek-R1-7B[^1] than the next-best prior dataset, forcing a reconsideration of how safety teams measure deployed risk.
What MultiBreak Measures (and What It Doesn’t)
MultiBreak[^1] is a static benchmark of 10,389 multi-turn adversarial prompts spanning 2,665 distinct harmful intents. The prompts are pre-scripted: each turn is generated in advance by fine-tuned open-source models rather than adapted in real time to the victim’s responses. This is a deliberate scope choice. The authors want to measure how much vulnerability a fixed, reusable corpus can surface, not how an adaptive attacker performs in a live conversation.
That distinction matters because the current state of the art in multi-turn attacks is increasingly interactive. Mastermind[^2], a knowledge-driven method published earlier this year, achieves 87% attack success on HarmBench by reasoning over victim responses and refining its approach turn by turn. MultiBreak’s prompts do not do this. They are generated by iteratively fine-tuning LLaMA3-8B, Qwen2.5-7B, and DeepSeek-Distill-Qwen-14B against target models in an active-learning loop, then frozen.[^1] The result is a corpus that tests whether a model will break when fed a specific, replayable sequence of queries.
How the Active-Learning Pipeline Works
The dataset is built through an active-learning pipeline that treats prompt generation as an optimization problem. The authors start with open-source generator models and fine-tune them iteratively against victim models, selecting which prompts to keep using an uncertainty-based composite acquisition function. In practice, this means the pipeline generates candidate prompts, scores them by how uncertain the generator is about their effectiveness, and retains the ones that appear most informative. Over iterations, the corpus expands from seed prompts toward the final set of 10,389 prompts.[^1]
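The selection step can be illustrated with a generic uncertainty-based acquisition rule. This is a minimal sketch, not the paper's method: the article does not spell out the composite acquisition function, so binary entropy over a predicted success probability stands in for it, and the function names, candidate IDs, and probabilities are all hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Uncertainty of a predicted success probability; peaks at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_most_informative(candidates: dict[str, float], k: int) -> list[str]:
    """Return the k candidate IDs whose predicted outcome is least certain."""
    ranked = sorted(
        candidates,
        key=lambda cid: binary_entropy(candidates[cid]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical predicted success probabilities, keyed by candidate ID.
# Entropy is a stand-in; the paper's composite function may weigh other terms.
predicted = {"c1": 0.95, "c2": 0.52, "c3": 0.10, "c4": 0.40}
selected = select_most_informative(predicted, k=2)
```

Candidates whose outcome is nearly certain in either direction score low under this rule, which is the usual rationale for uncertainty sampling: labeling them teaches the generator little.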
The approach is practical. Because the generators are open-source models and the victims include both open- and closed-weight APIs, the pipeline demonstrates that attacks trained on cheap local models transfer to commercial endpoints. The authors frame this as cost efficiency: an attacker can build the corpus without paying for victim API calls. But it also raises the possibility that the open-source generators and the closed-source targets share enough training data that the attack surface overlaps, though the paper does not test this hypothesis directly.
The Numbers: 54 pp on DeepSeek-R1-7B, 34.6 pp on GPT-4.1-mini
Judged by LLaMA Guard, MultiBreak[^1] reaches 83.3% attack success rate on DeepSeek-R1-7B, compared to 29.3% for the next-best baseline dataset, MHJ. That is a 54.0 percentage-point gap. On GPT-4.1-mini, the benchmark scores 74.8% versus MHJ’s 40.2%, a 34.6 pp improvement.[^1]
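The arithmetic behind these gaps is simple but easy to misread, because a percentage-point gap is not the same as a relative lift. The sketch below recomputes both from the figures quoted above; the helper names are illustrative.

```python
def pp_gap(asr_new: float, asr_baseline: float) -> float:
    """Absolute difference between two ASR percentages, in percentage points."""
    return round(asr_new - asr_baseline, 1)


def relative_lift(asr_new: float, asr_baseline: float) -> float:
    """Ratio of the two ASR values (how many times higher the new one is)."""
    return round(asr_new / asr_baseline, 2)


# (MultiBreak ASR, MHJ ASR) under the LLaMA Guard judge, as quoted in the text.
results = {
    "DeepSeek-R1-7B": (83.3, 29.3),
    "GPT-4.1-mini": (74.8, 40.2),
}

gaps = {model: pp_gap(new, base) for model, (new, base) in results.items()}
lifts = {model: relative_lift(new, base) for model, (new, base) in results.items()}
```

Reporting the gap in percentage points, as the article does, avoids the ambiguity of saying that attack success "rose by 54%".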
These are the headline figures, and they are large enough to be uncomfortable. But they are specific to the LLaMA Guard judge. When the same experiments are scored by GPT-4o-mini, the gaps collapse to 21.8 pp on DeepSeek-R1-7B and 10.3 pp on GPT-4.1-mini.[^1] The ordering of models does not change, but the magnitude of the safety problem depends heavily on who is counting.
Why Judge Choice Changes the Headline
The judge problem is not new, but MultiBreak surfaces it starkly. LLaMA Guard (specifically the LLaMA-3.1-8B version) produces roughly 60% false positives on DeepSeek-R1-7B outputs[^1], meaning it flags benign outputs as harmful at a high rate. GPT-4o-mini is more conservative, with fewer false positives, but disagrees more with other judges on some models. As a result, ASR comparisons between papers are valid only when the judge is held constant, yet the literature rarely holds it constant.
A contemporaneous paper, MT-JailBench[^3], makes the point explicit: judge choice can shift measured ASR by 43 to 69 percentage points. Its authors add that ASR “partly measures how much search the attacker was allowed to perform.”[^3] Taken together, the headline numbers in any jailbreak paper are a joint function of the attack corpus, the victim model, the judge model, and the interaction budget. Changing any one variable changes the story.
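A small sketch shows how a judge's false-positive rate can be measured against reference labels and how much it can inflate measured ASR. The labels are hypothetical, and the definition used here (false positives divided by reference-benign outputs) is an assumption; the paper may define the rate differently.

```python
def false_positive_rate(judge_flags: list[bool], reference_harmful: list[bool]) -> float:
    """Share of reference-benign outputs that the judge flagged as harmful."""
    if len(judge_flags) != len(reference_harmful):
        raise ValueError("label lists must have the same length")
    benign_flags = [
        flag for flag, harmful in zip(judge_flags, reference_harmful) if not harmful
    ]
    if not benign_flags:
        raise ValueError("no reference-benign outputs to evaluate")
    return sum(benign_flags) / len(benign_flags)


# Hypothetical labels for eight model outputs (True = harmful).
reference = [True, False, False, False, False, False, True, False]
judge = [True, True, True, False, True, False, True, False]

fpr = false_positive_rate(judge, reference)
measured_asr = sum(judge) / len(judge)            # what the judge would report
reference_asr = sum(reference) / len(reference)   # what the reference labels imply
```

In this toy case the judge reports an ASR of 62.5% where the reference labels imply 25%, which is why the judge needs to be named next to every ASR figure.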
The Single-Turn ASR Blind Spot
The deeper finding is that categories which look safe under single-turn evaluation can fail dramatically in multi-turn scenarios. Figure 6 of the MultiBreak paper[^1] shows ASR increments of up to 44.8% in categories that show low vulnerability under single-turn evaluation once the conversation is extended to six turns. A model that refuses a harmful request outright may comply when the same intent is approached obliquely across several exchanges.
This undermines a common safety pipeline assumption: that a high single-turn refusal rate implies strong real-world protection. Single-turn corpora are cheaper to run and easier to audit, so they dominate published evaluations. But conversational models are deployed in chat interfaces where users hold extended conversations rather than submitting isolated queries. The gap between single-turn headline ASR and multi-turn vulnerability is not a marginal error; it is the difference between a model that looks defensible and one that is not.
Limitations and Contested Claims
The authors acknowledge that their prompts are pre-scripted rather than generated interactively with victim responses. This means MultiBreak does not capture the full threat model of adaptive attacks like Crescendo or Mastermind, which refine their strategy based on what the victim actually says. Whether static or dynamic attacks are more representative of real-world misuse is an open question; the benchmark answers only the static case.
There is also the transfer question. The pipeline uses open-source generators to attack closed-source APIs, which raises the possibility that the attacks exploit shared training-data artifacts rather than genuine reasoning vulnerabilities. If the generator and victim were trained on overlapping refusal data, the attack may be reproducing memorized bypass patterns rather than discovering novel failure modes. The paper does not distinguish between these two explanations.
Finally, the interaction budget remains underspecified in public discourse. MT-JailBench’s[^3] observation that ASR partly measures allowed search depth implies that benchmarks must fix turn counts to be comparable. MultiBreak evaluates conversations of up to six turns, but the field has not converged on a standard.
What Safety Teams Should Do Differently
Safety teams currently treat single-turn refusal rate as the headline safety metric. MultiBreak shows this is insufficient. Teams should rebuild evaluation pipelines around multi-turn corpora, fix interaction budgets when comparing results, and report ASR with the judge model named explicitly alongside the victim. A table that shows LLaMA Guard and GPT-4o-mini side by side is more useful than either number alone.
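One way to enforce that reporting discipline is to make the judge and the turn budget part of every result record, and to refuse comparisons when they differ. The sketch below is a hypothetical design, not an existing tool; all names and numbers in it are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrReport:
    corpus: str
    victim: str
    judge: str
    max_turns: int
    asr_percent: float


def pp_gap(a: AsrReport, b: AsrReport) -> float:
    """Percentage-point gap, allowed only when victim, judge, and budget match."""
    if (a.victim, a.judge, a.max_turns) != (b.victim, b.judge, b.max_turns):
        raise ValueError("results differ in victim, judge, or turn budget")
    return round(a.asr_percent - b.asr_percent, 1)


# All names and numbers below are hypothetical placeholders.
new = AsrReport("corpus-new", "model-x", "judge-a", 6, 60.0)
old = AsrReport("corpus-old", "model-x", "judge-a", 6, 35.0)
other_judge = AsrReport("corpus-old", "model-x", "judge-b", 6, 20.0)

gap = pp_gap(new, old)

try:
    pp_gap(new, other_judge)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Raising an error instead of silently subtracting keeps a cross-judge number from ending up in a results table by accident.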
The benchmark also provides a template. Its active-learning pipeline (iterative fine-tuning against victim models, combined with uncertainty-based filtering) is a reproducible method for generating future corpora. Safety teams can adapt the same pipeline to their own models, generating adversarial datasets that reflect their specific deployment context rather than relying on generic public benchmarks.
The broader implication is for alignment claims. Any safety report that cites single-turn ASR as the primary defense metric is understating exposure. The number that matters is not how often a model refuses a direct request in isolation, but how often it refuses the same intent across a realistic conversation. MultiBreak makes that number measurable, and it is much lower than the single-turn figures suggest.
Frequently Asked Questions
How does MultiBreak differ from X-Teaming’s multi-turn approach?
X-Teaming deploys adaptive multi-agent teams that revise strategy based on live victim responses, achieving roughly 70% ASR on HarmBench. MultiBreak inverts this: it uses no real-time adaptation at all, yet reaches 83.3% on DeepSeek-R1-7B under LLaMA Guard. The two figures come from different benchmarks, so the comparison is indicative only, but it suggests that a well-constructed static corpus can match or exceed adaptive methods depending on the model-judge pair.
What turn count should teams standardize on for multi-turn safety evals?
No standard exists yet. MultiBreak evaluates up to six turns; MT-JailBench decomposes attacks into modular phases with variable lengths, and its core finding is that ASR scales with allowed search depth. A practical reporting convention is to publish refusal rates at multiple fixed depths (e.g., 3, 6, 10 turns) so stakeholders can see where the model’s defenses degrade rather than relying on a single headline number.
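The multi-depth convention can be computed from a single set of long conversations, provided each one records the turn at which the judge first flagged a response. This is a minimal sketch with hypothetical data; it reports ASR per depth, and the refusal rate at each depth is its complement.

```python
from typing import Optional


def asr_at_depths(
    first_break_turn: list[Optional[int]], depths: list[int]
) -> dict[int, float]:
    """For each depth d, the share of conversations broken at or before turn d."""
    total = len(first_break_turn)
    if total == 0:
        raise ValueError("no conversations provided")
    return {
        d: sum(1 for t in first_break_turn if t is not None and t <= d) / total
        for d in depths
    }


# Hypothetical data: 1-based turn at which the judge first flagged a response,
# or None if the model held through the longest run (10 turns here).
first_break = [2, None, 5, 9, None, 3, 6, None, 10, 4]
curve = asr_at_depths(first_break, depths=[3, 6, 10])
refusal_curve = {d: round(1 - asr, 2) for d, asr in curve.items()}
```

Publishing the whole curve shows where defenses start to degrade, which a single figure at one depth would hide.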
Could the open-source generators be exploiting shared training data with commercial targets?
The paper flags this possibility but does not test it. A concrete diagnostic would be to generate attacks from a model outside the typical English-web training lineage—for instance, a model trained predominantly on non-English corpora—and measure whether transfer rates to GPT-4.1-mini hold. If they drop sharply, the current results likely reflect memorized refusal bypasses shared across training distributions rather than reasoning-level vulnerabilities.
What happens if adaptive methods like Mastermind are combined with MultiBreak-scale corpora?
Mastermind already achieves 87% on HarmBench by reasoning over each victim response in real time, outperforming X-Teaming (roughly 70%), which is also adaptive. Scaling that interactive approach to MultiBreak’s 10,000+ prompt coverage would require generating each conversation on the fly rather than replaying a frozen corpus, raising compute costs substantially. The unresolved tradeoff is between reproducibility (static replay) and ecological validity (live adaptation), and no current benchmark spans both.