Evolving jailbreaks instead of finding them
Most LLM red-teaming works by throwing a fixed set of attacks at a model and checking what sticks. A paper accepted at the ICLR 2026 Workshop on Agents in the Wild tries the opposite: instead of searching for one high-success exploit, it evolves a diverse population of them. The result is a map of structurally distinct failure modes across four production models, not a single attack score. If the approach generalizes, it exposes a coverage gap in how vendors certify models as “safety-tested.”
The mechanism: MAP-Elites for adversarial search
The paper applies MAP-Elites, a quality-diversity evolutionary algorithm, to LLM jailbreak generation. Rather than optimizing for a single attack with the highest success rate, the algorithm maintains an archive of attacks spread across behavioral dimensions: strategy type (direct, hypothetical, multi-turn framing), encoding method (plain text, ROT13, Leetspeak), and prompt length. Each cell in the archive holds the fittest attack discovered for that combination of traits.
This matters because of what it produces: an interpretable taxonomy of attack classes, not a bag of token-level perturbations. The attacks are semantically readable strategies, which makes them auditable in a way that gradient-based adversarial suffixes are not.
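The archive mechanics described above can be sketched in a few lines. This is a minimal illustration of generic MAP-Elites, not the paper's implementation: the `Candidate` fields, the length-bin thresholds, the `mutate` operator, and especially `toy_fitness` (a random placeholder standing in for a real target model plus judge) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

STRATEGIES = ("direct", "hypothetical", "multi_turn")
ENCODINGS = ("plain", "rot13", "leetspeak")


@dataclass(frozen=True)
class Candidate:
    """Abstract description of one attack; no prompt text is generated here."""
    strategy: str
    encoding: str
    n_tokens: int


def length_bin(n_tokens: int) -> str:
    # Assumed thresholds; the source does not specify how length is binned.
    if n_tokens < 50:
        return "short"
    if n_tokens < 200:
        return "medium"
    return "long"


def descriptor(c: Candidate) -> tuple:
    """Map a candidate to its cell in the behavioral grid."""
    return (c.strategy, c.encoding, length_bin(c.n_tokens))


def toy_fitness(c: Candidate, rng: random.Random) -> float:
    # Placeholder score in [0, 1). A real harness would query the target
    # model and score the response with a judge; that part is not shown.
    return rng.random()


def mutate(c: Candidate, rng: random.Random) -> Candidate:
    """Change one trait at random (hypothetical mutation operator)."""
    field = rng.choice(("strategy", "encoding", "n_tokens"))
    if field == "strategy":
        return Candidate(rng.choice(STRATEGIES), c.encoding, c.n_tokens)
    if field == "encoding":
        return Candidate(c.strategy, rng.choice(ENCODINGS), c.n_tokens)
    return Candidate(c.strategy, c.encoding, max(1, c.n_tokens + rng.randint(-60, 60)))


def map_elites(iterations: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, candidate)
    for _ in range(iterations):
        if archive:
            # Pick a random existing elite as the parent and mutate it.
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            # Bootstrap the archive with a random candidate.
            child = Candidate(rng.choice(STRATEGIES), rng.choice(ENCODINGS), rng.randint(1, 300))
        fit = toy_fitness(child, rng)
        cell = descriptor(child)
        # Keep the child only if its cell is empty or it beats the current elite.
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, child)
    return archive


archive = map_elites(iterations=500)
print(f"{len(archive)} of 27 cells filled")
```

The design point is the replacement rule: a candidate competes only against the current occupant of its own cell, so a strong attack in one cell never displaces a weaker attack in a different one.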
Why existing red-team approaches leave gaps
The paper identifies three specific coverage problems in current adversarial testing:
- Manual red-teaming does not scale. Human testers can reason about novel attack vectors, but the combinatorial space of model behavior grows faster than any team can probe it.
- LLM-as-attacker methods exhibit mode collapse. When one LLM is prompted to attack another, it tends to converge on a small number of high-success strategies, leaving swaths of the vulnerability space unexplored.
- Gradient-based approaches produce uninterpretable outputs. Token-level perturbations that break a model's safeguards but look like gibberish to a human reviewer are hard to reason about as failure modes.
The quality-diversity approach is designed to address all three: it scales via automation, resists convergence by keeping a separate elite for each behavioral cell instead of a single global best, and operates at the semantic level so that every attack in the archive is human-readable.
Model-specific vulnerability profiles
The framework was tested on GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and Devstral-small-2. The results diverge sharply by model:
| Model | Highest-fitness attack type | Max fitness |
|---|---|---|
| GPT-4o-mini | Hypothetical + multi-turn framing, ROT13 encoding | 0.8 |
| Gemini 2.0 Flash | Direct attacks with ROT13; multi-turn with Leetspeak | 0.8 |
| Claude 3.5 Sonnet | No clear weakness; uniformly ambiguous responses | 0.4 |
| Devstral-small-2 | (Detailed results not in the abstract) | — |
The divergent profiles are themselves the finding. GPT-4o-mini and Gemini 2.0 Flash fall to structurally different attack classes at the same maximum fitness (0.8), which means a safety benchmark calibrated against one model’s weaknesses would not reliably predict another’s. A fixed test suite that catches hypothetical-framing attacks on GPT-4o-mini says nothing about direct-request vulnerabilities on Gemini 2.0 Flash.
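To make the divergence concrete, the two 0.8 rows of the table can be compared as sets of (strategy, encoding) cells. The cell sets below are one reading of the table's wording, not data taken from the paper, and the `jaccard` and `project` helpers are illustrative names.

```python
# Assumed reading of the table rows as (strategy, encoding) cells.
TOP_CELLS = {
    "GPT-4o-mini": {("hypothetical", "rot13"), ("multi_turn", "rot13")},
    "Gemini 2.0 Flash": {("direct", "rot13"), ("multi_turn", "leetspeak")},
}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def project(cells: set, axis: int) -> set:
    """Collapse cells onto a single axis (0 = strategy, 1 = encoding)."""
    return {cell[axis] for cell in cells}


gpt = TOP_CELLS["GPT-4o-mini"]
gemini = TOP_CELLS["Gemini 2.0 Flash"]

cell_overlap = jaccard(gpt, gemini)
strategy_overlap = jaccard(project(gpt, 0), project(gemini, 0))
encoding_overlap = jaccard(project(gpt, 1), project(gemini, 1))

print(f"cell overlap:     {cell_overlap:.2f}")
print(f"strategy overlap: {strategy_overlap:.2f}")
print(f"encoding overlap: {encoding_overlap:.2f}")
```

Under this reading the two models share a strategy (multi-turn) and an encoding (ROT13) when each axis is viewed alone, yet share no full cell. That is the sense in which a suite tuned to one model's top cells can miss the other's entirely.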
From single exploits to attack populations
Prior work supports the diversity argument. A November 2024 study (arXiv:2411.04223) found that diversity-based jailbreak techniques achieved up to 62.83% higher success rates against ten leading chatbots while using only 12.9% of the queries, suggesting that existing safety training may mask vulnerabilities rather than eliminate them.
The MAP-Elites approach extends that logic from “diversity helps” to “here is a systematic method for mapping the diversity.” The archive structure means that every time the algorithm runs, it fills in cells across the behavioral grid rather than re-discovering the same high-success attack. Over iterations, the coverage map grows, and previously unknown failure modes surface.
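Growth of the coverage map can be tracked with two numbers commonly reported for quality-diversity archives: coverage (the fraction of cells filled) and QD-score (the sum of elite fitness over filled cells). The sketch below assumes a 3×3×3 grid and uses made-up snapshot values purely for illustration; whether the paper reports these exact metrics is not stated in the source.

```python
from math import prod

# Assumed grid axes; the length bins are hypothetical.
DIMS = (
    ("direct", "hypothetical", "multi_turn"),
    ("plain", "rot13", "leetspeak"),
    ("short", "medium", "long"),
)


def coverage(archive: dict, dims: tuple) -> float:
    """Fraction of grid cells that currently hold an elite."""
    return len(archive) / prod(len(axis) for axis in dims)


def qd_score(archive: dict) -> float:
    """Sum of elite fitness values across all filled cells."""
    return sum(archive.values())


# Illustrative archive snapshots (cell -> best fitness); values are invented.
snapshots = [
    {("direct", "plain", "short"): 0.2},
    {("direct", "plain", "short"): 0.4,
     ("hypothetical", "rot13", "medium"): 0.6},
    {("direct", "plain", "short"): 0.4,
     ("hypothetical", "rot13", "medium"): 0.8,
     ("multi_turn", "leetspeak", "long"): 0.6},
]

for generation, archive in enumerate(snapshots):
    print(generation, round(coverage(archive, DIMS), 3), round(qd_score(archive), 2))
```

Coverage rises only when a new cell is filled, while QD-score also rises when an existing cell's elite improves, so the two together separate "found something new" from "got better at something known."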
What this means for safety evaluation frameworks
Current safety evaluation frameworks, including the NIST AI Risk Management Framework, OWASP GenAI Red Teaming Guide, MITRE ATLAS, and CSA Agentic AI Red Teaming, rely on test suites of known attack patterns. The QD-evolution approach does not invalidate these suites, but it exposes a property they lack: none of them explicitly optimize for coverage of the space of possible attack classes.
The authors’ own framing is that passing a fixed benchmark provides limited assurance about a model’s exposure to the next evolved class of attack. That is an inferential bridge from “finds diverse attacks” to “benchmarks understate risk,” not a directly measured downstream result, and the distinction matters. The paper demonstrates that QD evolution surfaces attack classes that static suites miss; it does not measure whether those missed attacks correspond to real-world exploitation patterns or whether closing the coverage gap reduces deployment risk. The operational takeaway is narrower but still consequential: safety evaluation pipelines that do not incorporate diversity-aware adversarial search have a coverage gap of unknown size.
For practitioners building internal safety audits or evaluating vendor claims, the practical implication is to ask not just “did the model pass the red-team suite?” but “what portion of the attack space did the suite cover?” The MAP-Elites archive provides one way to start answering that second question. Whether coverage translates to real-world safety improvement remains an open research question.
Frequently Asked Questions
How does MAP-Elites differ from the Zhao et al. diversity-based jailbreak study?
Zhao et al. (arXiv:2411.04223) demonstrated higher attack success rates through diversity but did not use MAP-Elites and did not produce a structured, interpretable taxonomy of attack classes. Their finding was empirical: diversity helps. This paper provides a repeatable mechanism that fills distinct behavioral cells in a grid (strategy type, encoding, prompt length) rather than re-discovering the same high-success exploit across runs.
The fitness scores top out at 0.8. Does that represent a hard safety limit on these models?
The 0.8 ceiling reflects the specific harm categories the authors selected for evaluation, not a fundamental bound on what the models will produce. Different harm categories could shift which behavioral cells yield high fitness and could change the inter-model ranking entirely. A model that scores 0.4 under one harm taxonomy might score substantially higher under another, which is why the paper’s own authors frame their results as coverage findings rather than robustness certifications.
Do these vulnerability profiles transfer to newer models like Claude 4 or GPT-5?
Not reliably. The profiles are pinned to the tested versions (GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, Devstral-small-2). Newer models receive different safety training, which can shift which cells in the MAP-Elites archive produce high fitness. The durable contribution is the methodology: run the evolutionary search against whichever model you are deploying, because the vulnerability map is model-specific and should not be assumed to carry across version boundaries.
How would a team add quality-diversity search to a safety audit built on NIST AI RMF or OWASP?
Both frameworks define catalogs of known attack patterns to test against, but neither specifies how to measure whether the catalog covers enough of the vulnerability space. A team could run MAP-Elites as a coverage probe alongside its existing suite: tag each static-suite attack with its cell in the behavioral grid, then compare against the evolved archive. Cells the suite never touches are untested attack combinations, and those where the archive holds a high-fitness attack are the most urgent. The tradeoff is compute: the evolutionary loop requires many rounds of attack generation and model scoring, which is more expensive than running a fixed checklist but cheaper than a sustained manual red-team engagement covering the same behavioral dimensions.
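One way to picture the coverage probe is as a set comparison between the cells a static suite exercises and the cells an evolved archive fills. The sketch assumes each suite attack has already been tagged with a (strategy, encoding) cell, which in practice would need a labeling step not shown here; `coverage_report` and all sample values are hypothetical.

```python
from itertools import product

# Assumed two-axis grid for brevity (prompt length omitted).
DIMS = (
    ("direct", "hypothetical", "multi_turn"),
    ("plain", "rot13", "leetspeak"),
)


def coverage_report(dims, suite_cells, archive):
    """Compare static-suite cells against an evolved archive (cell -> fitness)."""
    grid = set(product(*dims))
    tested = set(suite_cells) & grid
    evolved = set(archive) & grid
    gap = evolved - tested
    return {
        # Share of the grid the static suite exercises at all.
        "suite_coverage": len(tested) / len(grid),
        # Cells with a known evolved attack but no suite test, worst first.
        "untested_with_known_attack": sorted(gap, key=lambda c: -archive[c]),
        # Cells neither the suite nor the search has reached.
        "unexplored": len(grid - tested - evolved),
    }


# Invented example data: suite tags and archive fitness values.
suite_cells = [("direct", "plain"), ("hypothetical", "plain"), ("direct", "plain")]
archive = {
    ("direct", "plain"): 0.3,
    ("hypothetical", "rot13"): 0.8,
    ("multi_turn", "leetspeak"): 0.6,
}

report = coverage_report(DIMS, suite_cells, archive)
for key, value in report.items():
    print(key, value)
```

The report answers the question "what portion of the attack space did the suite cover?" only relative to the chosen grid; a different choice of behavioral dimensions would give a different number.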