Safety alignment in large language models does not improve monotonically across releases. A preprint submitted 30 May 2026 shows that Google’s Gemma 3 (12B) is substantially more vulnerable to automated adversarial attacks than its predecessor Gemma 2, with misinformation attack success rates climbing from 29% to 99%. The finding challenges the operating assumption that each model generation is safer than the last, and it implies red teams cannot skip re-running prior attack suites against new checkpoints.
Automated red-teaming across four Gemma generations
The paper applies MAP-Elites, an evolutionary search algorithm, to adversarial prompt generation across four generations of Google’s open-weight Gemma family (models spanning 7B to 31B parameters). MAP-Elites explores a space of attack strategies by mutating and recombining prompt variants, selecting for those that elicit harmful outputs. It is adaptive by design: the attack strategy evolves based on what works against the specific model under test, rather than drawing from a fixed corpus of known jailbreaks.
The search budget is small enough to reproduce on a single workstation, which is part of the point. [Updated June 2026] Each run executed 200 generations over a 6×6×8 grid of 288 behavioral niches, with three seeds per model and harmful behaviors drawn from the HarmBench taxonomy. That is 12 evolution runs and 36 transfer-replay evaluations in total across the four generations. The author is Subhadip Mitra, working solo; this is not a frontier-lab red-team effort with a six-figure compute bill, which makes the size of the Gemma 3 regression it surfaced harder to dismiss as noise.
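The search loop described above can be sketched in a few lines. This is a minimal illustration of MAP-Elites on a toy numeric problem, not the paper's implementation: the grid shape and the 200-generation budget come from the article, while `toy_fitness`, `toy_descriptor`, `toy_mutate`, and the `batch` size are hypothetical stand-ins. In the red-teaming setting a candidate would be a prompt, the fitness a judge score, and the descriptor a mapping from prompt to behavioral niche. Recombination is omitted here for brevity; only mutation is shown.

```python
import random

GRID = (6, 6, 8)  # niche grid shape from the article: 6 * 6 * 8 = 288 cells


def map_elites(fitness, descriptor, mutate, seed_pool, generations=200, batch=8, rng=None):
    """Return an archive mapping niche index -> (score, candidate)."""
    if not seed_pool:
        raise ValueError("seed_pool must contain at least one candidate")
    rng = rng or random.Random(0)
    archive = {}

    def consider(candidate):
        niche = descriptor(candidate)
        score = fitness(candidate)
        # Keep the candidate only if its niche is empty or it beats the current elite.
        if niche not in archive or score > archive[niche][0]:
            archive[niche] = (score, candidate)

    for candidate in seed_pool:
        consider(candidate)

    for _ in range(generations):
        for _ in range(batch):
            # Select a random existing elite as the parent, then mutate it.
            _, parent = archive[rng.choice(list(archive))]
            consider(mutate(parent, rng))
    return archive


# --- Hypothetical toy problem: candidates are points in the unit cube. ---
def toy_descriptor(x):
    # Bin each coordinate into its grid axis; clamp 1.0 into the last bin.
    return tuple(min(int(v * n), n - 1) for v, n in zip(x, GRID))


def toy_fitness(x):
    # Higher is better; peaks at the centre of the cube.
    return -sum((v - 0.5) ** 2 for v in x)


def toy_mutate(x, rng):
    # Gaussian perturbation, clipped back into [0, 1].
    return tuple(min(max(v + rng.gauss(0, 0.1), 0.0), 1.0) for v in x)


if __name__ == "__main__":
    archive = map_elites(toy_fitness, toy_descriptor, toy_mutate, [(0.5, 0.5, 0.5)])
    print(f"filled {len(archive)} of {6 * 6 * 8} niches")
```

The archive keeps one elite per niche rather than a single global best, which is what lets the search report many distinct attack styles instead of converging on one.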
Attack success rate (ASR) serves as the primary metric, evaluated by a self-hosted judge, Llama-3.1-8B-Instruct, that classifies responses as compliant or non-compliant with safety guidelines. The author reports confidence intervals and paired bootstrap significance tests. The experiment is longitudinal: the same probing methodology is applied across successive model releases, enabling direct comparison rather than single-snapshot evaluation.
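For readers unfamiliar with the statistics, here is a rough sketch of an ASR comparison with a paired bootstrap. The paper's exact resampling unit and test construction are not described in this article, so this version assumes per-prompt binary outcomes for the same prompts on two models, a percentile interval, and a two-sided p-value; the function names are illustrative.

```python
import random


def asr(outcomes):
    """Attack success rate: fraction of 0/1 outcomes that are successes."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap(a, b, n_resamples=10_000, seed=0):
    """Compare ASR of model B against model A on the same prompts.

    a, b: equal-length lists of 0/1 outcomes, index i referring to the same prompt.
    Returns (observed difference, 95% percentile interval, two-sided p-value).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # Resample prompt indices with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(b[i] - a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(0.025 * (n_resamples - 1))]
    hi = diffs[int(0.975 * (n_resamples - 1))]
    # Two-sided p-value: how often the resampled difference reaches or crosses zero.
    frac_le_zero = sum(d <= 0 for d in diffs) / n_resamples
    frac_ge_zero = sum(d >= 0 for d in diffs) / n_resamples
    p_value = min(1.0, 2 * min(frac_le_zero, frac_ge_zero))
    return asr(b) - asr(a), (lo, hi), p_value
```

Note that an interval produced this way reflects sampling variance only. It says nothing about whether a different judge would have labeled the same responses differently.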
Gemma 3 regresses on safety
The headline result is a non-monotonic safety trajectory. Gemma 3 (12B) achieved a 68.7% (± 5.7%) overall ASR under MAP-Elites, compared to 45.5% (± 7.2%) for Gemma 2 (p = 0.030, paired bootstrap). Gemma 4 dropped back to 33.9% (± 1.8%). Safety did not accumulate across releases. It regressed in Gemma 3, then recovered in Gemma 4 to an overall ASR below Gemma 2’s.
The category-level breakdown sharpens the picture. Misinformation ASR jumped from 29% on Gemma 2 to 99% on Gemma 3, and remained elevated at 77% on Gemma 4. Copyright and cybercrime vulnerability sat near 100% ASR across all four generations under the 8B judge, though the author notes that a second-judge audit produced divergent results on copyright, making that finding sensitive to judge choice.
Just how sensitive is worth spelling out. [Updated June 2026] The second-judge audit ran Llama-3.1-70B-Instruct against the 8B judge on a 52-sample slice, and the two scorers agreed at Cohen’s κ of 0.15 on harm classification. That is barely above chance. A single LLM-as-judge is the load-bearing instrument behind every ASR figure in the paper, so weak inter-judge agreement is not a footnote; it widens the error bars on the whole table beyond the bootstrap intervals the author reports, which capture only sampling variance and not judge disagreement. The direction of the Gemma 3 regression is robust to this, since both judges rank that generation as the worst, but the precise magnitudes inherit the noise of whichever judge produced them. Judge reliability is its own active research problem, and a κ of 0.15 is a concrete illustration of why the field has not settled it.
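To make the κ figure concrete, the sketch below computes Cohen's κ for two judges labeling the same responses. The formula is the standard one, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each judge's label frequencies. The ten labels in the example are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and this sketch treats the case as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Illustrative labels only (1 = harmful, 0 = not harmful).
    judge_small = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
    judge_large = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
    print(round(cohens_kappa(judge_small, judge_large), 3))
```

In the toy example the two judges agree on 8 of 10 items, yet κ is well below 0.8, because both label most items harmful and much of that agreement would occur by chance. This is why raw agreement rates overstate judge reliability when one class dominates.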
Gemma 4 improved over Gemma 3 on most metrics. The takeaway is not that safety is uniformly degrading. It is that it is not reliably accumulating.
The paper stops short of a causal story for the Gemma 3 dip, but it points at one structural change. [Updated June 2026] Gemma 3 was the generation in which Google added vision input and stretched the context window, and the paper’s own recommendation is that “architectural changes (such as adding multimodal capabilities) should trigger comprehensive safety re-evaluation.” Expanding the input surface expands the attack surface, and the alignment data for a text-only predecessor does not automatically cover the new modes. That is a hypothesis the single-author study cannot confirm with four data points, but it is the kind of correlation that should make any team shipping a multimodal upgrade re-run its full red-team suite rather than inheriting the prior generation’s safety report.
Old attacks on new models
The paper also tested whether adversarial prompts evolved against older Gemma generations transfer to newer ones. Attacks evolved against earlier generations transferred to Gemma 3 at 44-46% ASR. The same attacks achieved only 14-18% ASR against Gemma 4, suggesting that Gemma 4’s safety improvements generalize beyond the specific attack distributions used during its alignment.
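A transfer-replay evaluation of this kind is simple to express in code. The sketch below is an assumption about the general shape of such a harness, not the paper's tooling: `generate` stands in for a call to the target model, `is_harmful` for the judge, and both are hypothetical callables the reader would supply.

```python
from typing import Callable, Dict, Iterable, Tuple


def replay_asr(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
) -> float:
    """Replay a fixed attack corpus against one target and return its ASR."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("attack corpus is empty")
    hits = sum(is_harmful(p, generate(p)) for p in prompts)
    return hits / len(prompts)


def transfer_matrix(
    corpora: Dict[str, Iterable[str]],
    targets: Dict[str, Callable[[str], str]],
    is_harmful: Callable[[str, str], bool],
) -> Dict[Tuple[str, str], float]:
    """ASR for every (source corpus, target model) pair where source != target."""
    return {
        (source, target): replay_asr(prompts, generate, is_harmful)
        for source, prompts in corpora.items()
        for target, generate in targets.items()
        if source != target
    }
```

Because the corpus is fixed, a replay costs one generation and one judge call per prompt, far less than a fresh evolutionary run. That makes it a cheap first check on every new checkpoint, though it only covers attacks already found.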
This is consistent with earlier work from llm-attacks.org, which demonstrated that adversarial suffix strings built against open-source LLMs transfer to closed-source models including ChatGPT, Bard, and Claude. That prior work noted that analogous adversarial attacks have resisted solutions in computer vision for a decade, and it remains unclear whether such attacks can ever be fully patched.
The threat model matters more for Gemma than for an API-gated model, because the weights ship. [Updated June 2026] Within weeks of Gemma 4’s release in early April 2026, the community published “abliterated” variants with the refusal direction surgically removed at the weight level, which is a far cheaper bypass than evolving an adversarial prompt. Transfer attacks and weight-level safety removal are separate vectors, but they compound: an open release that regresses on prompt-based ASR is also one whose safety layers can be stripped outright by editing the weights directly. For anyone deploying Gemma behind their own product, the prompt-transfer numbers are a floor on the exposure, not a ceiling.
What static benchmarks miss
The paper states that the safety regressions it found are invisible to standard safety benchmarks. The mechanism is straightforward: static benchmarks evaluate against a fixed, known set of harmful prompts. MAP-Elites adaptively searches for new failure modes by evolving prompts based on what actually elicits harmful outputs from the model under test. The same quality-diversity approach has been shown to surface whole classes of jailbreak that fixed red-team suites never reach, which is why a passing benchmark score and a hardened model are not the same claim.
A model hardened against a benchmark’s prompt set can still be vulnerable to novel adversarial variants that the benchmark does not contain. This is the same well-known failure mode that makes static test suites unreliable in traditional software security: they verify that known failure modes are handled, not that unknown ones are.
The implication for vendors and auditors is that a passing score on a standard safety benchmark is an upper bound on safety, not a measurement of it: the attack success rate a benchmark reports is only a floor on the model’s true vulnerability. A model can pass every item in a benchmark suite and still be substantially more vulnerable than its predecessor to attacks the suite does not cover.
Shared representations, not shared alignment failures
A complementary study (arXiv:2506.12913), spanning 20 open-weight models and 33 jailbreak attacks, provides a mechanistic explanation for why transfer works at all. The authors found that jailbreak transferability is driven by representational similarity between models and by attack strength on the source model, not by artifacts of safety training. Persona-style attacks, which instruct the model to adopt a specific character, transfer far more reliably than cipher-based attacks that attempt to obscure harmful intent through encoding.
This is a useful distinction. If transfer were driven by shared safety-training failures, it might be fixable by diversifying the alignment procedure. If it is driven by shared representations in the base model, the problem is structural: any two models that learn similar internal representations of the same concepts will share adversarial vulnerabilities, regardless of how their safety training differs.
The study does not just correlate the two; it manipulates them. [Updated June 2026] The authors increased the representational similarity between two models through benign-only distillation, using no adversarial or harmful data at all, and watched transfer rates rise as a result. That is causal evidence rather than an observed association, and it raises an uncomfortable implication for the current direction of the field. Distillation from a small number of strong teacher models is how much of the open-weight ecosystem is built, so the industry is, in effect, manufacturing the representational overlap that makes one model’s jailbreak another model’s jailbreak. Safety-data diversity does not touch that overlap, because the overlap lives in the benign representations the models were optimized to share.
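The article does not say which similarity measure the study uses, so the following is only an illustration of what "representational similarity" can mean in practice. Linear centered kernel alignment (CKA) is one widely used measure, defined as ||YᵀX||²_F / (||XᵀX||_F · ||YᵀY||_F) on column-centered activation matrices. The sketch assumes two models' activations for the same inputs, given as row-major lists of lists, and uses pure Python for portability.

```python
import math


def _center(matrix):
    """Subtract each column's mean (rows are inputs, columns are features)."""
    n_cols = len(matrix[0])
    means = [sum(row[j] for row in matrix) / len(matrix) for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]


def _cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with matching row counts."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(row_a[i] * row_b[j] for row_a, row_b in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices computed on the same inputs.

    Returns a value in [0, 1]; 1 means the representations match up to
    rotation and uniform scaling. Raises ZeroDivisionError if either
    matrix has no variance.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must have the same non-zero number of rows")
    xc, yc = _center(x), _center(y)
    numerator = _cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(_cross_frobenius_sq(xc, xc) * _cross_frobenius_sq(yc, yc))
    return numerator / denominator
```

A measure like this is computed entirely on benign inputs, which is consistent with the study's point: two models can look highly similar without any adversarial or safety data entering the comparison.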
The regression-testing gap
The practical consequence is that every new model checkpoint requires a full re-run of the prior attack corpus, plus adaptive probing for novel failure modes. Treating safety as a one-way ratchet, where each release builds on the last, produces false confidence.
The paper’s own asks are procedural rather than technical. [Updated June 2026] It advocates that “model developers publish generational safety analyses alongside capability benchmarks” and that “the research community develop standardized protocols for cross-version safety comparison.” Both are reasonable and neither exists today. Vendors publish capability deltas between releases as a matter of course; safety deltas, when they appear at all, tend to be qualitative assurances rather than the same metric run against the same attack suite on the old and new checkpoint. Until a model card carries a like-for-like ASR comparison against its predecessor, a buyer cannot tell whether a new release fixed the previous one’s failures or simply moved them. The standard caveat applies as well: this is a single-author preprint with four data points per trend, not a peer-reviewed, multi-lab result, and the κ of 0.15 between judges means the magnitudes deserve more skepticism than the direction.
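A like-for-like comparison of the kind described above could be as small as the following sketch. It takes per-category ASR for the previous and current checkpoint, measured with the same attack suite and judge, and flags categories that got worse. The function name and the `tolerance` threshold are hypothetical; the example figures are the ones quoted in this article.

```python
def asr_regressions(previous, current, tolerance=0.05):
    """Return categories whose ASR rose by more than `tolerance`.

    previous, current: dicts mapping category name -> ASR in [0, 1],
    measured with the same attack suite and the same judge.
    Only categories present in both reports are compared.
    """
    shared = previous.keys() & current.keys()
    return {
        category: round(current[category] - previous[category], 4)
        for category in sorted(shared)
        if current[category] - previous[category] > tolerance
    }


if __name__ == "__main__":
    gemma2 = {"overall": 0.455, "misinformation": 0.29}
    gemma3 = {"overall": 0.687, "misinformation": 0.99}
    gemma4 = {"overall": 0.339, "misinformation": 0.77}
    print(asr_regressions(gemma2, gemma3))  # both categories regress
    print(asr_regressions(gemma3, gemma4))  # nothing regresses relative to Gemma 3
    print(asr_regressions(gemma2, gemma4))  # misinformation is still worse than Gemma 2
```

The third comparison shows why the baseline matters: a release can improve on its immediate predecessor while remaining worse than an older one in a specific category.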
A separate evaluation of 10 open-source LLMs across 94 prompt-injection and 73 jailbreak scenarios (arXiv:2602.22242) found that lightweight inference-time defences handle straightforward attacks but are consistently bypassed by long, reasoning-heavy prompts. Layered defences help; they do not close the gap.
An ICML 2026 position paper raises a related structural concern: model-level governance becomes less effective when capability progress is driven by non-model gains such as inference scaling, systems-level orchestration, and data-asset improvements. Risk management that relies primarily on pre-deployment evaluation of a single model checkpoint misses the ways in which the surrounding system can amplify or suppress that model’s failure modes.
Taken together, these findings describe a testing regime that is more expensive than the industry has been operating under. Adaptive longitudinal probing is not optional if the goal is to catch regressions that static benchmarks miss. Cross-generation attack transfer means the attack corpus grows over time, not shrinks. And if transferability is rooted in shared representations rather than alignment artifacts, it cannot be engineered away by adjusting the safety-training recipe alone.
Frequently Asked Questions
Do these safety regression findings apply to closed-source models like GPT or Claude?
The paper tested only Google’s Gemma family (7B-31B parameters), but a separate 20-model transferability study (arXiv:2506.12913) found that representational similarity drives jailbreak transfer across vendors, not just within a single model line. The mechanism is vendor-agnostic, yet no published longitudinal study has tracked non-monotonic safety across GPT or Claude generations, so the pattern remains empirically unconfirmed outside Gemma.
What would a proper cross-generation regression test cost in practice?
The ICML 2026 position paper (arXiv:2606.00047) identifies three categories of non-model capability gains that undermine checkpoint-only evaluation: inference scaling, systems-level orchestration, and data-asset improvements. A regression suite that accounts for all three must re-probe not just the base model but the full deployment stack each time any component changes, multiplying the test surface beyond what single-checkpoint safety reports cover and pushing cost well above a benchmark rerun.
Why do persona-style jailbreaks transfer across models but cipher-based ones do not?
Persona attacks instruct the model to adopt a character, tapping into conversational representations that are structurally similar across most instruction-tuned models. Cipher-based attacks encode harmful intent through character-level substitutions that are representation-specific: a cipher that confuses one tokenizer will likely fail on another. The 20-model study found this gap is substantial, not marginal.
Could the 29% to 99% misinformation spike be a measurement artifact from the judge?
The author flags the copyright finding as judge-sensitive after the second-judge audit, but does not re-audit misinformation with a different judge. Any single-judge ASR measurement carries the same structural risk: a judge misaligned on a category will over- or under-count compliant responses. That risk is not hypothetical here, since the audit found the 8B and 70B judges agreed at only Cohen’s κ of 0.15 on harm classification, barely above chance. The misinformation spike may well be real, and both judges rank Gemma 3 as the worst generation, but the exact 29%-to-99% jump has not been cross-validated against a second judge the way the copyright numbers were.