Newer LLMs Aren't Always Safer: Adversarial Attacks Transfer Across Model Generations

Gemma 3 is more vulnerable to adversarial attacks than Gemma 2, with the attack success rate for misinformation leaping from 29% to 99%. Safety does not reliably accumulate across model releases.

Safety alignment in large language models does not improve monotonically across releases. A preprint submitted 30 May 2026 shows that Google’s Gemma 3 (12B) is substantially more vulnerable to automated adversarial attacks than its predecessor Gemma 2, with misinformation attack success rates climbing from 29% to 99%. The finding challenges the operating assumption that each model generation is safer than the last, and it implies red teams cannot skip re-running prior attack suites against new checkpoints.

Automated red-teaming across four Gemma generations

The paper applies MAP-Elites, an evolutionary search algorithm, to adversarial prompt generation across four generations of Google’s open-weight Gemma family (models spanning 7B to 31B parameters). MAP-Elites explores a space of attack strategies by mutating and recombining prompt variants, selecting for those that elicit harmful outputs. It is adaptive by design: the attack strategy evolves based on what works against the specific model under test, rather than drawing from a fixed corpus of known jailbreaks.
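The archive-and-mutate loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the behaviour axes (`STRATEGIES`, `LENGTH_BINS`), the `mutate` operator, and `toy_fitness` are all hypothetical stand-ins, and the sketch uses mutation only, without the recombination step the paper also describes. In a real harness, the fitness function would be the judge's score of the target model's response.

```python
import random

STRATEGIES = ["persona", "hypothetical", "translation"]  # hypothetical behaviour axis 1
LENGTH_BINS = 3  # hypothetical behaviour axis 2: short / medium / long


def descriptor(prompt):
    """Map a candidate to its archive cell: (strategy, length bin)."""
    strategy, words = prompt
    return strategy, min(len(words) // 4, LENGTH_BINS - 1)


def mutate(prompt, rng):
    """Randomly switch strategy, or add/remove one filler token."""
    strategy, words = prompt
    words = list(words)
    roll = rng.random()
    if roll < 0.3:
        strategy = rng.choice(STRATEGIES)
    elif roll < 0.7 or len(words) <= 1:
        words.append(rng.choice(["alpha", "beta", "gamma"]))
    else:
        words.pop(rng.randrange(len(words)))
    return strategy, tuple(words)


def toy_fitness(prompt):
    """Stand-in for 'judge score of the target model's response'."""
    strategy, words = prompt
    return len(set(words)) + (2 if strategy == "persona" else 0)


def map_elites(fitness, iterations=500, seed=0):
    """Return an archive mapping each visited cell to its best (score, prompt)."""
    rng = random.Random(seed)
    start = ("hypothetical", ("alpha",))
    archive = {descriptor(start): (fitness(start), start)}
    for _ in range(iterations):
        # Pick an existing elite, mutate it, and score the child.
        _, parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        cell, score = descriptor(child), fitness(child)
        # Keep the child only if its cell is empty or it beats the incumbent.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive
```

The point of keeping one elite per cell, rather than a single global best, is that the search returns a spread of qualitatively different attacks instead of many variants of one.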

Attack success rate (ASR) serves as the primary metric, evaluated by a self-hosted 8B-parameter judge model that classifies responses as compliant or non-compliant with safety guidelines. The authors report confidence intervals and paired bootstrap significance tests. The experiment is longitudinal: the same probing methodology is applied across successive model releases, enabling direct comparison rather than single-snapshot evaluation.
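As a rough illustration of the statistics involved, the sketch below computes ASR from binary judge verdicts and runs a paired bootstrap on two models evaluated with the same prompts. The article does not say exactly how the authors construct their intervals or p-values, so the percentile interval and the two-sided p-value here are one common construction, assumed for illustration; the function names are hypothetical.

```python
import random


def asr(outcomes):
    """Attack success rate: fraction of attempts judged successful (1) vs not (0)."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap(a, b, resamples=10_000, seed=0):
    """Resample prompt indices jointly; return (observed diff, 95% CI, two-sided p)."""
    assert len(a) == len(b), "paired test needs the same prompts for both models"
    rng = random.Random(seed)
    n = len(a)
    observed = asr(b) - asr(a)
    diffs = []
    for _ in range(resamples):
        # The same resampled indices are used for both models: that is the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(b[i] - a[i] for i in idx) / n)
    diffs.sort()
    ci = (diffs[int(0.025 * resamples)], diffs[int(0.975 * resamples) - 1])
    # Two-sided p-value: how often the resampled difference lands on either side of zero.
    below = sum(d <= 0 for d in diffs) / resamples
    above = sum(d >= 0 for d in diffs) / resamples
    p = min(1.0, 2 * min(below, above))
    return observed, ci, p
```

Pairing matters because both models face the same prompts: resampling the indices jointly removes prompt-difficulty variance that would otherwise widen the interval.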

Gemma 3 regresses on safety

The headline result is a non-monotonic safety trajectory. Gemma 3 (12B) achieved a 68.7% (± 5.7%) overall ASR under MAP-Elites, compared to 45.5% (± 7.2%) for Gemma 2 (p = 0.030, paired bootstrap). Gemma 4 dropped back to 33.9% (± 1.8%). Safety did not accumulate across releases. It regressed, then partially recovered.

The category-level breakdown sharpens the picture. Misinformation ASR jumped from 29% on Gemma 2 to 99% on Gemma 3, and remained elevated at 77% on Gemma 4. Copyright and cybercrime vulnerability sat near 100% ASR across all four generations under the 8B judge, though the authors note that a second-judge audit produced divergent results on copyright, making that finding sensitive to judge choice.

Gemma 4 improved over Gemma 3 on most metrics. The takeaway is not that safety is uniformly degrading, but that it does not reliably accumulate.

Old attacks on new models

The paper also tested whether adversarial prompts evolved against older Gemma generations transfer to newer ones. Attacks evolved against earlier generations transferred to Gemma 3 at 44-46% ASR. The same attacks achieved only 14-18% ASR against Gemma 4, suggesting that Gemma 4’s safety improvements generalize beyond the specific attack distributions used during its alignment.

This is consistent with earlier work from llm-attacks.org, which demonstrated that adversarial suffix strings built against open-source LLMs transfer to closed-source models including ChatGPT, Bard, and Claude. That prior work noted that analogous adversarial attacks have resisted solutions in computer vision for a decade, and it remains unclear whether such attacks can ever be fully patched.

What static benchmarks miss

The paper states that the safety regressions it found are invisible to standard safety benchmarks. The mechanism is straightforward: static benchmarks evaluate against a fixed, known set of harmful prompts. MAP-Elites adaptively searches for new failure modes by evolving prompts based on what actually elicits harmful outputs from the model under test.

A model hardened against a benchmark’s prompt set can still be vulnerable to novel adversarial variants that the benchmark does not contain. This is the same well-known failure mode that makes static test suites unreliable in traditional software security: they verify that known failure modes are handled, not that unknown ones are.

The implication for vendors and auditors is that a standard safety benchmark provides a lower bound on vulnerability, not a measurement of safety. A model can pass every item in a benchmark suite and still be substantially more vulnerable than its predecessor to attacks the suite does not cover.

Shared representations, not shared alignment failures

A complementary study (arXiv:2506.12913), spanning 20 open-weight models and 33 jailbreak attacks, provides a mechanistic explanation for why transfer works at all. The authors found that jailbreak transferability is driven by representational similarity between models and by attack strength on the source model, not by artifacts of safety training. Persona-style attacks, which instruct the model to adopt a specific character, transfer far more reliably than cipher-based attacks that attempt to obscure harmful intent through encoding.

This is a useful distinction. If transfer were driven by shared safety-training failures, it might be fixable by diversifying the alignment procedure. If it is driven by shared representations in the base model, the problem is structural: any two models that learn similar internal representations of the same concepts will share adversarial vulnerabilities, regardless of how their safety training differs.
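To make "representational similarity" concrete, here is a minimal sketch of linear centered kernel alignment (CKA), a widely used similarity score between two models' activations on the same prompts. This is an assumption for illustration: the article does not state which similarity measure the 20-model study used, and the function names are hypothetical. Inputs are plain lists of activation vectors, one row per prompt.

```python
import math


def center_columns(rows):
    """Subtract each feature's mean across prompts."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def gram(rows):
    """Prompt-by-prompt inner products (an n x n matrix)."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in rows] for r1 in rows]


def frobenius_inner(a, b):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(acts_a, acts_b):
    """Linear CKA between two activation matrices with one row per prompt.

    The two models may have different hidden sizes; only the number of
    prompts (rows) has to match.
    """
    ka = gram(center_columns(acts_a))
    kb = gram(center_columns(acts_b))
    return frobenius_inner(ka, kb) / math.sqrt(
        frobenius_inner(ka, ka) * frobenius_inner(kb, kb)
    )
```

The score is 1 when one set of activations is a rescaled or rotated copy of the other and falls toward 0 as the two models organize the same prompts differently.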

The regression-testing gap

The practical consequence is that every new model checkpoint requires a full re-run of the prior attack corpus, plus adaptive probing for novel failure modes. Treating safety as a one-way ratchet, where each release builds on the last, produces false confidence.
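The replay half of that requirement can be sketched as a simple regression gate. Everything here is hypothetical scaffolding: `generate_old` and `generate_new` stand in for calls to the two checkpoints, `judge` stands in for whatever classifier decides that an attack succeeded, and the 5-point `tolerance` is an arbitrary illustrative threshold. The adaptive-probing half is not covered.

```python
def replay_asr(prompts, generate, judge):
    """Fraction of archived attack prompts that still succeed against a model."""
    hits = sum(1 for p in prompts if judge(p, generate(p)))
    return hits / len(prompts)


def regression_report(corpus, generate_old, generate_new, judge, tolerance=0.05):
    """Per-category ASR for old vs new checkpoint, flagging categories that got worse.

    corpus maps a harm category to the list of attack prompts archived for it.
    """
    report = {}
    for category, prompts in corpus.items():
        old = replay_asr(prompts, generate_old, judge)
        new = replay_asr(prompts, generate_new, judge)
        report[category] = {
            "old": old,
            "new": new,
            "regressed": new - old > tolerance,
        }
    return report
```

Reporting per category rather than one overall number matters here, since an aggregate ASR can hide a sharp regression in a single category such as misinformation.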

A separate evaluation of 10 open-source LLMs across 94 prompt-injection and 73 jailbreak scenarios (arXiv:2602.22242) found that lightweight inference-time defences handle straightforward attacks but are consistently bypassed by long, reasoning-heavy prompts. Layered defences help; they do not close the gap.

An ICML 2026 position paper raises a related structural concern: model-level governance becomes less effective when capability progress is driven by non-model gains such as inference scaling, systems-level orchestration, and data-asset improvements. Risk management that relies primarily on pre-deployment evaluation of a single model checkpoint misses the ways in which the surrounding system can amplify or suppress that model’s failure modes.

Taken together, these findings describe a testing regime that is more expensive than the one the industry has been operating under. Adaptive longitudinal probing is not optional if the goal is to catch regressions that static benchmarks miss. Cross-generation attack transfer means the attack corpus grows over time rather than shrinking. And if transferability is rooted in shared representations rather than alignment artifacts, it cannot be engineered away by adjusting the safety-training recipe alone.

Frequently Asked Questions

Do these safety regression findings apply to closed-source models like GPT or Claude?

The paper tested only Google’s Gemma family (7B-31B parameters), but a separate 20-model transferability study (arXiv:2506.12913) found that representational similarity drives jailbreak transfer across vendors, not just within a single model line. The mechanism is vendor-agnostic, yet no published longitudinal study has tracked non-monotonic safety across GPT or Claude generations, so the pattern remains empirically unconfirmed outside Gemma.

What would a proper cross-generation regression test cost in practice?

The ICML 2026 position paper (arXiv:2606.00047) identifies three categories of non-model capability gains that undermine checkpoint-only evaluation: inference scaling, systems-level orchestration, and data-asset improvements. A regression suite that accounts for all three must re-probe not just the base model but the full deployment stack each time any component changes, multiplying the test surface beyond what single-checkpoint safety reports cover and pushing cost well above a benchmark rerun.

Why do persona-style jailbreaks transfer across models but cipher-based ones do not?

Persona attacks instruct the model to adopt a character, tapping into conversational representations that are structurally similar across most instruction-tuned models. Cipher-based attacks encode harmful intent through character-level substitutions that are representation-specific: a cipher that confuses one tokenizer will likely fail on another. The 20-model study found this gap is substantial, not marginal.

Could the 29% to 99% misinformation spike be a measurement artifact from the judge?

The authors flag the copyright finding as judge-sensitive after their second-judge audit, but they do not re-audit misinformation with a different judge. Any single-judge ASR measurement carries the same structural risk: a judge misaligned on a category will over- or under-count successful attacks. The misinformation spike may be real, but it has not been cross-validated against a second judge, unlike the copyright numbers.
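One standard way to quantify how much two judges disagree is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration of that general technique, not the audit procedure the authors used, which the article does not describe; labels are assumed to be per-response verdicts such as 1 (attack succeeded) and 0 (it did not).

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two judges over the same responses."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Agreement expected if each judge labelled at random with its own base rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both judges used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 on a category suggests its ASR is robust to judge choice; a kappa near 0 means the two judges agree no more often than chance, so a headline figure for that category mostly reflects the judge.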

sources · 5 cited

  1. Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs (primary; accessed 2026-06-02)
  2. Universal and Transferable Attacks on Aligned Language Models (primary; accessed 2026-06-02)
  3. Jailbreak Transferability Emerges from Shared Representations (primary; accessed 2026-06-02)
  4. Analysis of LLMs Against Prompt Injection and Jailbreak Attacks (analysis; accessed 2026-06-02)
  5. Comprehensive AI governance requires addressing non-model gains (analysis; accessed 2026-06-02)