groundy
ethics, policy & safety

Why Fine-Tuning Strips Safety Alignment From Open-Weight LLMs

Safety alignment in open-weight LLMs is concentrated in a handful of output tokens. Benign fine-tuning erases them, making release-time safety evaluations unreliable.

7 min · · · 3 sources ↓

Safety alignment in open-weight LLMs does not survive fine-tuning, even when the fine-tuning data is clean, harmless, and carefully curated. PACT (Preserving Safety Alignment via Constrained Tokens), updated June 3 with its v2 revision, shows why: safety behavior is concentrated in a small set of output tokens, and ordinary fine-tuning destabilizes those tokens regardless of what the fine-tuning data contains.

Safety lives in a handful of tokens

The PACT paper’s core finding is mechanistic. Safety-aligned LLMs do not distribute their refusal behavior evenly across the vocabulary. Instead, a small subset of “safety tokens” carries disproportionate confidence when the model is refusing a harmful request. These are the tokens whose probability mass shifts most between a safety-trained model and its unaligned base version.

When someone fine-tunes the model on downstream task data, the gradient updates do not discriminate between safety-critical parameters and task-relevant ones. The safety tokens’ confidence distributions get overwritten as collateral damage. The model has not been “taught to be unsafe” in any deliberate sense; the safety signal was simply in the way, and fine-tuning ran over it.

This is a narrow-concentration problem, not a whole-model problem. PACT’s contribution is showing that the concentration is tight enough to target directly.

Benign data is enough

Prior work established that fine-tuning can strip alignment. Qi et al. (2023) demonstrated that fine-tuning GPT-3.5 Turbo on benign Alpaca data was sufficient to jailbreak the model. Lermen et al. (2024) showed that Low-Rank Adaptation (LoRA) can undo safety training for under $200 in compute costs.

The newer results push further. The OpenReview study “Picky LLMs and Unreliable RMs” identifies three specific mechanisms that drive safety degradation during benign fine-tuning: changes to answer structure, identity calibration drift, and role-play effects. None of these require adversarial data. The fine-tuning corpus can be entirely benign; the degradation comes from how fine-tuning reshapes the model’s output distribution, not from what the training examples contain.

The problem extends beyond text. A separate study on audio LLMs found that benign fine-tuning elevated Jailbreak Success Rate (JSR) from single-digit percentages to as high as 87.12%. That figure comes from proximity-filtered benign data selected for embedding-space similarity to harmful content, a stronger condition than random benign data. But random benign data also elevated JSR in the same experiments, confirming that the effect does not depend on adversarial data selection.

PACT’s constrained-token defense

PACT proposes a targeted intervention: during fine-tuning, constrain the model’s output confidence on the identified safety tokens, preventing those tokens’ probability distributions from drifting. Because the safety signal is concentrated in relatively few tokens, the constraint is narrow enough to avoid degrading downstream task performance, according to the paper’s benchmarks.

This is a different approach from prior defenses, which the paper categorizes as model-wide interventions: restricting parameter updates globally or injecting safety data into the fine-tuning mix. Those methods work at the cost of generality or task quality. PACT’s argument is that targeting the specific tokens where safety lives preserves both safety and task performance simultaneously.

The approach has open questions. The safety-token concentration finding needs replication across more model families to determine whether it generalizes beyond the architectures the paper tested. And the method’s reliance on a reference model’s confidence distribution limits its applicability in settings where fine-tuning happens through a black-box API.

Reward models miss the problem

The OpenReview study raises a separate concern that compounds the fine-tuning problem: the reward models (RMs) used to guide alignment processes frequently fail to reflect human preferences on safety judgments. If the reward model cannot reliably distinguish aligned from unaligned outputs, fine-tuning pipelines that include a safety reinforcement step may not catch the degradation.

This matters because reward-model-based alignment, RLHF and its descendants, is the standard approach for post-training safety. If RMs are unreliable arbiters of safety, the evaluation infrastructure that providers rely on to certify alignment has a structural blind spot. That blind spot is separate from and precedes the fine-tuning erosion problem: even before anyone fine-tunes, the tooling used to verify alignment may not be measuring what its operators think it measures.

The policy gap is structural

Open-weight model providers ship safety evaluations measured at release. Those evaluations describe the model’s behavior on specific benchmarks, at a specific checkpoint. They do not describe the model’s behavior after downstream fine-tuning.

This is a measurement gap, not a vendor deception. The PACT and OpenReview findings together establish that well-intentioned fine-tuning on clean data can erode safety alignment through mechanisms invisible without token-level inspection. A provider’s at-release safety report is a snapshot of one checkpoint, not a guarantee about every derivative model built from those weights.

The policy consequence follows directly: any regulatory or certification regime that treats an at-release safety evaluation as durable rests on an assumption the research no longer supports. The burden of verifying alignment falls on every operator who fine-tunes and deploys, but those operators generally lack both the tools and the incentive structure to run their own safety evaluations on derivative models.

What practitioners can do today

Fine-tuning open-weight models for deployment is routine. Preserving safety through that process is not. Several concrete steps are available now, though none are free:

Re-evaluate after fine-tuning. Run the same safety benchmarks on the fine-tuned checkpoint that the base model was evaluated on at release. This is the minimum viable check. It catches gross degradation but may miss subtler distributional shifts that targeted probes would surface.

Monitor for the three drift mechanisms. The OpenReview study identifies answer-structure changes, identity calibration drift, and role-play effects as the primary degradation channels. Evaluation suites should probe for these specifically, rather than relying on general-purpose safety benchmarks that were not designed to catch them.

Consider constrained fine-tuning where feasible. PACT-style token constraints require white-box access to the aligned reference model, but when available, the approach preserves refusal behavior without measurable task-performance penalties, according to the paper’s reported results. If replicated across more architectures, this could become a standard fine-tuning step.

Treat reward-model scores with skepticism. The finding that state-of-the-art RMs frequently fail to reflect human safety preferences means RM-guided fine-tuning should not be treated as a sufficient safety guardrail on its own. Pair RM evaluations with direct human review of outputs on adversarial prompts.

These are interim measures for a gap that will persist until safety-preserving fine-tuning methods mature and become standard in fine-tuning toolchains. The research direction is clear: safety alignment is a property that must be actively maintained through downstream modifications, not assumed to persist by default from a base model’s release evaluation.

Frequently Asked Questions

Does this safety erosion affect API-tuned models like GPT-3.5, or only models with downloadable weights?

Both. Qi et al. demonstrated the effect on GPT-3.5 Turbo through OpenAI’s fine-tuning API using only benign Alpaca data. The open-weight distinction matters for defenses, not for the erosion itself: operators with local weights can apply PACT-style token constraints, while API users have no equivalent tool and must rely entirely on post-hoc evaluation.

What does it cost to strip safety alignment versus to rebuild it?

Lermen et al. showed that LoRA removes safety training for under $200 in compute. Restoring alignment requires a full RLHF pipeline: recruiting human annotators, training a fresh reward model, and running policy optimization over multiple iterations. Those pipelines typically run into the tens of thousands of dollars and span weeks. The cost asymmetry favors stripping over restoring by roughly two orders of magnitude.

Is data filtering an effective defense against fine-tuning safety erosion?

No. Data filtering targets adversarial fine-tuning, where harmful examples are deliberately injected into the training set. The PACT and OpenReview studies address a different threat model: benign fine-tuning on clean data that degrades safety through structural side effects such as answer-structure shifts, identity drift, and role-play amplification. Filtering harmful examples from an already-clean dataset has nothing to act on. The defense must target the fine-tuning process itself, not the training data.

What evaluation options exist for teams using commercial API fine-tuning endpoints?

API operators face compounding blind spots. They cannot inspect token-level confidence, ruling out PACT. They cannot access model logits, which prevents custom adversarial probing beyond what the endpoint returns as text. The OpenReview findings show that vendor-provided reward-model scores may not reflect actual safety. The remaining option is black-box prompt testing against a curated adversarial set, which catches gross refusal failures but misses the subtler distributional drifts (identity confusion, role-play amplification) that the OpenReview study identifies as primary degradation channels.

sources · 3 cited

  1. Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning primary accessed 2026-06-04
  2. Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs primary accessed 2026-06-04
  3. Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment Degradation in Benign Fine-tuning primary accessed 2026-06-04