Why Fine-Tuning Strips Safety Alignment From Open-Weight LLMs

Safety alignment in open-weight LLMs does not survive fine-tuning, even when the fine-tuning data is clean, harmless, and carefully curated. PACT (Preserving Safety Alignment via Constrained Tokens), revised through a v3 posting on June 5 [Updated June 2026], shows why: safety behavior is concentrated in a small set of output tokens, and ordinary fine-tuning destabilizes those tokens regardless of what the fine-tuning data contains.

Safety lives in a handful of tokens

The PACT paper’s core finding is mechanistic. Safety-aligned LLMs do not distribute their refusal behavior evenly across the vocabulary. Instead, a small subset of “safety tokens” carries disproportionate confidence when the model is refusing a harmful request. These are the tokens whose probability mass shifts most between a safety-trained model and its unaligned base version.

When someone fine-tunes the model on downstream task data, the gradient updates do not discriminate between safety-critical parameters and task-relevant ones. The safety tokens’ confidence distributions get overwritten as collateral damage. The model has not been “taught to be unsafe” in any deliberate sense; the safety signal was simply in the way, and fine-tuning ran over it.

This is a narrow-concentration problem, not a whole-model problem. PACT’s contribution is showing that the concentration is tight enough to target directly.
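The selection idea described above, ranking tokens by how much probability mass shifts between the aligned model and its base version, can be illustrated with a minimal sketch. The vocabulary, the logits, and the helper name `top_shifted_tokens` are all made up for illustration, and ranking by positive shift at a single position is an assumption, not the paper's documented procedure.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_shifted_tokens(aligned_logits, base_logits, vocab, k=2):
    """Return the k tokens that gain the most probability in the aligned model."""
    p_aligned = softmax(aligned_logits)
    p_base = softmax(base_logits)
    shifts = [(pa - pb, tok) for pa, pb, tok in zip(p_aligned, p_base, vocab)]
    shifts.sort(reverse=True)  # largest positive shift first
    return [tok for _, tok in shifts[:k]]

# Toy next-token logits on a harmful prompt (hypothetical numbers and vocabulary).
vocab = ["Sure", "I", "cannot", "Here", "Sorry"]
base_logits = [2.0, 0.5, 0.1, 1.5, 0.0]
aligned_logits = [0.0, 2.0, 1.0, 0.0, 2.5]

safety_tokens = top_shifted_tokens(aligned_logits, base_logits, vocab, k=2)
print(safety_tokens)
```

In a real setting the shift would be aggregated over many harmful prompts and positions; this sketch looks at one position only.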

Benign data is enough

Prior work established that fine-tuning can strip alignment. Qi et al. (2023), an ICLR 2024 oral, demonstrated that fine-tuning GPT-3.5 Turbo on benign instruction data was sufficient to degrade its refusal behavior, and that an adversarial set of just 10 examples jailbroke it for under $0.20 [Updated June 2026]. Lermen et al. (2024) showed that Low-Rank Adaptation (LoRA) can undo safety training across Llama 2-Chat at 7B, 13B, and 70B for under $200 in total compute, driving the 70B model’s refusal rate to roughly 1 percent [Updated June 2026].

The newer results push further. The OpenReview study “Picky LLMs and Unreliable RMs” identifies three specific mechanisms that drive safety degradation during benign fine-tuning: changes to answer structure, identity calibration drift, and role-play effects. None of these require adversarial data. The fine-tuning corpus can be entirely benign; the degradation comes from how fine-tuning reshapes the model’s output distribution, not from what the training examples contain.

The problem extends beyond text. A separate study on audio LLMs found that benign fine-tuning elevated Jailbreak Success Rate (JSR) from single-digit percentages to as high as 87.12% across Audio Flamingo 3, Kimi-Audio-7B-Instruct, and Qwen2.5-Omni [Updated June 2026]. That figure comes from proximity-filtered benign data selected for embedding-space similarity to harmful content, a stronger condition than random benign data. But random benign data also elevated JSR in the same experiments, confirming that the effect does not depend on adversarial data selection. The audio result is also a reminder that safety training built for one modality does not transfer cleanly, a gap Groundy examined in why audio jailbreaks slip past text-trained safety.

PACT’s constrained-token defense

PACT proposes a targeted intervention: during fine-tuning, constrain the model’s output confidence on the identified safety tokens, preventing those tokens’ probability distributions from drifting. Because the safety signal is concentrated in relatively few tokens, the constraint is narrow enough to avoid degrading downstream task performance, according to the paper’s benchmarks.
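One way to picture a constraint of this kind is a fine-tuning loss with an extra penalty on confidence drift over the safety tokens only. The sketch below is an assumption-laden illustration: the squared-difference penalty, the weight `lam`, and the function name `constrained_loss` are hypothetical choices, and the paper's actual objective may differ.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_loss(logits, ref_logits, target_idx, safety_idx, lam=1.0):
    """Task cross-entropy plus a penalty on safety-token confidence drift.

    ref_logits come from the frozen aligned reference model (assumed available).
    Only tokens listed in safety_idx are constrained; all others move freely.
    """
    p = softmax(logits)
    p_ref = softmax(ref_logits)
    task_loss = -math.log(p[target_idx])
    drift = sum((p[i] - p_ref[i]) ** 2 for i in safety_idx)
    return task_loss + lam * drift

# Toy values: the reference model is confident on tokens 1 and 4, the tuned model is not.
ref_logits = [0.0, 2.0, 1.0, 0.0, 2.5]
tuned_logits = [2.0, 0.5, 0.1, 1.5, 0.0]
safety_idx = [1, 4]

unconstrained = constrained_loss(tuned_logits, ref_logits, 0, safety_idx, lam=0.0)
constrained = constrained_loss(tuned_logits, ref_logits, 0, safety_idx, lam=5.0)
no_drift = constrained_loss(ref_logits, ref_logits, 0, safety_idx, lam=5.0)
```

The penalty is zero when the tuned model's confidence on the safety tokens matches the reference, so it only costs something when those tokens drift.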

This is a different approach from prior defenses, which the paper categorizes as model-wide interventions: restricting parameter updates globally or injecting safety data into the fine-tuning mix. Those methods work at the cost of generality or task quality. PACT’s argument is that targeting the specific tokens where safety lives preserves both safety and task performance simultaneously.

The approach has open questions. The safety-token concentration finding needs replication across more model families to determine whether it generalizes beyond the architectures the paper tested. And the method’s reliance on a reference model’s confidence distribution limits its applicability in settings where fine-tuning happens through a black-box API.

The localization claim is contested

PACT’s premise is that safety is concentrated tightly enough to fence off. Not everyone agrees the geometry is that clean. Safety Subspaces are Not Linearly Distinct, a May 2025 fine-tuning case study, found that the directions in activation and weight space that amplify safe behavior also amplify generally useful behavior, and that prompts with different safety profiles activate overlapping representations. If safe and useful computation share the same subspace, a defense that tries to isolate and freeze a “safety” component risks freezing capability with it, or missing the part of the safety signal that lives in the shared directions [Updated June 2026].

That tension does not refute PACT’s token-level result, which operates on output confidence rather than internal subspaces, but it bounds how far the localization story travels. Token-level confidence is a downstream readout; the upstream representations that produce it may be diffuse. A defense calibrated to the tokens can still be undercut by drift in the representations feeding them. The honest reading is that safety is localized enough to target at the output layer and diffuse enough internally that targeting alone is no guarantee.

Defenses are moving upstream

If fine-tuning reliably erodes alignment, the response splits into two camps: make the released weights resistant to tampering, or repair alignment after it drifts.

The tamper-resistance camp tries to bake durability into the checkpoint before release. Tamper-Resistant Safeguards (TAR), at ICLR 2025, trained safeguards meant to survive hundreds of fine-tuning steps rather than collapse on the first few. It was the first credible claim that an open-weight safeguard could outlast a determined fine-tuner. Deep Ignorance, from August 2025, pushed the intervention back to pretraining: strip biothreat-proxy text from the corpus so the dangerous capability is never learned, leaving nothing for malicious fine-tuning to surface. Withholding knowledge so the model never acquires it is a sturdier guarantee than asking it to refuse knowledge it does have [Updated June 2026].

The repair camp accepts that downstream weights drift and tries to anchor them during the update. AsFT, from June 2025, builds on the “safety basin” geometry, penalizing the component of each update that is orthogonal to the alignment direction so fine-tuning stays inside the region where refusal behavior holds. A complementary line asks whether one reusable adapter can re-align an already-degraded model instead of retraining from scratch, an approach Groundy covered in “Can one safety adapter realign every fine-tuned LLM.”
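The AsFT-style anchoring idea, penalizing the part of an update that leaves the alignment direction, reduces to a vector projection. The sketch below is a toy version under stated assumptions: `align_dir` is taken as given (for example, a flattened difference between aligned and base weights, which is an assumption here), and the squared-norm penalty and the name `orthogonal_penalty` are illustrative rather than the paper's exact regularizer.

```python
def orthogonal_penalty(update, align_dir):
    """Squared norm of the update component orthogonal to the alignment direction."""
    dot = sum(u * a for u, a in zip(update, align_dir))
    norm_sq = sum(a * a for a in align_dir)
    # Projection of the update onto the alignment direction.
    parallel = [dot / norm_sq * a for a in align_dir]
    # Whatever remains is the orthogonal component that gets penalized.
    orth = [u - p for u, p in zip(update, parallel)]
    return sum(o * o for o in orth)

align_dir = [1.0, 0.0]  # hypothetical 2-D alignment direction

along = orthogonal_penalty([2.0, 0.0], align_dir)   # moves along the direction
across = orthogonal_penalty([0.0, 3.0], align_dir)  # moves fully orthogonal
mixed = orthogonal_penalty([1.0, 1.0], align_dir)   # a bit of both
```

Updates along the alignment direction incur no penalty, while orthogonal movement is charged in proportion to its squared length.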

None of these is settled, and the benchmarking has only recently caught up. Through early 2026 each paper reported tamper-resistance against whatever attack it happened to choose, which makes the numbers hard to compare. A safeguard that survives 100 fine-tuning steps of one attack may fold on step 10 of another. Until the attack suite is standardized, “resistant” describes a result against a specific adversary, not a property of the weights.

Reward models miss the problem

The OpenReview study raises a separate concern that compounds the fine-tuning problem: the reward models (RMs) used to guide alignment processes frequently fail to reflect human preferences on safety judgments. If the reward model cannot reliably distinguish aligned from unaligned outputs, fine-tuning pipelines that include a safety reinforcement step may not catch the degradation.

This matters because reward-model-based alignment, RLHF and its descendants, is the standard approach for post-training safety. If RMs are unreliable arbiters of safety, the evaluation infrastructure that providers rely on to certify alignment has a structural blind spot. That blind spot is separate from and precedes the fine-tuning erosion problem: even before anyone fine-tunes, the tooling used to verify alignment may not be measuring what its operators think it measures.
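A simple way to quantify the concern is pairwise agreement: the fraction of human-labeled preference pairs where the reward model scores the human-preferred response higher. The sketch below uses a deliberately flawed, hypothetical reward function (`length_biased_rm`) and two made-up pairs; it illustrates the measurement, not any real reward model or the study's protocol.

```python
def rm_agreement(pairs, reward_fn):
    """Fraction of (prompt, chosen, rejected) pairs where the RM prefers `chosen`."""
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
    )
    return correct / len(pairs)

def length_biased_rm(prompt, response):
    """Hypothetical stand-in reward model that simply prefers longer responses."""
    return float(len(response))

# Human-labeled pairs: (prompt, human-preferred response, rejected response).
pairs = [
    ("placeholder harmful request",
     "I can't help with that.",
     "Sure, here is a detailed step-by-step guide."),
    ("What is the capital of France?",
     "The capital of France is Paris.",
     "No."),
]

agreement = rm_agreement(pairs, length_biased_rm)
print(agreement)
```

Here the stand-in reward model gets the helpfulness pair right and the safety pair wrong, which is the kind of failure an aggregate score can hide.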

The policy gap is structural

Open-weight model providers ship safety evaluations measured at release. Those evaluations describe the model’s behavior on specific benchmarks, at a specific checkpoint. They do not describe the model’s behavior after downstream fine-tuning.

This is a measurement gap, not vendor deception. The PACT and OpenReview findings together establish that well-intentioned fine-tuning on clean data can erode safety alignment through mechanisms invisible without token-level inspection. A provider’s at-release safety report is a snapshot of one checkpoint, not a guarantee about every derivative model built from those weights.

The policy consequence follows directly: any regulatory or certification regime that treats an at-release safety evaluation as durable rests on an assumption the research no longer supports. The same durability problem shows up when safety is enforced at inference rather than baked into training, a certification gap Groundy has covered before. The burden of verifying alignment falls on every operator who fine-tunes and deploys, but those operators generally lack both the tools and the incentive structure to run their own safety evaluations on derivative models.

What practitioners can do today

Fine-tuning open-weight models for deployment is routine. Preserving safety through that process is not. Several concrete steps are available now, though none are free:

Re-evaluate after fine-tuning. Run the same safety benchmarks on the fine-tuned checkpoint that the base model was evaluated on at release. This is the minimum viable check. It catches gross degradation but may miss subtler distributional shifts that targeted probes would surface.

Monitor for the three drift mechanisms. The OpenReview study identifies answer-structure changes, identity calibration drift, and role-play effects as the primary degradation channels. Evaluation suites should probe for these specifically, rather than relying on general-purpose safety benchmarks that were not designed to catch them.

Consider constrained fine-tuning where feasible. PACT-style token constraints require white-box access to the aligned reference model, but when available, the approach preserves refusal behavior without measurable task-performance penalties, according to the paper’s reported results. If replicated across more architectures, this could become a standard fine-tuning step.

Treat reward-model scores with skepticism. The finding that state-of-the-art RMs frequently fail to reflect human safety preferences means RM-guided fine-tuning should not be treated as a sufficient safety guardrail on its own. Pair RM evaluations with direct human review of outputs on adversarial prompts.
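The first step, re-running the same safety prompts before and after fine-tuning, can be sketched as a refusal-rate regression check. Everything here is a placeholder: the marker list, the stub `generate` callables, and the prompts are hypothetical, and keyword matching is a crude proxy that a real evaluation would replace with a proper classifier or human review.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")

def is_refusal(text):
    """Crude keyword check for refusal phrasing (an assumption, not a standard)."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(generate, prompts):
    """Share of prompts for which the model's output looks like a refusal."""
    return sum(is_refusal(generate(p)) for p in prompts) / len(prompts)

def refusal_drop(base_generate, tuned_generate, prompts):
    """How much the refusal rate fell between the base and fine-tuned checkpoint."""
    return refusal_rate(base_generate, prompts) - refusal_rate(tuned_generate, prompts)

# Stub models standing in for real checkpoints (hypothetical behavior).
def base_generate(prompt):
    return "Sorry, I cannot help with that."

def tuned_generate(prompt):
    if prompt.endswith(("A", "B")):
        return "Sorry, I cannot help with that."
    return "Sure, here is how."

prompts = [f"placeholder harmful prompt {label}" for label in "ABCD"]

drop = refusal_drop(base_generate, tuned_generate, prompts)
print(drop)
```

A team could gate deployment on this drop staying below a chosen threshold, with the caveat from the steps above that such a check catches gross degradation only.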

These are interim measures for a gap that will persist until safety-preserving fine-tuning methods mature and become standard in fine-tuning toolchains. The research direction is clear: safety alignment is a property that must be actively maintained through downstream modifications, not assumed to persist by default from a base model’s release evaluation.

Frequently Asked Questions

Does this safety erosion affect API-tuned models like GPT-3.5, or only models with downloadable weights?

Both. Qi et al. demonstrated the effect on GPT-3.5 Turbo through OpenAI’s fine-tuning API using only benign Alpaca data. The open-weight distinction matters for defenses, not for the erosion itself: operators with local weights can apply PACT-style token constraints, while API users have no equivalent tool and must rely entirely on post-hoc evaluation.

What does it cost to strip safety alignment versus to rebuild it?

Lermen et al. showed that LoRA removes safety training for under $200 in compute. Rebuilding alignment from scratch typically means a full RLHF pipeline: recruiting human annotators, training a fresh reward model, and running policy optimization over multiple iterations. Those pipelines typically run into the tens of thousands of dollars and span weeks. The cost asymmetry favors stripping over restoring by roughly two orders of magnitude.

Is data filtering an effective defense against fine-tuning safety erosion?

No. Data filtering targets adversarial fine-tuning, where harmful examples are deliberately injected into the training set. The PACT and OpenReview studies address a different threat model: benign fine-tuning on clean data that degrades safety through structural side effects such as answer-structure shifts, identity drift, and role-play amplification. Filtering harmful examples from an already-clean dataset has nothing to act on. The defense must target the fine-tuning process itself, not the training data.

What evaluation options exist for teams using commercial API fine-tuning endpoints?

API operators face compounding blind spots. They cannot inspect token-level confidence, ruling out PACT. They cannot access model logits, which prevents custom adversarial probing beyond what the endpoint returns as text. The OpenReview findings show that vendor-provided reward-model scores may not reflect actual safety. The remaining option is black-box prompt testing against a curated adversarial set, which catches gross refusal failures but misses the subtler distributional drifts (identity confusion, role-play amplification) that the OpenReview study identifies as primary degradation channels.