Can You Make a Multimodal Model Unlearn With Activation Steering?

Steering vectors suppress behavior at runtime without editing weights. Two 2026 papers show they transfer between models, so suppression alone is not unlearning.

Activation steering can suppress a model’s tendency to produce specific outputs at inference time, but suppressing the behavior is not the same as removing the capability from the weights. Two June 2026 papers on arXiv provide a mechanism-level account of why: steering vectors are portable, transferable, and recoverable, which means a model that appears to have “unlearned” a behavior may simply be running with a vector that masks it.

What does activation steering actually change?

Activation steering works by adding a fixed vector to a model’s intermediate activations during forward passes. The vector is typically extracted by computing the difference in mean activations between a “positive” behavior and a “negative” behavior across a set of prompts. When applied at inference time, this vector biases the model’s internal representations toward or away from the targeted behavior.

The model’s weights do not change. The steering vector is an external perturbation applied at runtime; remove it, and the original behavior resurfaces.
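The extraction and application steps described above can be sketched in a few lines. This is a minimal toy, not either paper's implementation: the single `tanh` layer, the synthetic "prompt" inputs, and the names `layer`, `steering_vector`, and `alpha` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16

# Hypothetical frozen layer: W stands in for a real model's weights.
W = rng.normal(size=(hidden, hidden))
W_before = W.copy()

def layer(x, steer=None):
    """Compute activations, optionally adding a fixed steering vector."""
    h = np.tanh(x @ W)
    return h if steer is None else h + steer

# Synthetic inputs standing in for "positive" and "negative" behavior prompts.
pos_prompts = rng.normal(loc=0.5, size=(64, hidden))
neg_prompts = rng.normal(loc=-0.5, size=(64, hidden))

# Steering vector = difference in mean activations between the two sets.
steering_vector = layer(pos_prompts).mean(axis=0) - layer(neg_prompts).mean(axis=0)

x = rng.normal(size=hidden)
alpha = -1.0  # assumed coefficient; negative pushes away from the positive behavior

plain = layer(x)
steered = layer(x, steer=alpha * steering_vector)
restored = layer(x)  # vector removed: identical to the unsteered pass

# Change in the activation's projection onto the steering direction.
shift = float((steered - plain) @ steering_vector)
```

In this sketch `W` is never written to, so dropping the `steer` argument returns exactly the original activations. That is the sense in which steering is a runtime perturbation rather than a weight edit.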

According to arXiv:2606.00995, a teacher model’s system prompt can be approximated by a single steering vector added to activations. This means the vector encodes enough signal to shift the model’s output distribution to match a policy it was not explicitly trained on. But encoding the policy in a vector is not the same as editing it into the weights.

How do steering vectors transfer between models?

The transfer properties of steering vectors are what make them relevant to unlearning claims. arXiv:2606.00995 documents a mechanism the authors call “steering-vector distillation”: when a student model is fine-tuned on outputs from a steered teacher, the student learns an aligned steering vector internally, even without the original vector being applied to the student.

This produces what the paper terms “subliminal learning.” Students acquire behavioral traits from the teacher’s outputs even when those traits are semantically unrelated to the training data. The vector is the carrier, not the content.

The paper identifies a key constraint: system prompts that are not well approximated by steering vectors are not subliminally learned. This establishes a boundary condition on what steering can and cannot transmit between models.

A companion study, “Quantifying Subliminal Behavioral Transfer Ratios” (arXiv:2606.11270), measures the scaling properties of this transfer. The results are model-dependent. According to the paper, Llama-2 exhibits a sharp threshold: transfer ratios jump to τ values of 0.25 and 0.32 once the steering coefficient α exceeds −0.15. Qwen2.5, by contrast, shows continuous and higher transfer, reaching τ = 0.61. These profiles are not universal properties of language models. They appear to depend on the specific architecture and training distribution.

The distillation channel is also bidirectional. The same mechanism that implants a steering vector in a student from teacher outputs can, in principle, be used to extract or override one. arXiv:2606.00995 further notes that adaptive optimizers (Adam-family) are necessary for subliminal learning to occur. Activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers let outlier gradients dominate and suppress this signal. For unlearning claims, the optimizer choice during any subsequent fine-tuning affects whether suppressed behaviors can resurface.
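The optimizer argument can be illustrated with a toy comparison. The sketch below is not the paper's experiment: it feeds the same synthetic gradients to plain SGD and to a hand-written Adam update, where one coordinate (the assumed steering direction) carries a small constant gradient and every other coordinate carries heavy-tailed, zero-mean noise. The dimensions, learning rate, and noise model are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, steps, lr = 50, 500, 1e-3

# Assumed steering direction: the first coordinate axis.
direction = np.zeros(dim)
direction[0] = 1.0

def synthetic_gradient():
    """Heavy-tailed zero-mean noise plus a small constant steering component."""
    g = rng.standard_t(df=3, size=dim)  # occasional large outliers
    g[0] = 0.01                         # small but consistent signal
    return g

def alignment(displacement):
    """Absolute cosine between total parameter change and the steering direction."""
    return abs(displacement @ direction) / np.linalg.norm(displacement)

grads = [synthetic_gradient() for _ in range(steps)]

# Plain SGD: each update is the raw gradient times the learning rate.
sgd_disp = -lr * np.sum(grads, axis=0)

# Adam: each coordinate is normalized by its running second moment.
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v, adam_disp = np.zeros(dim), np.zeros(dim), np.zeros(dim)
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)   # bias-corrected first moment
    v_hat = v / (1 - beta2**t)   # bias-corrected second moment
    adam_disp -= lr * m_hat / (np.sqrt(v_hat) + eps)

sgd_alignment = alignment(sgd_disp)
adam_alignment = alignment(adam_disp)
```

Under these assumptions, Adam's per-coordinate normalization turns the small constant gradient into a full-size step on every iteration, while the noisy coordinates largely cancel out. With SGD the raw noise swamps the steering coordinate, so `sgd_alignment` ends up far below `adam_alignment`.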

How should you evaluate whether a model has truly unlearned something?

The practical burden falls on evaluators. If steering-based suppression can mimic unlearning without removing the underlying capability, then the standard protocol, checking whether a model refuses to produce forbidden outputs on a held-out prompt set, is insufficient.

A more rigorous evaluation would need to test:

  • Adversarial recovery: can the suppressed behavior be elicited through prompt engineering, jailbreaks, or input perturbations that were not part of the original steering setup?
  • Probing-based detection: do linear probes trained on the model’s internal activations still classify the “unlearned” knowledge as present in the representations?
  • Fine-tuning recovery: can a small amount of fine-tuning on related data reactivate the suppressed behavior, and if so, how much data is required?

None of these appear in standard unlearning benchmarks, and none are addressed by the two papers reviewed here. But the transfer results from arXiv:2606.11270 suggest that the answer to the fine-tuning recovery question may depend heavily on model architecture: a model like Qwen2.5 with higher behavioral transfer ratios may also be more susceptible to recovery attacks.
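The probing-based check from the list above can be sketched on synthetic data. This toy assumes a concept is encoded linearly along a hypothetical `concept_dir`, models "behavior" as a positive projection onto that direction, and uses a difference-of-means probe. None of these choices come from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 32, 400

# Hypothetical unit direction along which the concept is encoded.
concept_dir = rng.normal(size=dim)
concept_dir /= np.linalg.norm(concept_dir)

# Synthetic activations: label 1 sits at +2 along concept_dir, label 0 at -2.
labels = rng.integers(0, 2, size=n)
signs = 2.0 * labels - 1.0
acts = 2.0 * np.outer(signs, concept_dir) + rng.normal(scale=0.5, size=(n, dim))

# Steering adds the same fixed vector to every activation.
steered = acts - 3.0 * concept_dir

def behavior_rate(x):
    """Assumed output check: behavior counts as expressed if the projection is positive."""
    return float((x @ concept_dir > 0).mean())

def fit_probe(x, y):
    """Difference-of-means linear probe with a midpoint decision threshold."""
    mu1, mu0 = x[y == 1].mean(axis=0), x[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -w @ (mu1 + mu0) / 2.0
    return w, b

def probe_accuracy(x, y, w, b):
    return float(((x @ w + b > 0).astype(int) == y).mean())

train, test = slice(0, n // 2), slice(n // 2, n)
w, b = fit_probe(steered[train], labels[train])

rate_plain = behavior_rate(acts[test])
rate_steered = behavior_rate(steered[test])
acc_steered = probe_accuracy(steered[test], labels[test], w, b)
```

Because a fixed vector shifts every activation by the same amount, the gap between the two classes is unchanged. The output check reports the behavior as nearly gone, while the probe still separates the classes on the steered activations. A real evaluation would run probes on actual model activations across layers, but the asymmetry between the two checks is the point of the sketch.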

What does this mean for multimodal models?

Multimodal models introduce an additional wrinkle. Steering vectors extracted from text-only activations may not transfer cleanly to vision or audio processing pathways. However, arXiv:2606.00995 reports that non-semantic generated data can still transmit a vector with semantic effects, so a multimodal model’s vision encoder could, in principle, carry steering-induced signals into shared representation layers. Neither paper tests this directly.

Whether ASRU, which pairs activation steering with reinforcement-learning-based unlearning, closes the suppression-versus-erasure gap in the multimodal setting cannot be verified from the two cited papers. The RL component would presumably edit weights rather than just steer activations at inference time, which could, if effective, produce genuine capability removal. But the degree of removal, the architecture dependence of the results, and the robustness to adversarial probing all remain open questions that those papers do not address.

The two papers that are available establish one thing clearly: activation steering alone is a runtime intervention, and the behavioral transfer literature shows that runtime interventions have structural properties (portability, threshold effects, optimizer dependence) that make them ill-suited as the sole mechanism for claims about knowledge removal. Any unlearning method that relies partly on steering vectors needs to demonstrate that the weight-edit component of the method does the actual removal work, because the steering component, by construction, does not.

Frequently Asked Questions

Does the optimizer used during downstream fine-tuning affect whether a suppressed behavior can resurface?

Likely, based on arXiv:2606.00995. Gradients on steered data carry a small but consistent component along the steering direction. Adaptive optimizers like Adam preserve that component, which is the same mechanism that enables subliminal learning, while non-adaptive optimizers let outlier gradients dominate and suppress it. Fine-tuning a previously steered model with an adaptive optimizer on unrelated data may therefore reactivate the suppressed behavior through that gradient channel. Using a non-adaptive optimizer for downstream training is a potential defense, but non-adaptive methods typically require more manual learning-rate tuning and may converge more slowly on the primary task.

Why do Llama-2 and Qwen2.5 show different transfer profiles, and what does that mean for red-teaming?

Llama-2 has a sharp threshold around steering coefficient α = −0.15: below it, behavioral transfer is near zero; above it, transfer jumps to τ values of 0.25–0.32. Qwen2.5 shows a continuous, roughly linear relationship, reaching τ = 0.61 with no sharp cutoff. For practitioners, Llama-2 has an identifiable safe operating range below the threshold where subliminal transfer is negligible. Qwen2.5 does not, so every increase in steering strength leaks proportionally more behavior and red-team probes must cover the full coefficient range.

Can a model that was never directly steered still exhibit steered behavior?

Yes, through steering-vector distillation. A student model fine-tuned on outputs from a steered teacher internalizes an aligned vector, acquiring the steered behavior even when the training data is topically unrelated to that behavior. The supply-chain implication is direct: a team that applies activation steering to a production model and then shares or sells the resulting outputs as training data can propagate the steered trait, including any suppression it encodes, to downstream models that were never steered. The boundary condition is that system prompts poorly approximated by steering vectors do not transfer this way.

Could a steering vector extracted from text-only activations influence vision or audio pathways in a multimodal model?

Plausibly, though the cited papers do not test it directly. Steering vectors carry both semantic (model-independent) and non-semantic (model-specific) effects. The semantic component can propagate through shared representation layers that multimodal architectures use to fuse text, image, and audio signals. The finding that non-semantic generated data still transmits vectors with semantic effects suggests the cross-modal pathway is plausible even without a direct vector match: a text-derived vector could shift shared-layer activations that vision and audio encoders also read from. Direct cross-modal transfer benchmarks are not provided in the available sources.

sources · 2 cited

  1. Subliminal Learning Is Steering Vector Distillation. Primary source, accessed 2026-06-12.
  2. Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation. Primary source, accessed 2026-06-12.