Activation Steering Was Sold as LLM Control. New Work Makes It an Attack Surface

Activation steering was supposed to be the lightweight alignment knob that let deployers nudge model behavior without retraining. A paper published June 4 on arXiv by Donato Crisostomi and collaborators shows the knob turns both ways: an attacker who controls the steering dataset can silently invert those vectors into reliable jailbreaks, with attack success rates of 20% to 55% across two open-weight model families.

What activation steering was supposed to do

Activation steering (also called representation engineering) works by computing a “refusal direction vector” from pairs of contrasting examples and injecting it into the model’s hidden states during inference. The appeal is straightforward: no fine-tuning, no retraining, no changes to the model weights. A deployer shifts behavior toward or away from specific outputs by adjusting a vector in activation space.
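As a rough illustration of the mechanism described above, the sketch below derives a steering vector as the difference between the mean activations of two contrasting example sets and adds a scaled copy of it to a hidden state. The difference-of-means construction, the function names, and the toy two-dimensional activations are assumptions made for illustration; the article does not specify how any particular pipeline builds its vectors.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty list of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(positive_acts: Sequence[Sequence[float]],
                    negative_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference of means between the two sides of the contrast set.

    Assumption: this is one common way to build a steering vector,
    not necessarily the construction used in the paper.
    """
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def apply_steering(hidden: Sequence[float],
                   vector: Sequence[float],
                   alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to one hidden state at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical values, two hidden dimensions).
positive = [[1.0, 0.0], [3.0, 2.0]]   # examples showing the target behavior
negative = [[0.0, 0.0], [2.0, 0.0]]   # contrasting examples

vec = steering_vector(positive, negative)
steered = apply_steering([0.5, 0.5], vec, alpha=2.0)
print(vec, steered)
```

Because every example in the contrast set contributes to the two means, anyone who can edit that set can move the resulting direction, which is the dependency the attack exploits.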

The known cost, as work on principled steering via null-space projection (NullSteer, CVPR 2026) documents, is that steering vectors do not distinguish between benign and malicious inputs. Apply a refusal direction and you get over-refusal: the model rejects harmless prompts alongside genuinely dangerous ones. That tradeoff has been acknowledged since the technique’s adoption. The previously unknown cost was that the same mechanism is invertible by an adversary.

4-6% token substitution, total inversion

Crisostomi et al. (arXiv:2606.05958) demonstrate a stealth data poisoning attack on the steering-vector construction pipeline. The attacker modifies only 4-6% of the tokens in the dataset used to derive the steering vector. The modifications are small enough that the resulting vector still performs its intended steering function on benign prompts. Simultaneously, the vector is silently aligned with an anti-refusal direction that jailbreaks the target model on adversarial inputs.

The poisoning is invisible to standard inspection. The dataset looks correct. The derived vector produces expected behavior on benign queries. The attack surface is buried in the linear algebra of vector construction rather than in the model weights or the prompt, where most safety tooling currently focuses.

Attack results across two model families

The authors evaluated the attack across two open-weight model families and eight model-attribute combinations. The absolute attack success rate (ASR) ranged from 20% to 55%, a +19% to +51% increase over the clean-reference baseline, according to arXiv:2606.05958. The range is wide because it reflects different model architectures, different steering targets, and different refusal behaviors.

These numbers describe a poisoning attack on the steering dataset, not on the model weights. The adversary needs control over the data used to construct the steering vector. They do not need access to the model’s parameters, its training pipeline, or its system prompt.

The supply-chain attack vector

The threat model is a supply-chain compromise, not a direct model break-in. Steering vectors are increasingly shared as community artifacts: bundles of text datasets, precomputed vectors, and sometimes fine-tuned adapter weights. A malicious actor can distribute such a bundle alongside an equivalence certificate that end-users can verify against a reference implementation, per arXiv:2606.05958. The certificate checks out. The vector behaves as expected on benign inputs. The poisoning goes undetected.

This is structurally similar to SeedHijack (arXiv:2605.28632), which demonstrates that replacing the PRNG in an LLM watermarking scheme produces outputs that pass all six evaluated statistical detectors as watermarked when they are not. Both attacks target inference-time provenance and safety infrastructure, not the model itself. Both exploit the assumption that auxiliary artifacts, whether steering datasets or watermark PRNGs, can be trusted without independent verification.

Orthogonalization as a partial defense

The paper proposes a refusal-direction orthogonalization defense: project the steering vector onto the subspace orthogonal to the refusal direction before applying it. This recovers approximately 82% of the ASR gap introduced by poisoning without degrading performance on benign tasks, according to arXiv:2606.05958.
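The projection step can be sketched in a few lines. The code below removes from a steering vector its component along a given refusal direction, leaving only the part orthogonal to it. This is the standard vector-rejection formula, offered as a minimal sketch rather than the paper's implementation; the function names and toy vectors are hypothetical, and the refusal direction is assumed to be already known.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(vector: Sequence[float],
                  refusal_direction: Sequence[float]) -> List[float]:
    """Return `vector` minus its projection onto `refusal_direction`.

    The result has zero component along the refusal direction, so a
    poisoned vector can no longer push activations along (or against)
    that axis.
    """
    norm_sq = dot(refusal_direction, refusal_direction)
    if norm_sq == 0.0:
        raise ValueError("refusal direction must be non-zero")
    coeff = dot(vector, refusal_direction) / norm_sq
    return [v - coeff * r for v, r in zip(vector, refusal_direction)]


# Toy example: the refusal direction is the first axis (hypothetical).
refusal = [1.0, 0.0]
candidate = [3.0, 4.0]          # steering vector with a refusal-axis component

cleaned = orthogonalize(candidate, refusal)
print(cleaned, dot(cleaned, refusal))
```

The sketch only removes the component along one estimated direction, so its protection is limited by how well that direction is estimated.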

That is not a complete fix. Recovering 82% of the gap leaves the remaining 18% on the table. The defense also assumes the deployer knows the refusal direction to orthogonalize against, which requires access to refusal/non-refusal contrast pairs that may not always be available for proprietary or closed-weight models.
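To make the remaining 18% concrete, the sketch below computes the attack success rate left over after a defense recovers a given fraction of the gap between clean and poisoned ASR. The input numbers are illustrative placeholders, not figures from the paper, and the helper name is hypothetical.

```python
def residual_asr(clean_asr: float,
                 poisoned_asr: float,
                 recovered_fraction: float) -> float:
    """ASR remaining after a defense closes part of the poisoning gap.

    All rates are fractions in [0, 1]. `recovered_fraction` is the share
    of the (poisoned - clean) gap that the defense removes.
    """
    gap = poisoned_asr - clean_asr
    return clean_asr + (1.0 - recovered_fraction) * gap


# Illustrative values only: 5% clean ASR, 45% poisoned ASR, 82% recovery.
remaining = residual_asr(clean_asr=0.05, poisoned_asr=0.45,
                         recovered_fraction=0.82)
print(f"residual ASR: {remaining:.3f}")
```

Under these placeholder inputs the defended system still sits several points above its clean baseline, which is the sense in which the fix is partial.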

Separately, NullSteer (CVPR 2026) addresses a different but related problem: making activation steering selective so that it steers malicious inputs while leaving benign activations mathematically unchanged. NullSteer’s null-space projection and Crisostomi et al.’s orthogonalization defense are complementary rather than redundant. One improves steering selectivity; the other hardens the steering vector against dataset compromise. Neither alone closes the full gap.

Inference-time controls are dual-use

The pattern across these results is consistent as of mid-2026. Activation steering, watermarking, and other inference-time interventions were designed as one-directional levers: the deployer pulls, the model complies. That assumption does not survive contact with an adversary who controls any part of the intervention’s supply chain.

Anyone shipping activation-level safety controls now faces a different threat model than the one that motivated adoption. The steering vector is not a free alignment knob. It is a privileged interface into the model’s decision-making process, and anyone who can influence the data feeding that interface can exploit it. The orthogonalization defense provides a hardening strategy, and NullSteer’s selectivity work addresses the over-refusal problem that motivated shared steering vectors in the first place. The structural lesson is that inference-time safety controls need the same supply-chain integrity guarantees as the model weights themselves.

Frequently Asked Questions

Does this attack work against closed-weight API models like Claude or GPT?

The poisoning requires control over the steering dataset used to derive the refusal direction vector. Closed-weight API providers generally do not expose activation-level hooks to external callers, so the direct attack surface is limited to self-hosted deployments where teams construct or download steering vectors from third parties. The indirect risk is that if a provider itself uses community-sourced vectors internally, the same jailbreak behavior would manifest in API outputs. The attacker would need to compromise the provider’s internal supply chain rather than a public bundle, a higher-bar but structurally identical threat.

What should a team running self-hosted models change today?

Stop consuming community-shared steering vectors without independent derivation. Build vectors from internally sourced contrast pairs, apply refusal-direction orthogonalization before deployment, and treat steering datasets with the same integrity controls as model weights: signed provenance, checksum verification, and no untrusted bundles in production. Adding NullSteer-style null-space projection as a second layer further narrows the attack window by restricting the vector’s effect to genuinely malicious inputs, reducing both jailbreak exposure and the over-refusal penalty on benign queries.
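The checksum step might look like the following minimal sketch, which compares the SHA-256 digest of a steering bundle against a digest pinned when the bundle was built internally. The function names and the sample bytes are hypothetical. A matching digest only shows that the bundle has not changed since it was pinned; it says nothing about whether the pinned data was clean to begin with.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, pinned_hex: str) -> bool:
    """True if the artifact's digest matches the pinned digest.

    hmac.compare_digest performs a constant-time comparison.
    """
    return hmac.compare_digest(sha256_hex(data), pinned_hex.lower())


# Hypothetical bundle contents; in practice these bytes come from a file.
bundle = b"contrast pairs + derived steering vector"
pinned = sha256_hex(bundle)           # recorded at build time, stored separately

tampered = bundle.replace(b"pairs", b"pairz")
print(verify_artifact(bundle, pinned), verify_artifact(tampered, pinned))
```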

How does the 4-6% poisoning threshold compare to training-time attacks?

Training-time data poisoning typically requires corrupting 10-20% or more of a dataset to achieve comparable attack success, depending on model size and poison strategy. The 4-6% threshold here is low because the attack targets a derived artifact (a single direction vector) rather than distributed model weights. Poisoning a vector construction pipeline is more efficient because the attack surface is concentrated: small perturbations in the dataset get amplified by the differencing operation (contrasting paired examples) into a consistent directional bias across the entire activation space.

Could equivalence certificates be hardened to catch this specific poisoning?

Current equivalence certificates verify that a vector matches a reference on test inputs, but the poisoning preserves benign performance by design. A stronger certificate would need to probe specifically for anti-refusal alignment by running the vector against a curated set of adversarial and borderline prompts and comparing outputs against a known-clean baseline. That turns a lightweight hash check into a full evaluation suite, raising verification cost. The paper demonstrates the weakness of structural-integrity checks against semantic compromise but does not propose an enhanced certificate standard.
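One way such a behavioral probe could be structured is sketched below: measure how often the model refuses a fixed set of adversarial prompts under a known-clean vector and under the candidate vector, then flag the candidate if refusals drop by more than a tolerance. The function names, the boolean outcome lists, and the 5-point tolerance are all assumptions for illustration, not part of any certificate scheme described in the paper.

```python
from typing import Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Fraction of adversarial prompts the model refused."""
    if not refused:
        raise ValueError("need at least one probe outcome")
    return sum(refused) / len(refused)


def fails_adversarial_probe(candidate_refused: Sequence[bool],
                            baseline_refused: Sequence[bool],
                            max_drop: float = 0.05) -> bool:
    """Flag the candidate vector if refusals fall too far below baseline.

    `max_drop` is a hypothetical tolerance on the refusal-rate difference.
    """
    drop = refusal_rate(baseline_refused) - refusal_rate(candidate_refused)
    return drop > max_drop


# Hypothetical outcomes on ten adversarial prompts (True = model refused).
baseline = [True] * 10
candidate = [True] * 6 + [False] * 4

print(fails_adversarial_probe(candidate, baseline))
```

Collecting the outcome lists requires running the model on every probe prompt twice, which is where the added verification cost comes from.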