Refusal Steering Targets Individual Experts in MoE LLMs

MoE routing as a safety attack surface

In a mixture-of-experts model, each token is routed to a subset of expert modules by a learned gating function. The routing decision is fast and sparse; that is the whole efficiency argument. But sparse routing also means specific behaviors, including refusal of harmful prompts, may activate a consistent subset of experts across layers. If those experts can be identified, they can be suppressed.
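A minimal sketch of the gating step described above may help. It assumes a softmax-then-top-k router with renormalized weights; real models differ in where softmax is applied and in how many experts they select, and the function names and logits here are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the selected experts' softmax probabilities, renormalized
    to sum to 1. This is an assumed convention, not any specific model's.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router logits for one token over 4 experts.
selected = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print([i for i, _ in selected])  # [1, 3]
```

Only the experts in `selected` process the token, which is why a behavior that depends on a particular expert can be switched off by keeping that expert out of the top-k set.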

Two independent research efforts published within weeks of each other arrive at the same conclusion: refusal is not a distributed property of the full weight matrix. SteerMoE identifies behavior-associated experts by comparing activation frequency between paired safe and unsafe inputs. MASCing models cross-layer routing dependencies with an LSTM-based surrogate, then optimizes a steering matrix to locate behavior-relevant experts. Neither modifies model weights. Both intervene at inference time by adjusting routing-gate activations.
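The frequency comparison can be sketched as follows. This is an illustration under assumptions, not SteerMoE's actual scoring rule: it uses a plain difference of per-expert activation frequencies for a single layer, and the routing traces are made-up data.

```python
from collections import Counter

def activation_frequency(trace, num_experts):
    """Fraction of tokens routed to each expert.

    trace: one list of selected expert indices per token.
    """
    if not trace:
        raise ValueError("trace must contain at least one token")
    counts = Counter(e for token in trace for e in token)
    return [counts[e] / len(trace) for e in range(num_experts)]

def rank_by_frequency_gap(safe_trace, unsafe_trace, num_experts):
    """Sort experts by (safe frequency - unsafe frequency), largest gap first."""
    safe = activation_frequency(safe_trace, num_experts)
    unsafe = activation_frequency(unsafe_trace, num_experts)
    gaps = [(e, safe[e] - unsafe[e]) for e in range(num_experts)]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)

# Hypothetical top-2 routing traces for a single layer with 4 experts.
safe_trace = [[0, 2], [0, 1], [0, 2]]
unsafe_trace = [[1, 3], [1, 2], [3, 1]]
ranking = rank_by_frequency_gap(safe_trace, unsafe_trace, 4)
print(ranking[0])  # (0, 1.0): expert 0 fires on every safe token and no unsafe token
```

Experts at the top of the ranking are candidates for the behavior associated with the first trace; experts at the bottom are candidates for the opposite behavior.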

SteerMoE: the ICLR-accepted proof

SteerMoE, authored by Fayyaz, Modarressi, Deilamsalehy, Dernoncourt, Rossi, Bui, Schütze, and Peng, appears in the Proceedings of ICLR 2026. It tests expert detection and steering across 11 benchmarks and 6 LLMs. According to the paper, activating the identified “safety experts” raises safety scores by up to 20 percentage points, and faithfulness by up to 27, over the unsteered baseline.

The safety improvement is the constructive direction. The paper also measures the reverse: suppressing those same experts drops safety scores by 41 percentage points. Combined with existing jailbreak methods, the drop reaches 100%, bypassing every tested guardrail. The authors frame this as evidence that expert-level steering is a dual-use tool: the same handle that hardens a model can strip its protections.

MASCing: cross-layer steering masks

MASCing (MoE Activation Steering Configuration) takes a more structured approach to routing intervention. Rather than comparing per-expert activation frequencies in isolation, it trains an LSTM-based surrogate to capture dependencies between routing decisions across layers. A steering matrix is then optimized against this surrogate, and the resulting masks are applied to routing gates during inference.
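The last step, applying a mask to the routing gates, might look like the sketch below. It assumes the mask is a layers-by-experts matrix added to router logits before top-k selection; the article does not specify MASCing's exact formulation, and the mask values are hand-written for illustration rather than optimized against a surrogate.

```python
def top_k(logits, k):
    """Indices of the k largest logits, returned in ascending index order."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])

def route_with_mask(per_layer_logits, steering_mask, k=2):
    """Add a (layers x experts) mask to router logits, then pick top-k per layer.

    Only the routing decision changes; expert weights are untouched.
    """
    routes = []
    for layer_logits, layer_mask in zip(per_layer_logits, steering_mask):
        steered = [x + m for x, m in zip(layer_logits, layer_mask)]
        routes.append(top_k(steered, k))
    return routes

# Hypothetical router logits for one token across 2 layers and 4 experts.
logits = [[0.1, 2.0, -1.0, 1.5],
          [1.2, 0.3, 0.9, -0.5]]
# Hand-written mask (an assumption); an optimized mask would replace it.
mask = [[0.0, -5.0, 0.0, 0.0],   # push expert 1 out of layer 0
        [0.0, 0.0, -5.0, 3.0]]   # swap expert 2 for expert 3 in layer 1
no_mask = [[0.0] * 4 for _ in range(2)]

baseline = route_with_mask(logits, no_mask)
steered = route_with_mask(logits, mask)
print(baseline)  # [[1, 3], [0, 2]]
print(steered)   # [[0, 3], [0, 3]]
```

Because the mask is a fixed matrix, applying it costs one addition per expert per layer, which is consistent with the low inference overhead the paper reports.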

On the defense side, MASCing raises average multi-turn jailbreak defense success from 52.5% to 83.9%, with per-model gains as high as 89.2%, across seven open-source MoE models, according to the paper. The expensive step, surrogate training and mask optimization, happens once offline. Applying the mask at inference adds minimal per-token cost.

MASCing also demonstrates the offensive direction. Configured to enable rather than refuse adult-content requests, it raises average generation success from 52.6% to 82.0%, with per-model gains up to 93.0%. The authors present this as a reconfigurability measurement on open-source models, not a deployed attack.

The dual-use gap in open-weight safety

The uncomfortable finding shared by both papers is not that safety can be steered; that is expected given how MoE routing works. The uncomfortable finding is how little intervention is required. A handful of experts, identified by activation statistics, account for tens of percentage points of safety behavior. No fine-tuning, no weight modification, no retraining. Adjust the routing gates and the model complies.

For open-weight models, this creates a specific asymmetry. Anyone who can run inference can identify safety experts using the paired-input methodology SteerMoE describes. The expertise barrier is low, the compute cost is trivial, and the effect is large (a 41 pp standalone drop measured by SteerMoE, rising to 100% when combined with existing jailbreaks). The same low-cost handle is available to deployers who want to harden safety, but the defensive side gains less: a model whose safety experts are publicly known is easier to attack than a model whose safety behavior is diffuse across the full weight matrix.

Closed-weight models with unknown architectures are a different matter. Both papers test exclusively on open-source MoE models. Whether the same expert concentration holds in models with undisclosed routing structures, and whether those structures are recoverable from black-box access, remains an open question.

What per-expert auditing would require

If safety behavior concentrates in identifiable experts, whole-model safety evaluations are necessary but insufficient. A model that passes a safety benchmark may still have identifiable experts whose suppression defeats that same benchmark. The audit surface expands from “does the model refuse harmful prompts?” to “can the model’s refusal behavior be surgically suppressed at the routing level?”

In practice, per-expert auditing would mean running paired safe/unsafe prompt sets through the model, recording expert activation frequencies, identifying safety-associated experts, and testing whether suppressing them degrades safety scores. SteerMoE and MASCing both provide the methodology for this. The gap is in tooling and standards: no current safety certification framework requires per-expert analysis, and no benchmark suite tests for routing-level robustness.
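The suppress-and-remeasure half of that loop can be sketched on a toy model. Everything here is a stand-in: `toy_refuses` replaces real generation plus a safety judge with the hypothetical rule that the model refuses only when expert 0 is routed to, and the prompt logits are invented.

```python
def top_k(logits, k, suppressed=()):
    """Top-k expert indices, skipping any suppressed experts."""
    allowed = [i for i in range(len(logits)) if i not in suppressed]
    return sorted(allowed, key=lambda i: logits[i], reverse=True)[:k]

def toy_refuses(logits, suppressed=(), k=2, safety_expert=0):
    """Stand-in for 'generate, then judge the output' (toy assumption):
    the model refuses only when the safety expert is among the routed experts."""
    return safety_expert in top_k(logits, k, suppressed)

def refusal_rate(prompts, suppressed=()):
    """Share of prompts the toy model refuses under the given suppression."""
    return sum(toy_refuses(p, suppressed) for p in prompts) / len(prompts)

def audit(prompts, candidates):
    """Refusal-rate drop caused by suppressing each candidate expert alone."""
    base = refusal_rate(prompts)
    return {e: base - refusal_rate(prompts, suppressed=(e,)) for e in candidates}

# Hypothetical router logits, one row per harmful prompt, 4 experts.
harmful_prompts = [
    [2.0, 0.5, 1.0, -1.0],
    [1.8, 1.9, 0.2, 0.0],
    [0.1, 1.5, 1.4, 0.3],   # safety expert not selected even at baseline
    [2.5, -0.5, 0.4, 1.1],
]
report = audit(harmful_prompts, candidates=[0, 1, 2, 3])
print(report)  # {0: 0.75, 1: 0.0, 2: 0.0, 3: 0.0}
```

A real audit would replace the toy rule with model generations scored by a safety benchmark, but the report has the same shape: one robustness number per candidate expert rather than a single whole-model score.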

Open questions

Transferability is the biggest unknown. Both frameworks demonstrate expert-level steering on specific open-source MoE models. Whether the identified safety experts generalize across model families, training runs, or quantization levels is not established. If safety experts shift during fine-tuning or continued pretraining, every fine-tune could create a new set that needs independent identification.

The closed-model question is unresolved. Black-box output access does not directly reveal expert activation patterns, but side-channel analysis (timing differences, output distribution shifts under adversarial prompting) may provide enough signal to approximate which routing paths are safety-relevant. Whether that approximation is sufficient for effective steering is speculative.

Both papers measure on benchmarks. Benchmark safety scores are a proxy for real-world refusal behavior, and the gap between the two is well-documented in the alignment literature. A model that scores 83.9% on a multi-turn jailbreak defense benchmark may still fail on adversarial prompts outside the benchmark’s distribution, with or without expert-level steering applied.

Frequently Asked Questions

How does expert-level steering differ from activation steering in dense models?

In dense LLMs, activation steering modifies continuous hidden-state representations across all layers, requiring empirical selection of injection points and scaling factors. Expert-level steering exploits the discrete routing structure unique to MoE architectures: you toggle which expert modules receive tokens, not the representations inside them. This makes the intervention more surgical but entirely architecture-dependent. The methodology cannot transfer to a dense model because there are no routing gates to manipulate.

What does identifying safety experts actually cost in compute?

SteerMoE’s identification step requires a single forward pass over paired safe and unsafe prompt sets while recording per-expert activation frequencies. No gradient computation or backpropagation is involved, putting the cost on par with one benchmark evaluation run. MASCing’s surrogate training and mask optimization is more expensive but runs once offline. After that, applying the steering mask to routing gates adds negligible per-token overhead during inference. Whatever barrier remains is methodological, not a matter of hardware.

Do identified safety experts survive quantization or compression?

Neither SteerMoE nor MASCing tests whether expert-level safety behavior persists through common compression methods such as INT4, GPTQ, or AWQ. Quantization can shift routing decisions by changing the numerical precision of gating computations, which means the safety-associated experts identified in a full-precision model may not correspond to the same modules in a compressed variant. Each quantized checkpoint would likely require an independent identification pass, multiplying the auditing burden for deployers who distribute multiple model formats.
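A toy example shows how reduced precision could flip a routing decision. It assumes a naive uniform rounding of router weights to a coarse grid, which is far cruder than GPTQ or AWQ; the weights and hidden state are invented, and whether real quantizers flip routes in a given model is an empirical question.

```python
def quantize(weights, step):
    """Round each weight to the nearest multiple of `step` (toy uniform quantizer)."""
    return [[round(w / step) * step for w in row] for row in weights]

def router_logits(weights, hidden):
    """One logit per expert: dot product of that expert's router row with the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def top_1(logits):
    """Index of the highest-scoring expert."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical 2-expert router and token hidden state.
router = [[0.65, 0.00],
          [0.36, 0.32]]
hidden = [1.0, 1.0]

full_precision_choice = top_1(router_logits(router, hidden))
quantized_choice = top_1(router_logits(quantize(router, 0.25), hidden))
print(full_precision_choice, quantized_choice)  # 1 0
```

At full precision the logits are about 0.65 and 0.68, so expert 1 wins; after rounding they become 0.75 and 0.5, so expert 0 wins. An expert list identified on the first model would point at the wrong module in the second.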

What is the strategic problem with publicly hardening safety experts?

Both papers demonstrate that the same paired-input methodology used to identify and reinforce safety experts also reveals which experts to suppress. If a deployer publishes the results of per-expert safety hardening, or if the hardened model’s routing patterns can be recovered through probing, the attacker’s search space collapses from the full model to a handful of identified modules. The pace of publication compounds the problem. The arXiv cs.CL listings alone show 228 entries on a single day in June 2026, and SteerMoE and MASCing appeared within weeks of each other, which suggests that independent frameworks for expert identification may be arriving faster than defenses against expert suppression can be developed.