Pruning Experts to Shrink MoE Models: Does Attribution-Guided Compression Beat Magnitude?

The sources at hand do not directly answer whether attribution-guided pruning beats magnitude pruning for mixture-of-experts (MoE) models. The most relevant paper, arXiv:2606.20544, is about calibration under distribution shift, not compression. Until the target pruning paper is fetched and checked, claims about shrinking DeepSeek V4, Qwen, or GLM checkpoints by maximizing attribution coverage should be treated as unverified. The immediate takeaway is narrower: routing architecture determines whether per-expert calibration carries over to the full model, and that in turn bears on whether dropping experts is safe.

What do the fetched sources actually say about MoE calibration?

They establish conditions under which a calibrated MoE stays calibrated when the data distribution shifts. In hard-routed MoEs, calibrating each expert is enough to guarantee that the full model is calibrated. In soft-routed MoEs, the same guarantee does not hold: individually calibrated experts can still combine into an uncalibrated aggregate. The paper also proposes an adversarial reweighting method that penalizes calibration errors of the routed aggregate under distribution shift, reporting improvements in the accuracy-calibration tradeoff across model classes, tasks, and shifts arXiv:2606.20544.

Calibration here means the model’s predicted probabilities match the empirical frequency of outcomes. The paper situates this as a prerequisite for trusting model confidence, and notes that enforcing calibration at the individual-expert level has previously been shown to improve both accuracy and calibration in MoE settings arXiv:2606.20544.
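To make the definition concrete, here is a minimal sketch of binned expected calibration error (ECE), one common way to measure the gap between predicted confidence and empirical accuracy. The function name, the equal-width binning, and the default of 10 bins are assumptions for illustration; the sources do not say which estimator they use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over equal-width bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp the index so that conf == 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that says 0.9 and is right nine times out of ten scores near zero; the same confidence with only five hits out of ten scores about 0.4.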

Why does hard vs. soft routing change the compression safety argument?

Hard routing lets you reason about the whole model one expert at a time, while soft routing does not. In a hard-routed MoE, each token is dispatched to a subset of experts and the final output is built from those discrete choices. If every retained expert is calibrated and the routing decision is preserved, the model-level calibration argument carries over. In a soft-routed MoE, the gating network produces a weighted combination of expert outputs, so the aggregate distribution is a mixture whose calibration depends on the weights, not just on each component arXiv:2606.20544.
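The difference between the two regimes can be sketched in a few lines. This is a toy illustration, not any particular model's router: it assumes top-1 hard routing (real hard routers often pick several experts), a softmax gate for the soft case, and expert outputs that are already class-probability vectors. All function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_route(gate_scores, expert_probs):
    """Top-1 routing: the output is exactly one expert's distribution."""
    winner = max(range(len(gate_scores)), key=lambda i: gate_scores[i])
    return list(expert_probs[winner])

def soft_route(gate_scores, expert_probs):
    """Soft routing: the output is a gate-weighted mixture of all experts."""
    weights = softmax(gate_scores)
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]

# Two experts, two classes; the gate prefers expert 0.
gate_scores = [2.0, 0.0]
expert_probs = [[0.9, 0.1], [0.2, 0.8]]
hard_output = hard_route(gate_scores, expert_probs)
soft_output = soft_route(gate_scores, expert_probs)
```

In the hard case the output confidence is one expert's confidence, so a per-expert calibration check speaks directly to it. In the soft case the output confidence is a blend that no single expert produced, so it has to be checked at the aggregate level.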

This distinction matters for pruning because removing an expert changes both the capacity and the mixture weights. A magnitude-based prune might drop experts with small outgoing weights, but if the remaining experts are reweighted to compensate, calibration can drift. An attribution-guided prune would presumably select experts to retain based on their contribution to some coverage objective, yet the fetched sources do not define that objective or report its accuracy-calibration tradeoffs.
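A small sketch shows how dropping one expert shifts the remaining mixture weights. It assumes a soft router whose per-token gate weights sum to one, and it uses mean gate weight as a stand-in for the magnitude criterion; that criterion and the function names are hypothetical, since the sources do not define a specific pruning rule.

```python
def mean_gate_weight(gate_weights):
    """Average gate weight per expert over a batch of tokens."""
    n_tokens = len(gate_weights)
    n_experts = len(gate_weights[0])
    return [
        sum(row[e] for row in gate_weights) / n_tokens
        for e in range(n_experts)
    ]

def smallest_mean_expert(gate_weights):
    """Index of the expert with the lowest mean gate weight (toy magnitude rule)."""
    means = mean_gate_weight(gate_weights)
    return min(range(len(means)), key=lambda e: means[e])

def prune_and_renormalize(gate_weights, keep):
    """Keep only the experts in `keep` and rescale each token's weights to sum to 1."""
    pruned = []
    for row in gate_weights:
        kept = [row[e] for e in keep]
        total = sum(kept)
        if total == 0:
            raise ValueError("token has no weight on any retained expert")
        pruned.append([w / total for w in kept])
    return pruned

# Two tokens, three experts; each row sums to 1.
gate_weights = [[0.6, 0.3, 0.1], [0.5, 0.1, 0.4]]
dropped = smallest_mean_expert(gate_weights)
keep = [e for e in range(3) if e != dropped]
pruned_weights = prune_and_renormalize(gate_weights, keep)
```

After pruning, the first token's weight on expert 0 rises from 0.6 to roughly 0.86, even though that expert itself is unchanged. The aggregate prediction moves with it, which is why calibration has to be rechecked on the pruned mixture rather than assumed from the original.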

How has attribution divergence been used to assess model reliability?

Attribution divergence has been used to detect when models disagree for structurally informative reasons, but the existing source studies clinical tabular prediction, not MoE expert selection. In that work, a cross-model calibrator using attribution divergence signals reduced expected calibration error from 0.254 to 0.080 and replaced uninformative verbalized confidence estimates with patient-specific reliability estimates, all without accessing model internals arXiv:2606.19509.

The result suggests that attribution-based disagreement can be a useful signal for reliability, but it does not establish that the same signal should guide which MoE experts to keep. The domain, model architecture, and task all differ from open-weight language-model MoEs.

What should practitioners verify before pruning experts?

Before dropping experts from a deployed MoE checkpoint, verify three things on the target distribution. First, confirm whether the router is hard or soft, because the calibration guarantee only transfers cleanly in the hard case arXiv:2606.20544. Second, check that the retained experts are calibrated, and ideally stress-test them with an adversarial reweighting guard that penalizes aggregate calibration errors under distribution shift arXiv:2606.20544. Third, look beyond accuracy: two pruned configurations can score similarly on standard metrics yet differ substantially in logical compliance, which accuracy alone does not capture arXiv:2606.20208.

What would the missing pruning paper need to show?

To answer the title’s question, the missing study would need to compare expert retention strategies on named MoE checkpoints and report retained accuracy, calibration error, and memory reduction for both hard- and soft-routed variants. Specifically, it would need to show that selecting experts by attribution coverage dominates a magnitude baseline on the same benchmark suite, and that the advantage holds after distribution shift. The fetched sources do not contain that comparison. Until one is available, the safest position is that hard-routed MoEs can be compressed in principle once the retained experts are calibrated and routing is preserved, while soft-routed ones require an additional aggregate-level guard, and attribution-guided pruning remains a plausible but unproven strategy.

Frequently Asked Questions

Does the calibration guarantee apply to any sparse MoE, or only to hard-routed ones?

It applies specifically to hard-routed MoEs, and the paper shows it holds across a broad class of distribution shifts. In that routing regime, calibrating each expert is sufficient to keep the full model calibrated. Soft-routed MoEs mix expert outputs continuously, so the same per-expert calibration does not transfer to the aggregate.

How is adversarial reweighting different from just temperature-scaling the final probabilities?

Temperature scaling calibrates the aggregate output but does not explicitly penalize calibration errors under distribution shift. The adversarial reweighting method targets the routed aggregate’s calibration error under shift, and the paper reports improvements in the accuracy-calibration tradeoff across multiple model classes, tasks, and distribution shifts.
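For reference, temperature scaling itself is simple enough to sketch. The version below fits a single temperature by grid search over held-out logits, minimizing negative log-likelihood; the grid, the function names, and the use of grid search rather than a gradient-based optimizer are assumptions for illustration. It is the baseline being contrasted here, not the paper's adversarial reweighting method.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(logit_rows, labels, temperature):
    """Mean NLL of the true labels under temperature-scaled probabilities."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        probs = softmax_with_temperature(logits, temperature)
        total -= math.log(max(probs[label], 1e-12))  # guard against log(0)
    return total / len(labels)

def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest held-out NLL."""
    if candidates is None:
        candidates = [0.25 * k for k in range(1, 41)]  # 0.25, 0.5, ..., 10.0
    return min(
        candidates,
        key=lambda t: negative_log_likelihood(logit_rows, labels, t),
    )
```

Because one scalar is fitted on one held-out set, the correction is tied to the distribution that set was drawn from. Nothing in the procedure looks at how the routed aggregate behaves once inputs shift, which is the gap the reweighting approach is described as targeting.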

What ongoing cost does serving a pruned hard-routed MoE impose?

Teams need to keep a calibration-validation loop running on the target distribution and, ideally, an adversarial reweighting guard on the aggregate. The related clinical attribution work reports that a cross-model calibrator built on attribution divergence signals cut expected calibration error from 0.254 to 0.080 arXiv:2606.19509. That result was obtained on tabular clinical data without access to model internals, so it cannot be carried over to language MoEs without revalidation.

Where can a pruned MoE pass accuracy tests and still be unsafe?

Two pruned configurations can have nearly identical accuracy yet differ substantially in logical compliance, which standard accuracy metrics do not capture. In soft-routed MoEs, the aggregate can also be uncalibrated even when every retained expert is individually calibrated, and removing experts shifts the mixture weights on top of that.

What would the missing pruning paper need to show to justify attribution-guided compression?

It would need to compare attribution-guided and magnitude-based expert retention on named checkpoints such as DeepSeek V4, Qwen, or GLM, reporting retained accuracy, calibration error, and memory reduction for both hard- and soft-routed variants after distribution shift. Without that comparison, attribution-guided pruning is plausible but unproven.