What scale vectors actually are
Inside every modern transformer, sandwiched between attention heads and feed-forward layers, sit a handful of learned scale parameters in normalization layers (LayerNorm, RMSNorm). They account for a negligible fraction of total parameters. According to a paper submitted May 26, removing them substantially degrades pre-training performance. The finding reframes these parameters not as passive normalizers but as optimization preconditioners with outsized influence on model behavior.
Normalization layers have two components: a fixed operation that rescales activations (and, in LayerNorm, also centers them), and a learnable vector (the “scale vector,” sometimes called the gamma or weight parameter) that re-scales each feature dimension. The fixed part does the statistical work. The scale vector is where the model learns something, and until now that “something” was poorly understood. The new work by Mingze Wang et al. provides the first systematic study of what these vectors actually do during training and why their influence is so disproportionate to their size.
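To make the two components concrete, here is a minimal sketch of RMSNorm applied to a single feature vector, written in plain Python. The function name `rms_norm` and the `eps` default are illustrative choices, not taken from the paper.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over one feature vector: fixed rescaling, then a learned per-feature scale."""
    # Fixed, parameter-free step: divide by the root mean square of the features.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normalized = [v / rms for v in x]
    # Learnable step: the scale vector (gamma) re-scales each feature dimension.
    return [g * v for g, v in zip(gamma, normalized)]

x = [3.0, 4.0]
unit_scale = rms_norm(x, gamma=[1.0, 1.0])     # mean square of the output is ~1
learned_scale = rms_norm(x, gamma=[2.0, 0.5])  # gamma stretches one feature, shrinks the other
```

With `gamma` set to all ones the layer is purely statistical. Anything the model learns in this layer lives in `gamma`.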
Optimization, not expressivity
The central finding is architectural and somewhat counterintuitive. In Pre-Norm transformers (the dominant architecture for current LLMs), scale vectors do not increase model expressivity. They do not expand the set of functions the network can represent. Instead, they improve optimization through what the authors describe as a self-amplifying preconditioning effect on the linear mappings that follow each normalization layer.
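The expressivity claim can be checked by hand in the simplest case. The sketch below assumes the normalized activations feed directly into a bias-free linear map. The helper names `matvec` and `fold_scale_into_weights` are illustrative and do not come from the paper.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fold_scale_into_weights(W, gamma):
    """Absorb the scale vector into the next linear map: W_eff[i][j] = W[i][j] * gamma[j]."""
    return [[w * g for w, g in zip(row, gamma)] for row in W]

x_hat = [0.5, -1.0, 2.0]   # normalized activations (hypothetical values)
gamma = [2.0, 0.5, -1.5]   # scale vector (hypothetical values)
W = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 4.0]]     # the linear map that follows the norm

# Path 1: apply the scale vector, then the linear map.
with_scale = matvec(W, [g * v for g, v in zip(gamma, x_hat)])
# Path 2: no scale vector, but a linear map with gamma folded into its columns.
folded = matvec(fold_scale_into_weights(W, gamma), x_hat)

assert all(abs(a - b) < 1e-12 for a, b in zip(with_scale, folded))
```

Both paths compute the same function, so in this setting the scale vector adds no new functions to the model. What it changes is the parameterization the optimizer works in.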
This matters because it recasts scale vectors from “just another set of learned weights” into a structural lever on the training dynamics. The scale vector adjusts the magnitude of activations flowing into the next linear layer, and that linear layer’s weight magnitudes co-adapt with it during training. Remove the scale vector, and the preconditioning collapses. The downstream weights lose their calibrated input scale, and training degrades even though the model’s representational capacity is technically unchanged.
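One standard way to see how a redundant scale can act as a preconditioner is the following derivation. It assumes plain gradient descent with learning rate \eta and the scale vector \gamma held fixed during the step. It is an illustration under those assumptions, not the authors' analysis, which may treat adaptive optimizers and the joint dynamics of \gamma and W differently. Here \hat{x} is the normalized activation, W the following linear map, and W_eff the matrix the network effectively applies.

```latex
\begin{aligned}
y &= W(\gamma \odot \hat{x}) = W_{\mathrm{eff}}\,\hat{x},
  \qquad W_{\mathrm{eff}} = W\,\mathrm{diag}(\gamma), \\
\frac{\partial \mathcal{L}}{\partial W}
  &= g\,(\gamma \odot \hat{x})^{\top} = g\,\hat{x}^{\top}\mathrm{diag}(\gamma),
  \qquad g = \frac{\partial \mathcal{L}}{\partial y}, \\
\Delta W &= -\eta\,\frac{\partial \mathcal{L}}{\partial W}, \\
\Delta W_{\mathrm{eff}} &= \Delta W\,\mathrm{diag}(\gamma)
  = -\eta\,g\,\hat{x}^{\top}\mathrm{diag}(\gamma)^{2}
  = -\eta\,\frac{\partial \mathcal{L}}{\partial W_{\mathrm{eff}}}\,\mathrm{diag}(\gamma^{2}).
\end{aligned}
```

In this simplified setting, column j of the effective matrix is updated with learning rate \eta\gamma_j^2 rather than \eta. The function class is the same, but the scale vector sets a per-feature step size for the weights downstream of it.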
The paper further distinguishes between Input-Norm and Output-Norm layers, and finds that weight decay interacts with them in opposite directions. Weight decay is beneficial for Input-Norm scale vectors but actively harmful for Output-Norm scale vectors, owing to their distinct roles. This has direct implications for anyone designing training recipes: a single weight-decay hyperparameter applied uniformly across all normalization layers is, according to these results, leaving performance on the table.
Three improvements, one unified strategy
Building on the mechanistic understanding of scale vectors, the paper proposes three lightweight modifications:
Branch-specific heterogeneity. Rather than sharing scale vectors across branches of a residual connection, give each branch its own. The parameters are negligible in count, so the added cost is effectively zero.
Improved placement around linear mappings. Positioning normalization layers more deliberately relative to the linear layers they precondition yields consistent gains, aligning with the preconditioning interpretation.
Magnitude-direction reparameterization. Decomposing the scale vector into separate magnitude and direction components allows the optimizer to adjust each independently.
The authors report consistent individual gains from each of the three modifications.
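The article does not give the paper's exact parameterization, so the following is only a sketch of what a magnitude-direction split can look like. It borrows the familiar weight-normalization form, a scalar magnitude times a unit-norm direction, as an assumption. The function name is hypothetical.

```python
import math

def scale_from_magnitude_direction(magnitude, direction):
    """Rebuild a scale vector as magnitude * direction / ||direction||."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [magnitude * d / norm for d in direction]

gamma = scale_from_magnitude_direction(2.0, [3.0, 4.0])
# Rescaling the direction parameter leaves gamma unchanged; only `magnitude` sets its length.
same_gamma = scale_from_magnitude_direction(2.0, [30.0, 40.0])
```

Under this form the optimizer can change the overall size of the scale vector through one scalar and its per-feature shape through the direction, without one update disturbing the other.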
The unified strategy, combining all three, was evaluated on dense and mixture-of-experts architectures from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The result: consistently lower terminal pre-training loss than well-tuned baselines, with negligible parameter overhead (Wang et al., arXiv:2605.26895).
Why quantization pipelines should care
Current LLM quantization techniques, including SmoothQuant (W8A8 post-training quantization), QuIP (2-bit weight compression), and EfficientQAT, treat all weight categories uniformly. A weight matrix gets quantized to the target precision regardless of whether it belongs to an attention head, a feed-forward layer, or a normalization scale vector.
The scale-vector findings suggest this uniformity is a liability. If scale vectors act as preconditioners whose magnitude carries disproportionate optimization signal, then compressing them at the same bit-width as generic weights risks degrading precisely the parameters that matter most for training dynamics. The parameter count is negligible, so protecting scale vectors at higher precision while aggressively quantizing everything else would add almost nothing to the memory footprint.
This has not been directly tested. The paper demonstrates that removing scale vectors degrades training; the quantization-risk argument is an inference from that result. But the economics are straightforward. If you are selecting which parameters to protect in a mixed-precision scheme, the parameters shown to have outsized influence on optimization are a logical place to start, and the overhead of protecting them is near-zero because there are so few of them.
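The "near-zero overhead" claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes a dense transformer with roughly 12 * d_model^2 matrix weights per block (attention plus a 4x-wide MLP), two normalization layers per block, and one final norm, ignoring embeddings and biases. The model dimensions are hypothetical, not a configuration from the paper.

```python
def scale_vector_overhead(d_model, n_layers, norms_per_layer=2,
                          low_bits=4, high_bits=32):
    """Return (fraction of params that are scale vectors, relative memory overhead
    of keeping them at high_bits while everything else is stored at low_bits)."""
    # Rough count of matrix weights: ~12 * d_model^2 per block (an approximation).
    matrix_params = 12 * d_model * d_model * n_layers
    # One scale vector of length d_model per norm, plus one final norm.
    scale_params = d_model * (norms_per_layer * n_layers + 1)
    total = matrix_params + scale_params
    all_low_bits = total * low_bits
    mixed_bits = matrix_params * low_bits + scale_params * high_bits
    return scale_params / total, mixed_bits / all_low_bits - 1.0

# Hypothetical 32-layer model with d_model = 4096.
fraction, overhead = scale_vector_overhead(d_model=4096, n_layers=32)
print(f"scale-vector share of parameters: {fraction:.6%}")
print(f"extra memory for FP32 scale vectors vs all-INT4: {overhead:.6%}")
```

Under these assumptions the extra memory comes out well under a tenth of a percent, which is why protecting scale vectors is cheap even if the benefit has yet to be measured.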
Why interpretability and safety teams should care
Mechanistic interpretability tooling has made progress identifying circuits and features inside transformers. Libraries like TransformerLens, nnsight, SAE Lens, Pyvene, and repeng target activation-level features and learned feature directions, typically discovered through sparse autoencoders or probing classifiers.
Scale vectors represent a different kind of target. Their location is fixed by the architecture rather than discovered post hoc, so you do not need an autoencoder or a trained probe to find them. They sit in every normalization layer, their parameter count is tiny, and their influence on training, according to the paper, is outsized. For safety teams looking to make targeted behavioral edits, a parameter class that is both locatable and influential is a candidate lever.
This is distinct from task vectors, activation steering vectors, and control vectors from representation engineering. Those are computed or extracted after training and applied as interventions. Scale vectors are learned during training, baked into the model’s weight structure, and their behavioral influence flows from their role in the optimization process itself.
What remains untested
The paper leaves several important questions open. The most obvious: does the unified scale-vector strategy hold at frontier scales? The experiments stop at 2B parameters. The computational cost of confirming at 70B or 405B is nontrivial, and the interaction between scale-vector preconditioning and the distributed training regimes used at those scales is unknown.
No downstream task benchmarks are reported. Pre-training loss is a proxy, and a useful one, but practitioners need to know whether the loss improvement transfers to measurable gains on the evaluations their products depend on.
Most critically for the compression and safety angles: no quantization experiments incorporating scale-vector-aware precision schemes have been conducted. The argument that uniform quantization risks clobbering these parameters is sound in principle, but the actual degradation from quantizing scale vectors at a given bit-width has not been measured. Until someone runs the ablation, the quantization-risk claim remains an inference, not a result.
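A first step toward that ablation could be as simple as measuring round-trip error on the scale vectors themselves. The sketch below uses symmetric per-tensor round-to-nearest quantization, chosen for illustration rather than taken from any particular library. It measures parameter error only, not the effect on model loss, and the example values are hypothetical.

```python
def fake_quantize(values, bits):
    """Symmetric per-tensor round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    step = max_abs / qmax  # width of one quantization bin
    return [max(-qmax, min(qmax, round(v / step))) * step for v in values]

def max_abs_error(values, bits):
    """Largest per-element error introduced by the quantize/dequantize round trip."""
    return max(abs(a - b) for a, b in zip(values, fake_quantize(values, bits)))

# Hypothetical scale vector with one large entry, which widens every bin.
gamma = [0.9, 1.1, 1.0, 4.0]
for bits in (8, 4):
    print(bits, "bits -> max abs error", max_abs_error(gamma, bits))
```

A real experiment would swap the quantized vectors into a trained model and compare evaluation loss across bit-widths. This snippet only shows how cheaply the parameter-level measurement can be set up.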
Frequently Asked Questions
Does the scale-vector analysis apply to Post-Norm transformers, or only Pre-Norm?
The expressivity finding is specific to Pre-Norm designs, which dominate current LLMs. In Post-Norm architectures, normalization is applied after the residual addition rather than before, which changes how the preconditioning effect propagates through subsequent layers. The paper does not analyze Post-Norm behavior separately, so whether scale vectors play a different role there remains unknown.
What would a training team need to change to act on the weight-decay finding?
Most training runs use a single weight-decay coefficient applied globally through AdamW. Acting on the Input-Norm vs Output-Norm result requires separate decay coefficients per parameter group, which PyTorch supports via optimizer parameter groups but which few production training recipes configure at this granularity. The code change is small, but it demands independent re-tuning of the decay value for each group, which enlarges the hyperparameter search space.
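As a rough illustration of how small the code change is, the helper below sorts named parameters into three groups with separate decay coefficients. Everything specific here is an assumption: the helper name, the name-matching predicates, the module names in the example, and the decay values are placeholders, and how the paper defines Input-Norm and Output-Norm layers must be supplied by the caller.

```python
def build_norm_decay_groups(named_params, is_input_norm, is_output_norm,
                            base_decay=0.1, input_norm_decay=0.1,
                            output_norm_decay=0.0):
    """Split (name, param) pairs into optimizer parameter groups with separate weight decay."""
    groups = {
        "input_norm": {"params": [], "weight_decay": input_norm_decay},
        "output_norm": {"params": [], "weight_decay": output_norm_decay},
        "other": {"params": [], "weight_decay": base_decay},
    }
    for name, param in named_params:
        if is_input_norm(name):
            groups["input_norm"]["params"].append(param)
        elif is_output_norm(name):
            groups["output_norm"]["params"].append(param)
        else:
            groups["other"]["params"].append(param)
    # Drop empty groups so the optimizer never receives an empty parameter list.
    return [g for g in groups.values() if g["params"]]

# Stand-in parameters; in practice these would come from model.named_parameters().
example = [
    ("blocks.0.input_norm.weight", "p0"),   # hypothetical module names
    ("blocks.0.output_norm.weight", "p1"),
    ("blocks.0.mlp.up_proj.weight", "p2"),
]
param_groups = build_norm_decay_groups(
    example,
    is_input_norm=lambda n: "input_norm" in n,
    is_output_norm=lambda n: "output_norm" in n,
)
```

The returned list uses the format PyTorch optimizers accept for parameter groups (a list of dicts with a `params` key and per-group options), so with real parameters it could be passed as the first argument to `torch.optim.AdamW`. The zero default for `output_norm_decay` only reflects the direction of the paper's finding. The actual values still need tuning.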
Could the 2B parameter ceiling hide problems that appear at frontier scales?
Models at 70B and above are trained with tensor and pipeline parallelism, distributing scale vectors across devices and synchronizing their gradients across thousands of GPUs. The self-amplifying preconditioning effect could behave differently when gradients are averaged across nodes rather than computed on a single device. The interaction between scale-vector preconditioning and distributed training regimes has no empirical data behind it.
What would scale-vector-aware quantization look like in shipping inference engines?
Protecting scale vectors at FP32 while compressing everything else to INT4 or lower would add negligible memory given their parameter count. However, inference engines such as vLLM and TensorRT-LLM do not expose mixed-precision hooks at normalization-layer granularity. Implementing the pattern requires custom kernels that load and compute with two precision levels inside a single forward pass, a non-trivial systems effort despite the tiny parameter slice involved.
How does magnitude-direction reparameterization relate to adaLN-style normalization in diffusion transformers?
Adaptive layer normalization (adaLN), widely used in diffusion transformers, also changes how normalization parameters are obtained, but it regresses the scale and shift from external inputs like timestep embeddings rather than treating them as optimization levers. The two mechanisms target different problems: adaLN for conditioning, magnitude-direction reparameterization for training dynamics. Whether combining them yields compounding benefits has not been tested.