μP Hyperparameter Transfer Has an Embedding Layer Hole, New arXiv Paper Says

A new paper on arXiv isolates the single largest reason μP (Maximal Update Parameterization) outperforms standard parameterization with AdamW: the embedding layer learning rate. Teams trusting μP’s default hyperparameter transfer end-to-end have been leaving optimization headroom on the table, and teams that never adopted μP can recover most of its benefit with a single config change.

Three metrics for transfer quality

Dayal Singh Kalra’s arXiv:2605.21486, posted May 20, introduces a framework for evaluating hyperparameter transfer along three axes: scaling law fit quality, robustness to extrapolation errors, and asymptotic loss penalty from parameterization choice. Previous work demonstrated that μP transfer works in practice (Cerebras’ practitioner guide reports roughly 2× compute savings from μTransfer) but didn’t decompose why it works or where it breaks.

The three-metric separation matters because transfer quality is not a single scalar. A set of hyperparameters can produce clean scaling law fits (good loss predictions at larger widths) while being brittle under extrapolation (bad predictions when the width gap between proxy and target grows). The paper measures both and finds they don’t always move together.
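A minimal sketch of how fit quality and extrapolation error can diverge, using a made-up loss curve rather than anything from the paper. The function names, the pure power-law form, and the use of R² in log space are all assumptions chosen for illustration; the paper's own metric definitions may differ, and the third axis (asymptotic loss penalty) is not covered here.

```python
import math

def fit_power_law(widths, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope  # loss is modeled as a * width ** (-b)

def fit_quality(widths, losses, a, b):
    """R^2 of the fitted curve in log space, on the points used for fitting."""
    ys = [math.log(l) for l in losses]
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum(
        (y - math.log(a * w ** (-b))) ** 2 for w, y in zip(widths, ys)
    )
    return 1.0 - ss_res / ss_tot

def extrapolation_error(a, b, width, true_loss):
    """Relative error of the fitted curve at a width outside the fit range."""
    return abs(a * width ** (-b) - true_loss) / true_loss

def synthetic_loss(width):
    # Hypothetical curve with an irreducible floor; NOT data from the paper.
    return 2.0 + 20.0 * width ** -0.5

proxy_widths = [256, 512, 1024, 2048]
proxy_losses = [synthetic_loss(w) for w in proxy_widths]

a, b = fit_power_law(proxy_widths, proxy_losses)
r2 = fit_quality(proxy_widths, proxy_losses, a, b)
err = extrapolation_error(a, b, 16384, synthetic_loss(16384))

print(f"R^2 on proxy widths: {r2:.3f}")
print(f"relative error at width 16384: {err:.3f}")
```

On this synthetic curve the power law fits the proxy widths closely but misses the target width by a much larger margin, because the fitted form ignores the loss floor. A single fit score would hide that gap, which is the case for reporting the two numbers separately.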

The embedding bottleneck

Standard parameterization (SP) scales all layer learning rates uniformly as model width grows. Under SP with AdamW, the embedding layer learning rate becomes the bottleneck: the rate that suits the rest of the network is too low for the embedding layer, and the mismatch induces training instabilities that degrade transfer quality.

μP avoids this by applying different learning-rate scaling rules to different parameter types. The paper’s ablation shows that the overwhelming benefit of μP over SP comes from one effect: maximizing the embedding layer learning rate. The rest of μP’s parameterization machinery contributes, but the embedding layer dominates.

The one-line fix

Scaling the embedding layer learning rate by width, matching μP’s implicit scaling, smooths training and improves hyperparameter transfer. This recovers most of μP’s gains without adopting the full parameterization.

For teams already running μP, nothing changes. For teams on SP with hand-tuned learning rates, the paper identifies a specific knob: the embedding learning rate multiplier, which should scale with model width. The paper quantifies the gap between default SP behavior and μP-optimal behavior at each scale.
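As a rough sketch of what that knob could look like in an SP training script, the helper below splits parameters into two optimizer groups and multiplies the embedding group's learning rate by width / base_width. The helper name, the `base_width` reference point, the linear multiplier, and the substring match on "embed" are all assumptions for illustration; the exact scaling rule is the one given in the paper.

```python
def build_param_groups(named_params, base_lr, width, base_width, embed_key="embed"):
    """Return optimizer param groups with a width-scaled embedding LR.

    named_params: iterable of (name, parameter) pairs, e.g. the output of
                  model.named_parameters() in PyTorch.
    base_width:   hypothetical width at which base_lr was tuned.
    embed_key:    substring used to spot embedding parameters (an assumption;
                  adjust to your model's naming).
    """
    embed_mult = width / base_width  # assumed linear-in-width multiplier
    embed_params, other_params = [], []
    for name, param in named_params:
        if embed_key in name:
            embed_params.append(param)
        else:
            other_params.append(param)

    groups = []
    if other_params:
        groups.append({"params": other_params, "lr": base_lr})
    if embed_params:
        groups.append({"params": embed_params, "lr": base_lr * embed_mult})
    return groups


# Stand-in objects instead of real tensors, to keep the sketch dependency-free.
fake_params = [
    ("tok_embed.weight", object()),
    ("blocks.0.attn.weight", object()),
    ("blocks.0.mlp.weight", object()),
]
groups = build_param_groups(fake_params, base_lr=1e-3, width=1024, base_width=256)
for g in groups:
    print(len(g["params"]), g["lr"])
```

With a real model, the returned list can be passed as the first argument to an optimizer that accepts per-group options, such as `torch.optim.AdamW(groups)`, which honors a per-group `lr` key.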

Weight decay’s double behavior

Adding weight decay improves the quality of scaling law fits. It regularizes the loss landscape and reduces variance in the scaling curve. But in the fixed-tokens-per-parameter regime, weight decay hurts extrapolation robustness. The two effects move in opposite directions, and teams using μTransfer may not notice.

The standard workflow tunes hyperparameters on a small proxy model and extrapolates to the target. If weight decay improves fit quality on the proxy but degrades extrapolation to the target, the extrapolated hyperparameters end up suboptimal. The paper’s three-metric framework catches this because it evaluates fit and robustness separately.

Before your next training run

The paper implies a concrete audit for teams running or considering μP:

If you’re already on μP: no action needed. The embedding LR scaling is handled by the framework. Verify your implementation is current; the Microsoft mup repository tracks updates.
If you’re on SP: scale the embedding learning rate by width before concluding μP is necessary. The paper shows this single change recovers most of the transfer benefit.
If you’re extrapolating across width gaps: evaluate weight decay’s effect on fit quality and extrapolation robustness separately. Better fits don’t guarantee better predictions.
If you’re using an optimizer other than AdamW: the paper’s ablations focus on AdamW. The embedding bottleneck may or may not apply to SGD, Lion, or others. Treat the findings as AdamW-specific until verified.

What this means for the Tensor Programs line

μP emerged from Greg Yang’s Tensor Programs series, which proved that certain parameterizations preserve activation statistics across width and enable zero-shot hyperparameter transfer. The μTransfer paper (Tensor Programs V) reported that hyperparameters tuned on a 40M-parameter proxy and transferred to the 6.7B GPT-3 model outperformed the published numbers for that model, with a tuning cost of only 7% of the total pretraining cost.

Kalra’s paper doesn’t challenge those results. It explains them. The finding that the embedding layer accounts for the dominant share of μP’s advantage suggests the Tensor Programs theory identified the right parameterization but didn’t fully explain why it works in practice. The theoretical guarantee is broad; the practical mechanism is narrow and resides mostly in one layer.

That’s useful for two reasons. It gives practitioners a simpler mental model: μP’s benefit is mostly “scale the embedding learning rate with width.” And it identifies a target for further theoretical work. If the embedding layer is where the action is, the next round of theory should explain why SP under-scales this layer specifically, and whether the same bottleneck appears in architectures beyond transformers.

Frequently Asked Questions

Does the embedding bottleneck show up in state-space models or CNNs?

The ablations run exclusively on decoder-only transformers with AdamW. Architectures like Mamba, RWKV, and convolutional networks are not tested. The paper identifies cross-architecture validation as a target for follow-up work, so the finding should not be assumed to generalize beyond transformers.

If I only scale the embedding LR, how much of Cerebras’ 2× compute savings do I keep?

Cerebras’ 2× figure comes from full μTransfer, where all parameter types receive μP-specific scaling. The paper shows the embedding LR alone accounts for the dominant share of the transfer benefit, but does not publish a compute-savings number for the partial fix. Teams using only the embedding adjustment should expect a significant improvement over stock SP but may still see marginal gains from adopting μP’s remaining parameter-specific rules.

Does the embedding bottleneck exist under SGD or Lion?

The entire ablation suite uses AdamW. SGD scales gradients by a global learning rate without per-parameter second-moment normalization, and Lion uses sign-based updates rather than adaptive estimates, so the mechanism that creates the bottleneck under AdamW could manifest differently or not at all. The finding should be treated as AdamW-specific until replicated with other optimizers.
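The difference between the three update rules can be seen in a toy single-parameter sketch of the first optimizer step. This is a simplified illustration written for this FAQ, not code from the paper: weight decay is omitted, momentum state starts at zero, and the function names are hypothetical.

```python
import math

def sgd_step(grad, lr):
    """Plain SGD: the update is proportional to the raw gradient."""
    return lr * grad

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam/AdamW step (decoupled weight decay omitted)."""
    m = (1 - beta1) * grad          # first-moment estimate from zero state
    v = (1 - beta2) * grad ** 2     # second-moment estimate from zero state
    m_hat = m / (1 - beta1)         # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def lion_first_step(grad, lr, beta1=0.9):
    """First Lion step: the sign of an interpolated momentum, times lr."""
    c = (1 - beta1) * grad          # momentum starts at zero
    if c == 0:
        return 0.0
    return lr * math.copysign(1.0, c)

lr = 0.01
for grad in (1e-3, 10.0):
    print(
        f"grad={grad:g}: "
        f"sgd={sgd_step(grad, lr):.6g}, "
        f"adam={adam_first_step(grad, lr):.6g}, "
        f"lion={lion_first_step(grad, lr):.6g}"
    )
```

SGD's update grows with the gradient, while the Adam and Lion updates stay close to the learning rate regardless of gradient scale. Whether that changes how an embedding learning rate bottleneck appears is exactly what remains untested.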

Where do I find the exact embedding LR multiplier for my model width?

The quantitative scaling rule is in the full paper (arXiv:2605.21486, 38 pages), not in the abstract. The correct multiplier depends on the width relationship between your proxy and target models. Rather than deriving it manually, you can use the Microsoft mup library on GitHub, which applies μP’s scaling rules, including the embedding layer treatment, once you register base shapes with set_base_shapes and switch to one of its optimizers such as MuAdam.