Every Transformer layer in a pretraining run typically gets the same learning rate. That convention is cheap to implement and easy to tune, but it papers over an asymmetry: different layer types develop different spectral profiles during training, so a single step size is too aggressive for some layers and too conservative for others. arXiv:2605.22297, accepted at ICML 2026, proposes a fix grounded in random-matrix theory, and the reported numbers suggest it works.
Uniform LR leaves layers behind
Standard pretraining recipes (AdamW with cosine decay, warmup, and a global peak LR) treat every parameter matrix identically. This works well enough when all you need is convergence, but Transformer layer types do not behave the same way under gradient updates. Their spectral profiles diverge as training progresses. A single LR splits the difference.
The practical consequence: some layers overshoot into noisy territory, while others never reach their optimal basin. The model still converges, because the loss surface is forgiving enough at the scales people train, but it converges to a worse minimum than it could.
How LLR reads the spectral tail
The paper’s method, called LLR (Layer-wise Learning Rate), draws on Heavy-Tailed Self-Regularization (HT-SR) theory. The core diagnostic measures the heavy-tailedness of each layer’s empirical spectral density (ESD), computed from its weight correlation matrix. Layers with stronger heavy-tailedness get assigned smaller learning rates; layers with weaker heavy-tailedness get larger ones.
A heavy-tailed ESD indicates that a layer’s weight matrices are more correlated and more structurally rigid. Large gradient steps in such layers risk perturbing already-settled structure. Lighter-tailed layers have more room to move and benefit from larger steps.
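The assignment rule above (heavier tail, smaller step) can be sketched as a simple interpolation. This is a minimal illustration, not the paper's formula: it assumes each layer's heavy-tailedness is summarized by a power-law exponent alpha, where a smaller alpha means a heavier tail (the usual HT-SR convention), and it assumes a linear map between hypothetical multiplier bounds `low` and `high`. The layer names and alpha values are made up for the example.

```python
def layerwise_lrs(alphas, base_lr, low=0.5, high=1.5):
    """Map per-layer tail exponents to per-layer learning rates.

    alphas: dict of layer name -> power-law exponent (smaller = heavier tail).
    low, high: hypothetical bounds on the multiplier applied to base_lr.
    The linear interpolation is an assumption made for illustration.
    """
    a_min, a_max = min(alphas.values()), max(alphas.values())
    span = a_max - a_min
    lrs = {}
    for name, alpha in alphas.items():
        # t = 0.0 for the heaviest-tailed layer, 1.0 for the lightest-tailed one.
        t = 0.5 if span == 0 else (alpha - a_min) / span
        lrs[name] = base_lr * (low + t * (high - low))
    return lrs


# Illustrative values only: attention projections heavier-tailed than the MLP.
alphas = {"attn.q_proj": 2.0, "attn.k_proj": 2.5, "mlp.up_proj": 4.0}
lrs = layerwise_lrs(alphas, base_lr=1e-3)
```

With these inputs, the heaviest-tailed layer gets the smallest rate and the lightest-tailed layer the largest, while the uniform baseline `base_lr` sits in the middle of the range.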
Results across scales
The paper runs experiments on LLaMA architectures from 60M to 3B parameters, trained on FineWeb data up to 100B tokens, using both AdamW and Muon optimizers. The headline numbers:
| Scale | Metric | Uniform LR | LLR | Delta |
|---|---|---|---|---|
| 1B | Avg zero-shot accuracy | 47.09% | 49.02% | +1.93 pp |
| 3B | Avg zero-shot accuracy | 48.58% | 50.61% | +2.03 pp |
The paper also reports a 1.5x training speedup: LLR reaches the uniform baseline’s final accuracy in roughly two-thirds of the training steps (1 / 1.5 ≈ 0.67). Both optimizers show consistent gains, which matters because Muon is gaining traction as an AdamW alternative in research settings.
Why prior layerwise methods did not stick
Per-layer learning rates are not a new idea. LARS, LAMB, and the blockwise scheme from Wang et al. (2025) all tried to address layer heterogeneity. The v2 revision of the paper benchmarks LLR against all of them and identifies a common failure mode: prior methods only beat the uniform baseline when that baseline was tuned to a suboptimal LR. Against a properly tuned uniform schedule, they lost.
Among the methods benchmarked, LLR is the only one to outperform a properly tuned uniform LR schedule. The difference appears to come from the diagnostic quality of the spectral signal. Methods that rely on gradient norm or weight norm heuristics conflate “large update” with “needs a large update.” The spectral heavy-tailedness metric separates structural rigidity from mere scale.
The paper does not benchmark against muP (maximal update parameterization), which also adjusts per-layer LR scaling through a different theoretical lens. Whether the two approaches are complementary or redundant is an open question.
Adoption friction is low
One of the more practical aspects of LLR: the per-layer rates transfer nearly optimally from the uniform baseline. A team that has already tuned peak LR, warmup, and cosine decay for a given architecture can plug in LLR without restarting the hyperparameter search from scratch. The public code implements the method in a form compatible with standard PyTorch training loops, and the spectral diagnostic runs at periodic checkpoints (not every step), keeping overhead bounded.
A complementary angle comes from GPAS (arXiv:2506.22049), a concurrent technique that scales down activations to fix Pre-LN variance growth. It addresses a different failure mode (residual-stream signal drowning) within the same class of problem, while LLR mitigates optimizer step-size heterogeneity. Together, they suggest that vanilla AdamW + cosine decay leaves multiple independent sources of accuracy on the table.
Framework defaults will need to catch up
The second-order consequence is not about any single paper. It is about the configuration surface that training frameworks expose.
Megatron-LM, torchtitan, and PyTorch’s native AdamW all surface warmup, peak LR, and cosine/min-decay as top-level knobs. Per-layer LR multipliers exist in PyTorch via param groups, but they are not wired into default training recipes. If LLR-style diagnostics become standard practice, framework configs will need first-class support for per-layer LR schedules, not just warmup/cosine knobs plus a manual param-group override.
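For reference, the manual param-group override looks roughly like the sketch below. PyTorch optimizers accept a list of dicts, each with a `params` list and optional per-group settings such as `lr`. The helper `build_param_groups`, its prefix-matching rule, and the layer names are hypothetical, and strings stand in for real tensors so the example runs without PyTorch installed.

```python
def build_param_groups(named_params, layer_lrs, default_lr):
    """Group parameters by learning rate in the list-of-dicts format
    that torch.optim optimizers accept as their first argument.

    named_params: iterable of (name, parameter) pairs,
                  e.g. model.named_parameters().
    layer_lrs:    dict of name prefix -> learning rate (hypothetical scheme).
    default_lr:   rate for parameters that match no prefix.
    """
    groups = {}
    for name, param in named_params:
        # The longest matching prefix wins, so specific entries override general ones.
        match = max((p for p in layer_lrs if name.startswith(p)),
                    key=len, default=None)
        lr = layer_lrs[match] if match is not None else default_lr
        groups.setdefault(lr, []).append(param)
    return [{"params": params, "lr": lr} for lr, params in groups.items()]


# Stand-in names and placeholder "parameters" for illustration.
named = [
    ("layers.0.attn.q_proj.weight", "W_q"),
    ("layers.0.mlp.up_proj.weight", "W_up"),
    ("norm.weight", "g"),
]
groups = build_param_groups(
    named,
    {"layers.0.attn": 5e-4, "layers.0.mlp": 1.5e-3},
    default_lr=1e-3,
)
```

With a real model, the same list could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters(), layer_lrs, default_lr))`. The point of the section stands: this wiring is left to the user, not exposed as a first-class config option.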
The overhead argument cuts both ways. At 1B parameters, ESD computation is negligible. At 70B, nobody has published wall-clock numbers. Until that gap closes, the adoption path is: try LLR at small scale, verify gains, then decide whether the spectral diagnostic cost is acceptable at production scale. The theory is sound. The engineering question is whether the compute tax at frontier scales is worth the accuracy improvement, and the answer depends on data the community does not have yet.
Frequently Asked Questions
Which layer types get reduced rates and which get increased ones?
Attention projection layers (query and key) exhibit stronger heavy-tailed spectral profiles and receive smaller learning rates. Embedding layers and feed-forward network blocks exhibit weaker heavy-tailedness and receive larger rates. The specific metric driving these assignments is PL_Alpha_Hill, computed from the empirical spectral density of each layer’s weight correlation matrix.
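As a rough illustration of what a Hill-type tail estimate computes, the sketch below applies the standard Hill estimator to a list of eigenvalues. It is not taken from the paper's code: the choice of `k` (how many of the largest eigenvalues count as the tail) defaults to the upper half purely as an assumption, and the eigenvalues of the weight correlation matrix are assumed to have been computed elsewhere (for example with a linear-algebra library).

```python
import math


def pl_alpha_hill(eigenvalues, k=None):
    """Hill estimate of the power-law exponent of an ESD tail.

    eigenvalues: eigenvalues of a layer's weight correlation matrix.
    k: number of largest eigenvalues treated as the tail (assumed default: n // 2).
    A smaller return value indicates a heavier tail.
    """
    lam = sorted(x for x in eigenvalues if x > 0)
    n = len(lam)
    if n < 2:
        raise ValueError("need at least two positive eigenvalues")
    if k is None:
        k = n // 2
    k = max(1, min(k, n - 1))
    threshold = lam[n - k - 1]  # lambda_{n-k} in 1-indexed notation
    # Sum of log-ratios of the k largest eigenvalues to the threshold.
    log_sum = sum(math.log(lam[n - i] / threshold) for i in range(1, k + 1))
    if log_sum == 0:
        raise ValueError("tail eigenvalues are all equal to the threshold")
    return 1.0 + k / log_sum


# Synthetic spectra: the second has a far more spread-out tail.
light = pl_alpha_hill([1.0, 1.1, 1.2, 1.3], k=2)
heavy = pl_alpha_hill([1.0, 2.0, 8.0, 64.0], k=2)
```

On these synthetic inputs `heavy` comes out smaller than `light`, matching the convention that a lower exponent signals a heavier tail and therefore, under LLR, a smaller learning rate.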
How does this differ from muP’s per-layer scaling?
muP (maximal update parameterization) sets per-layer learning rates through width-dependent parameterization rules derived from infinite-width theory. LLR sets them through runtime spectral diagnostics that respond to actual training dynamics of a specific run. The two approaches come from different theoretical foundations (tensor programs vs. random-matrix theory) and the paper does not benchmark them head-to-head, so whether they capture the same phenomenon or complement each other remains unknown.
Does LLR help with fine-tuning or only pretraining?
The paper evaluates LLR exclusively in the pretraining regime (random initialization through 100B tokens on FineWeb). Fine-tuning involves a different loss landscape where most layers are already near a good basin and the spectral signatures may differ substantially from pretraining dynamics. Whether the same heavy-tail diagnostic transfers to supervised fine-tuning or RLHF alignment stages is untested.
What is the practical difference between soft and hard LR switching?
LLR can transition per-layer rates either abruptly (hard switch) or gradually (soft switch) at scheduled checkpoints. On the 135M model, soft switching improves over hard switching by only 0.15 perplexity points. That margin may narrow further at larger scales, which would simplify implementation since hard switching requires no interpolation logic between checkpoints.
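The difference between the two modes is small enough to show in a few lines. The sketch below assumes the soft switch is a linear ramp over a fixed number of steps after each checkpoint. The paper's actual interpolation rule is not described here, so the ramp shape, the function name, and the `ramp_steps` parameter are all hypothetical.

```python
def switched_lr(old_lr, new_lr, steps_since_checkpoint, ramp_steps, mode="soft"):
    """Per-layer learning rate after a diagnostic checkpoint.

    mode="hard": jump to new_lr immediately.
    mode="soft": move linearly from old_lr to new_lr over ramp_steps steps
                 (an assumed interpolation, chosen for illustration).
    """
    if mode == "hard" or ramp_steps <= 0:
        return new_lr
    # Fraction of the ramp completed, clamped to [0, 1].
    frac = min(max(steps_since_checkpoint / ramp_steps, 0.0), 1.0)
    return old_lr + frac * (new_lr - old_lr)


hard = switched_lr(1e-3, 5e-4, steps_since_checkpoint=0, ramp_steps=100, mode="hard")
soft_start = switched_lr(1e-3, 5e-4, steps_since_checkpoint=0, ramp_steps=100)
soft_mid = switched_lr(1e-3, 5e-4, steps_since_checkpoint=50, ramp_steps=100)
soft_end = switched_lr(1e-3, 5e-4, steps_since_checkpoint=200, ramp_steps=100)
```

The hard path is the one-line early return, while the soft path needs the previous rate and a step counter per layer. That extra state is the implementation cost being weighed against the 0.15 perplexity-point gain.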
Can LLR and GPAS be combined in the same training run?
They address independent failure modes: LLR fixes per-layer step-size mismatch, while GPAS fixes Pre-LN variance growth in the residual stream. No published experiment combines both, but no theoretical conflict exists. A combined approach would add two separate diagnostic and modification passes per checkpoint, and the compute overhead for both together at frontier scales (70B+) is uncharacterized.