One Learning Rate Doesn't Fit All: Heavy-Tail Layerwise LR Schedules for LLM Pretraining

Every Transformer layer in a pretraining run gets the same learning rate. That convention is cheap to implement and easy to tune, but it papers over an asymmetry: different layer types develop different spectral profiles during training, so a single step size is too aggressive for some layers and too conservative for others. arXiv:2605.22297, accepted at ICML 2026, proposes a fix grounded in random-matrix theory, and the numbers suggest it works.

Uniform LR leaves layers behind

Standard pretraining recipes (AdamW with cosine decay, warmup, and a global peak LR) treat every parameter matrix identically. This works well enough when all you need is convergence, but Transformer layer types do not behave the same way under gradient updates. Their spectral profiles diverge as training progresses. A single LR splits the difference.
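For reference, the uniform recipe described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name `uniform_lr` and its parameters are hypothetical, and linear warmup is assumed because it is the most common choice.

```python
import math

def uniform_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Global LR at `step`: linear warmup to `peak_lr`, then cosine decay to `min_lr`.

    Every parameter matrix receives this same value, which is the convention
    the article questions.
    """
    if step < warmup_steps:
        # Linear warmup (assumed); reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

A layerwise scheme keeps this schedule as the global envelope and multiplies it by a per-layer factor.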

The practical consequence: some layers overshoot into noisy territory, while others never reach their optimal basin. The model still converges, because the loss surface is forgiving enough at the scales people typically train at. It just converges to a worse minimum than it could.

How LLR reads the spectral tail

The paper’s method, called LLR (Layer-wise Learning Rate), draws on Heavy-Tailed Self-Regularization (HT-SR) theory. The core diagnostic measures the heavy-tailedness of each layer’s empirical spectral density (ESD), computed from its weight correlation matrix. Layers with stronger heavy-tailedness get assigned smaller learning rates; layers with weaker heavy-tailedness get larger ones.

A heavy-tailed ESD indicates that a layer’s weight matrices are more correlated and more structurally rigid. Large gradient steps in such layers risk perturbing already-settled structure. Lighter-tailed layers have more room to move and benefit from larger steps.

What PL_Alpha_Hill actually measures

The metric driving every per-layer rate is PL_Alpha_Hill, a power-law exponent fit to the tail of each weight matrix’s eigenvalue spectrum using a Hill estimator. Heavy-Tailed Self-Regularization theory, developed by Martin and Mahoney from the empirical observation that well-trained networks accumulate a power-law tail in their weight spectra, reads a small alpha (a heavier tail) as a layer that has done more feature learning and settled into correlated structure. A large alpha (a lighter, more Gaussian tail) marks a layer that still looks close to its random initialization. LLR inverts that signal into a step size: low alpha gets a smaller rate, high alpha gets a larger one.
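A Hill estimate of the tail exponent can be sketched as follows. This is an illustrative version under stated assumptions, not the paper’s implementation: the function names `esd` and `pl_alpha_hill` are hypothetical, and the default tail size `k` (the upper half of the spectrum) is an assumption chosen for simplicity.

```python
import numpy as np

def esd(weight):
    """Eigenvalues of W^T W (the squared singular values of W), in ascending order."""
    s = np.linalg.svd(np.asarray(weight, dtype=np.float64), compute_uv=False)
    return np.sort(s ** 2)

def pl_alpha_hill(eigs, k=None):
    """Hill estimate of the power-law exponent from the k largest eigenvalues.

    alpha = 1 + k / sum_i ln(lambda_i / lambda_threshold), where the sum runs
    over the k largest eigenvalues and lambda_threshold is the next one below.
    """
    eigs = np.sort(np.asarray(eigs, dtype=np.float64))
    n = eigs.size
    if k is None:
        k = n // 2  # assumption: treat the upper half of the spectrum as the tail
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    tail = eigs[n - k:]          # the k largest eigenvalues
    threshold = eigs[n - k - 1]  # largest eigenvalue just below the tail
    if threshold <= 0:
        raise ValueError("threshold eigenvalue must be positive")
    return 1.0 + k / np.sum(np.log(tail / threshold))
```

Because the estimate depends only on ratios of eigenvalues, dividing the correlation matrix by the matrix dimension does not change it. A spectrum whose top eigenvalues spread further above the threshold yields a smaller alpha, which matches the “heavier tail, smaller alpha” reading in the text.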

That framing matters because it separates two things gradient-norm and weight-norm heuristics conflate. A layer can have large updates because it is genuinely undertrained, or because it is rigid and being shoved around by an oversized step. The spectral tail distinguishes the two cases. It is also why a fixed rule like “give attention 0.8x” misses: the heavy-tailedness of a given layer drifts over the run, and the assignment has to drift with it.
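One simple way to turn per-layer alpha values into step sizes is a linear map onto a bounded range of multipliers, recomputed whenever the diagnostic is refreshed. The sketch below is an assumption for illustration only: the function name `lr_multipliers`, the linear form, and the bounds `s_min` and `s_max` are hypothetical, and the paper’s actual mapping may differ.

```python
def lr_multipliers(alphas, s_min=0.5, s_max=1.5):
    """Map per-layer alpha to an LR multiplier in [s_min, s_max].

    The lowest alpha (heaviest tail) gets s_min, the highest alpha gets s_max,
    and everything in between is interpolated linearly (an assumed form).
    """
    lo, hi = min(alphas.values()), max(alphas.values())
    if hi == lo:
        # All layers look alike: fall back to the midpoint multiplier.
        return {name: 0.5 * (s_min + s_max) for name in alphas}
    span = hi - lo
    return {
        name: s_min + (alpha - lo) / span * (s_max - s_min)
        for name, alpha in alphas.items()
    }

# Hypothetical alpha values measured at one checkpoint.
multipliers = lr_multipliers({"attn.q": 2.0, "embed": 3.0, "ffn.up": 4.0})
```

Calling the function again at a later checkpoint with fresh alpha values is what lets the assignment follow a layer whose tail changes during the run.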

LLR is not the first method to turn this signal into a hyperparameter. TempBalance (NeurIPS 2023) already used ESD-derived alpha to schedule per-layer learning rates, but it was validated on ResNets and VGGs over CIFAR and TinyImageNet, not Transformer pretraining. AlphaDecay (NeurIPS 2025) applied the same diagnostic to weight decay rather than learning rate, assigning weaker decay to heavy-tailed modules in LLM pretraining up to 1B parameters. LLR is the learning-rate analog carried to LLM scale, and the honest framing is that it extends an existing HT-SR line rather than inventing spectral-guided LR from scratch. The contribution is showing the signal holds up against a properly tuned Transformer baseline where earlier layerwise schemes did not.

Results across scales

The paper runs experiments on LLaMA architectures from 60M to 3B parameters, trained on FineWeb data up to 100B tokens, using both AdamW and Muon optimizers. The headline numbers:

Scale	Metric	Uniform LR	LLR	Delta
1B	Avg zero-shot accuracy	47.09%	49.02%	+1.93 pp
3B	Avg zero-shot accuracy	48.58%	50.61%	+2.03 pp

The paper also reports a 1.5x training speedup, and it is concrete in the token budget: LLR matches the uniform baseline’s final quality at roughly 4.5B and 10.5B tokens in the two configurations reported, where uniform needs the full run, so it reaches the target in about two-thirds of the steps. Both optimizers show consistent gains, and for Muon that matters more than it first appears. Muon already orthogonalizes its momentum updates through a Newton-Schulz iteration, a spectral operation in its own right, so a layerwise rate that also reads the weight spectrum could in principle be redundant. The fact that LLR still lowers Muon’s validation perplexity at 60M and 135M says the two are correcting different things: Muon conditions the update direction, LLR sizes the step per layer.

Why prior layerwise methods did not stick

Per-layer learning rates are not a new idea. LARS, LAMB, and the blockwise sharpness scheme from Wang et al. (2025) all tried to address layer heterogeneity. The v3 revision benchmarks LLR against all of them and identifies a common failure mode: prior methods only beat the uniform baseline when that baseline was tuned to a suboptimal LR. Against a properly tuned uniform schedule, they lost.

Among the methods benchmarked, LLR is the only one to outperform a properly tuned uniform LR schedule. The difference appears to come from the diagnostic quality of the spectral signal. Methods that rely on gradient norm or weight norm heuristics conflate “large update” with “needs a large update.” The spectral heavy-tailedness metric separates structural rigidity from mere scale.

The v3 revision widened the comparison set in a way the earlier draft did not. [Updated June 2026] It now includes muP variants (the table lists Mup-AdamW and CompleteP-AdamW), AdamMini, and two methods from the same Heavy-Tailed Self-Regularization line (AlphaDecay and TempBalance). LLR reports outperforming all of them. That closes the open question the original muP comparison raised: muP scaling and spectral LLR are no longer untested against each other in this paper, and the spectral diagnostic comes out ahead on the reported runs. The paper still does not resolve the deeper theoretical question, whether width-derived muP scaling and runtime ESD diagnostics capture the same phenomenon or different ones, because a baseline table is not a mechanistic comparison.

Adoption friction is low

One of the more practical aspects of LLR: the hyperparameters tuned for the uniform baseline transfer nearly optimally to the layerwise schedule. A team that has already tuned peak LR, warmup, and cosine decay for a given architecture can plug in LLR without restarting the hyperparameter search from scratch. The public code implements the method in a form compatible with standard PyTorch training loops, and the spectral diagnostic runs at periodic checkpoints (not every step), keeping overhead bounded.

The complementary angle: GPAS (arXiv:2506.22049), a concurrent technique that scales down activations to fix Pre-LN variance growth, addresses a different failure mode (residual-stream signal drowning) but targets the same class of problem. LLR mitigates optimizer step-size heterogeneity. Together, they suggest that vanilla AdamW + cosine decay leaves multiple independent sources of accuracy on the table.

Framework defaults will need to catch up

The second-order consequence is not about any single paper. It is about the configuration surface that training frameworks expose.

Megatron-LM and torchtitan surface warmup, peak LR, and cosine decay with a minimum LR as top-level knobs, and PyTorch’s native AdamW paired with its built-in LR schedulers covers the same ground. Per-layer LR multipliers exist in PyTorch via param groups, but they are not wired into default training recipes. If LLR-style diagnostics become standard practice, framework configs will need first-class support for per-layer LR schedules, not just warmup/cosine knobs plus a manual param-group override.
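The manual param-group override looks roughly like the sketch below. It is framework-free so it runs without PyTorch installed: the helper name `build_param_groups` and the layer names are hypothetical, and the returned list of dicts follows the `params`/`lr` layout that PyTorch optimizers accept as param groups.

```python
def build_param_groups(named_params, multipliers, base_lr):
    """Return one optimizer param group per parameter, each with its own LR.

    `named_params` is an iterable of (name, parameter) pairs, for example the
    output of `model.named_parameters()` in PyTorch.
    """
    groups = []
    for name, param in named_params:
        scale = multipliers.get(name, 1.0)  # unlisted parameters keep the base LR
        groups.append({"params": [param], "lr": base_lr * scale, "name": name})
    return groups

# Stand-in objects instead of real tensors, so the example is self-contained.
named = [("attn.q_proj.weight", object()), ("ffn.up_proj.weight", object())]
groups = build_param_groups(named, {"attn.q_proj.weight": 0.5}, base_lr=1e-3)
```

In a PyTorch loop the list would be passed directly to the optimizer, for example `torch.optim.AdamW(groups)`. A scheduler such as `LambdaLR` scales each group from its own initial LR, so a global warmup and cosine envelope should compose with the per-layer multipliers without extra logic.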

The overhead argument cuts both ways. At 1B parameters, ESD computation is negligible. At 70B, nobody has published wall-clock numbers. Until that gap closes, the adoption path is: try LLR at small scale, verify gains, then decide whether the spectral diagnostic cost is acceptable at production scale. The theory is sound. The engineering question is whether the compute tax at frontier scales is worth the accuracy improvement, and the answer depends on data the community does not have yet.

Frequently Asked Questions

Which layer types get reduced rates and which get increased ones?

All four attention projections (query, key, value, and output, written Att.q/k/v/o in the paper) exhibit stronger heavy-tailed spectral profiles and receive smaller learning rates. [Updated June 2026] Embedding layers and feed-forward network blocks exhibit weaker heavy-tailedness and receive larger rates. The specific metric driving these assignments is PL_Alpha_Hill, a Hill-estimator power-law fit to the tail of each layer’s empirical spectral density, computed from its weight correlation matrix.

How does this differ from muP’s per-layer scaling?

muP (maximal update parameterization) sets per-layer learning rates through width-dependent parameterization rules derived from infinite-width theory. LLR sets them through runtime spectral diagnostics that respond to actual training dynamics of a specific run. The two approaches come from different theoretical foundations (tensor programs vs. random-matrix theory). The v3 revision now includes muP variants (Mup-AdamW and CompleteP-AdamW) as baselines and reports LLR outperforming them on the pretraining runs. [Updated June 2026] That settles the empirical horse race on this benchmark but not the mechanistic question of whether the two methods are correcting the same underlying mismatch.
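To make the contrast concrete, the sketch below shows one commonly cited muP rule for Adam-type optimizers, in which the learning rate of hidden weight matrices scales inversely with width relative to a tuned base model. It is a simplified illustration, not the baseline configuration from the paper: the function name `mup_hidden_lr` is hypothetical, and real muP setups treat embeddings, output layers, and initializations separately.

```python
def mup_hidden_lr(base_lr, base_width, width):
    """LR for a hidden weight matrix under a simplified muP rule for Adam.

    The value depends only on the layer shape, so it is fixed before training
    starts and never reacts to what the weights do during the run.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width

# Tuned at width 256, transferred to width 1024: the hidden LR shrinks 4x.
lr_wide = mup_hidden_lr(base_lr=1e-3, base_width=256, width=1024)
```

A runtime spectral diagnostic differs in kind: two layers with identical shapes receive the same rate under this rule, while an alpha-based rule can separate them once their spectra diverge.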

Does LLR help with fine-tuning or only pretraining?

The paper evaluates LLR exclusively in the pretraining regime (random initialization through 100B tokens on FineWeb). Fine-tuning involves a different loss landscape where most layers are already near a good basin and the spectral signatures may differ substantially from pretraining dynamics. Whether the same heavy-tail diagnostic transfers to supervised fine-tuning or RLHF alignment stages is untested.

What is the practical difference between soft and hard LR switching?

LLR can transition per-layer rates either abruptly (hard switch) or gradually (soft switch) at scheduled checkpoints. On the 135M model, soft switching improves over hard switching by only 0.15 perplexity points. That margin may narrow further at larger scales, which would simplify implementation since hard switching requires no interpolation logic between checkpoints.
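The difference between the two modes can be sketched as a single function. The linear ramp used for the soft switch is an assumption for illustration, as are the function name `switched_multiplier` and the parameter `ramp_steps`; the paper may define its interpolation differently.

```python
def switched_multiplier(old, new, steps_since_checkpoint, ramp_steps=0):
    """Per-layer LR multiplier to use after a diagnostic checkpoint.

    ramp_steps == 0 gives a hard switch: the new value applies immediately.
    ramp_steps > 0 gives a soft switch: a linear ramp from old to new
    (an assumed interpolation form).
    """
    if ramp_steps <= 0 or steps_since_checkpoint >= ramp_steps:
        return new
    t = steps_since_checkpoint / ramp_steps  # fraction of the ramp completed
    return old + t * (new - old)

hard = switched_multiplier(1.0, 0.5, steps_since_checkpoint=0)
soft_mid = switched_multiplier(1.0, 0.5, steps_since_checkpoint=50, ramp_steps=100)
```

The hard path needs no state beyond the latest multiplier, while the soft path has to remember the previous value and the step at which the checkpoint occurred.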

Is LLR the first method to set learning rates from spectral diagnostics?

No. TempBalance (NeurIPS 2023) already scheduled per-layer learning rates from the same Heavy-Tailed Self-Regularization alpha, and AlphaDecay (NeurIPS 2025) applied the diagnostic to per-module weight decay in LLM pretraining. LLR’s contribution is narrower and more useful than novelty: it carries the spectral-LR idea to Transformer pretraining up to 3B parameters and shows it survives against a properly tuned uniform baseline and against muP variants, which is the bar earlier layerwise schemes failed to clear.

Can LLR and GPAS be combined in the same training run?

They address independent failure modes: LLR fixes per-layer step-size mismatch, while GPAS fixes Pre-LN variance growth in the residual stream. No published experiment combines both, but no theoretical conflict exists. A combined approach would add two separate diagnostic and modification passes per checkpoint, and the compute overhead for both together at frontier scales (70B+) is uncharacterized.