Does Softmax Normalization Limit What Attention Can Represent?

Softmax normalization, the operation that converts raw attention scores into a probability distribution, imposes a provable ceiling on what attention can represent. A paper by Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, and Radu State from the University of Luxembourg formalizes this: softmax’s normalization step constrains the geometric separation between token vectors, and those constraints tighten as the number of attended tokens grows. The paper was accepted at NeurIPS 2025, and its result is not subtle. Teams building on softmax attention stacks are working inside a function class whose boundaries are now known, and whose alternatives are further along than widely assumed. [Updated June 2026]

Softmax’s geometric separation ceiling

The core argument in the paper is structural, not empirical. The authors develop a theoretical framework showing that softmax normalization imposes explicit bounds on the distances and separation criteria that token vectors can maintain after the attention operation. The softmax function maps unbounded logits to a probability simplex, and that mapping, by construction, compresses the range of distinguishable inputs. This is not a conjecture about particular weight configurations; it is a property of the normalization step itself, independent of model width or depth.

The framing matters because softmax has historically been treated as fixed scaffolding in architecture design. Search spaces for neural architecture search, scaling-law studies, and efficiency-focused variants (FlashAttention, grouped-query attention) all retain softmax while optimizing around it. The paper argues the scaffolding is load-bearing in a way that limits the structure it supports.

The uniform-convergence problem

The paper’s most immediately actionable finding concerns what happens as the number of selected tokens increases. The authors demonstrate that the attention mechanism’s ability to distinguish informative tokens declines with the size of the attended set, converging toward a uniform selection pattern. In the limit, softmax attention assigns roughly equal weight to all tokens in the context, regardless of relevance.

This connects to a known phenomenon called “over-smoothing”: after softmax normalization, weight distributions become relatively uniform, assigning non-trivial weight to irrelevant or secondary information and reducing the model’s ability to focus sharply on key tokens. What the paper adds is a formal proof that this is not an implementation artifact or a training failure, but a mathematical consequence of the normalization choice. The more tokens you attend to, the less selective attention can be.
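The dilution effect can be checked numerically. The sketch below is an illustration under a stated assumption, not the paper's proof: logits are drawn uniformly from a bounded range [-B, B] (the bound B = 2 and the random seed are arbitrary choices). Under that assumption, a standard bound says no single weight can exceed e^(2B)/n, so the largest attention weight must shrink as the attended set grows.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def selectivity_stats(n, bound=2.0, seed=0):
    """Largest weight and normalized entropy for n logits drawn from [-bound, bound]."""
    rng = np.random.default_rng(seed)
    logits = rng.uniform(-bound, bound, size=n)
    w = softmax(logits)
    entropy = -(w * np.log(w)).sum()
    # Dividing by log(n) puts entropy on a 0..1 scale, where 1 means uniform.
    return float(w.max()), float(entropy / np.log(n))

BOUND = 2.0  # assumed logit bound B, chosen only for illustration
for n in (16, 256, 4096, 65536):
    max_w, norm_entropy = selectivity_stats(n, bound=BOUND)
    ceiling = np.exp(2 * BOUND) / n  # e^{2B}/n caps every individual weight
    print(f"n={n:6d}  max weight={max_w:.5f}  "
          f"ceiling={ceiling:.5f}  normalized entropy={norm_entropy:.3f}")
```

The ceiling is loose for small n (it can exceed 1), but it falls as 1/n, which is the mechanism behind the loss of selectivity: with bounded scores, no token can hold a fixed share of the attention mass once the context is large enough.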

Gradient sensitivity at low temperature

The paper identifies a second, compounding limitation. The authors show that gradient sensitivity under softmax normalization presents challenges during training, particularly at low temperature settings. Temperature scaling is the standard knob for controlling the sharpness of softmax distributions: low temperature concentrates probability mass on the highest-scoring tokens, while high temperature flattens the distribution. The finding here is that the low-temperature regime, precisely where you would want sharper selection, is where gradient behavior becomes most problematic for optimization.

This is a two-hit problem. The representational ceiling limits what softmax attention can express at inference time; the gradient sensitivity limits how effectively models can be trained to approach even that ceiling in the sharp-selection regime. Together, they define a region of the design space (long contexts, many attended tokens, sharp focus desired) where softmax attention is structurally disadvantaged.
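One way to see why the low-temperature regime is delicate is to look at the Jacobian of softmax with temperature T, which is (diag(p) - p pᵀ) / T for p = softmax(z / T). The sketch below is a generic illustration of that formula and not the paper's own analysis; the two logit vectors are hypothetical. When the top scores are tied, the Jacobian norm grows like 1/T; when one score clearly wins, the gradient vanishes instead.

```python
import numpy as np

def softmax_t(z, temperature):
    """Softmax of z / temperature, computed stably."""
    s = z / temperature
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def jacobian_norm(z, temperature):
    """Spectral norm of d softmax(z / T) / dz = (diag(p) - p p^T) / T."""
    p = softmax_t(z, temperature)
    jac = (np.diag(p) - np.outer(p, p)) / temperature
    return float(np.linalg.norm(jac, ord=2))

tied = np.array([1.0, 1.0, 0.0, -1.0])        # two top scores exactly tied
separated = np.array([2.0, 0.0, -1.0, -2.0])  # one clear winner

for t in (1.0, 0.1, 0.01):
    print(f"T={t:5.2f}  tied: {jacobian_norm(tied, t):10.4f}  "
          f"separated: {jacobian_norm(separated, t):.3e}")
```

Both behaviors are awkward for an optimizer: gradients either blow up near ties or disappear once the distribution saturates, so a small change in scores can move the model between the two.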

Existing off-ramps: softmax-free architectures

Alternatives to softmax attention already exist, and some have strong benchmark numbers. The most developed is SOFT (Softmax-free Transformer with Linear Complexity), from Fudan University, initially published at NeurIPS 2021 and extended in IJCV 2024. SOFT replaces softmax with a Gaussian kernel, eliminating the normalization step entirely. The Huge-Norm variant achieves 83.4% Top-1 accuracy on ImageNet-1K, demonstrating that softmax-free attention is practically viable in at least one domain.

Property	Standard Softmax Attention	SOFT (Gaussian kernel)
Normalization	Softmax over logits	None (kernel-based)
Complexity	Quadratic in sequence length	Linear
ImageNet-1K Top-1 (best variant)	~83% (ViT-Huge, softmax)	83.4% (Huge-Norm)
NLP validation	Ubiquitous	Not reported

The caveat is that SOFT’s published results are vision-only: ImageNet, COCO, ADE20K. Claiming parity or superiority for softmax-free attention in NLP or LLM settings is not yet supported by evidence. The numbers prove the approach works; they do not prove it transfers.
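The kernel substitution described in this section can be sketched in a few lines. This is a minimal illustration of unnormalized Gaussian-kernel attention, not SOFT's implementation: the bandwidth sqrt(d) is an assumption, the function name is hypothetical, and this dense version is quadratic in sequence length, so it does not reproduce whatever approximation gives SOFT its linear complexity.

```python
import numpy as np

def gaussian_kernel_attention(q, k, v, bandwidth=None):
    """Attention weights exp(-||q_i - k_j||^2 / (2 * bandwidth)), with no row normalization."""
    d = q.shape[-1]
    if bandwidth is None:
        bandwidth = np.sqrt(d)  # assumed scale, chosen by analogy with 1/sqrt(d) scaling
    sq_dist = (q ** 2).sum(-1)[:, None] + (k ** 2).sum(-1)[None, :] - 2.0 * q @ k.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative values from rounding
    weights = np.exp(-sq_dist / (2.0 * bandwidth))
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 8))
k = rng.standard_normal((5, 8))
v = rng.standard_normal((5, 8))

_, w = gaussian_kernel_attention(q, k, v)
print("row sums (not forced to 1):", w.sum(axis=-1))

# Appending more keys leaves the weights on the original keys untouched,
# because no sum-to-one constraint ties the entries of a row together.
k_more = np.vstack([k, rng.standard_normal((4, 8))])
v_more = np.vstack([v, rng.standard_normal((4, 8))])
_, w_more = gaussian_kernel_attention(q, k_more, v_more)
print("original weights unchanged:", np.allclose(w_more[:, :5], w))
```

The second check is the point of contrast with softmax: each weight depends only on its own query-key pair, so growing the context does not dilute existing weights.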

Responses from the research community (2025–2026)

Since the paper’s NeurIPS 2025 presentation, several parallel lines of work have addressed the same underlying dispersion problem without abandoning softmax entirely.

Post-norm resharpening. A late-2025 paper (“Post-Norm can Resharpen Attention,” arXiv:2510.08341) argues that applying RMSNorm to the attention output — rather than replacing softmax — partially recovers selectivity lost to dispersion. The authors frame attention dispersion as a precision loss in relative weight differences and show that a post-norm step counteracts it at negligible computational cost. This is a retrofit, not a redesign, and is compatible with existing checkpoints.
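A rough sketch of the retrofit follows. The only detail taken from the description above is that RMSNorm is applied to the attention output; the omission of a learned gain, the random inputs, and the function names are assumptions made for illustration. The sketch shows one visible symptom of dispersion (the output shrinks as weights spread over more tokens) and how RMSNorm restores its scale. It does not reproduce the cited paper's analysis.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rms(x):
    """Root mean square over all entries."""
    return float(np.sqrt((x ** 2).mean()))

def rms_norm(x, eps=1e-8):
    """RMSNorm without a learned gain: divide each row by its root mean square."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def attention_output(n, d=64, seed=0):
    """Single-query softmax attention over n random keys and values."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((1, d))
    k = rng.standard_normal((n, d))
    v = rng.standard_normal((n, d))
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v

for n in (16, 256, 4096):
    out = attention_output(n)
    print(f"n={n:5d}  RMS before={rms(out):.4f}  "
          f"RMS after RMSNorm={rms(rms_norm(out)):.4f}")
```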

Scalable-Softmax and adaptive temperature. Another line of work scales the pre-softmax logits as a function of context length, so that the effective temperature falls as sequence length grows and the resulting sharpening counteracts the uniform-convergence drift. Adaptive temperature variants learn this scaling during training. Both approaches accept the geometric-separation constraint as given and work around it rather than eliminating it.
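A minimal sketch of length-dependent logit scaling, assuming the multiplier takes the form s · log(n) for n attended tokens. That form, the fixed value s = 1.5 (which would be learned in practice), and the toy setup of one relevant token among n - 1 distractors are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def scaled_softmax(z, s=1.5):
    """Softmax over logits multiplied by s * log(n), where n is the number of tokens."""
    n = len(z)
    return softmax(s * np.log(n) * z)

def weight_on_relevant(n, scaled):
    """Weight given to one token with logit 1 among n - 1 tokens with logit 0."""
    z = np.zeros(n)
    z[0] = 1.0
    w = scaled_softmax(z) if scaled else softmax(z)
    return float(w[0])

for n in (16, 256, 4096):
    print(f"n={n:5d}  plain softmax={weight_on_relevant(n, False):.4f}  "
          f"scaled={weight_on_relevant(n, True):.4f}")
```

With plain softmax the relevant token's share is e / (e + n - 1), which decays toward zero. With the scaled version it is n^s / (n^s + n - 1), which stays high when s > 1 because the score gap is amplified as fast as the number of distractors grows.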

α-Entmax. A sparse alternative to softmax, α-entmax allows attention probabilities to collapse to exactly zero for tokens whose scores fall below a threshold determined by the scores themselves (the sparsity parameter α can be fixed or learned). Unlike softmax, where adding a new token always dilutes existing weights, α-entmax can leave salient token weights unchanged while assigning zero mass to irrelevant ones. This directly addresses the over-smoothing mechanism Mudarisov et al. identify, and has been applied to smaller NLP models, though not yet validated at frontier scale.
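The no-dilution property is easy to see in the α = 2 special case, which is sparsemax (the Euclidean projection of the scores onto the probability simplex). The sketch below implements only that special case using the standard sort-and-threshold algorithm; general α requires a different solver, and the example scores are hypothetical.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha = 2): project scores onto the simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    in_support = 1.0 + k * z_sorted > cumsum   # which sorted scores stay nonzero
    k_z = k[in_support][-1]                    # size of the support
    tau = (cumsum[k_z - 1] - 1.0) / k_z        # threshold computed from the scores
    return np.maximum(z - tau, 0.0)

scores = [2.0, 1.8, 0.1, -1.0]
with_extra = scores + [-0.5]                   # append one more irrelevant token

print("sparsemax:           ", sparsemax(scores))
print("sparsemax + 1 token: ", sparsemax(with_extra))
print("softmax:             ", softmax(scores).round(4))
print("softmax + 1 token:   ", softmax(with_extra).round(4))
```

For these scores the two leading tokens receive weights 0.6 and 0.4 under sparsemax, and appending the low-scoring token does not change them. Under softmax every existing weight shrinks when the extra token is added.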

Threshold Differential Attention. A January 2026 preprint (arXiv:2601.12145) combines differential attention — subtracting two softmax distributions to cancel attention sinks — with a threshold mechanism that enforces ultra-sparse patterns. The target failure mode overlaps substantially with the uniform-convergence problem: both diagnose softmax’s sum-to-one constraint as causing probability mass to bleed across irrelevant tokens at long context.

As of mid-2026, the practical landscape is this: full softmax replacement is still confined to vision, while language model practitioners have a growing toolkit of patches (post-norm, scalable logits, sparse alternatives) that mitigate dispersion without the pretraining cost of swapping the attention kernel. Whether any patch closes the representational gap described by the paper’s bounds, or merely shifts where the ceiling sits, remains an open measurement question.

What this means for architecture decisions

The practical question for teams running softmax-based models is whether the representational ceiling identified by Mudarisov et al. is binding for their workload. For short-context tasks, it probably is not. The uniform-convergence problem scales with the number of attended tokens, so workloads with short sequences may never encounter the ceiling in practice.

The calculus shifts for long-context models, retrieval-augmented generation, and any application where the attention matrix is expected to be sparse (a few highly relevant tokens among many irrelevant ones). In those regimes, the geometric-separation constraints and the over-smoothing tendency are more likely to be active bottlenecks. Recent work on sparse attention for million-token contexts illustrates how engineers have tried to work around this by learning which tokens to drop before the normalization step — a strategy that bypasses dispersion rather than solving it. The question is not whether the ceiling exists (it does, provably), but whether current workloads are hitting it.

For architecture-search teams, this result suggests that softmax should be treated as a modeling choice with a known representational cost, not a fixed default. Adding the normalization function to the search space (or at minimum, parameterizing the normalization strategy) is now defensible on theoretical grounds, not just empirical ones.

What remains unproven

Three gaps are worth stating plainly.

First, the paper’s empirical validation was run on GPT-2, which leaves the frontier-scale question open. Whether the uniform-convergence bound binds at 100B+ parameters, with the depth and residual structure of modern LLMs, is unknown. The theory says yes; the experiment has not been run.

Second, the SOFT benchmarks are vision-only. None of the results discussed here demonstrates that replacing softmax with a SOFT-style kernel in a language model (even a small one) preserves or improves downstream NLP performance; the α-entmax results on smaller NLP models are the closest evidence, and they cover a sparse normalizer rather than a normalization-free kernel. Until someone retrains a GPT-2- or LLaMA-scale model with a softmax-free attention variant and benchmarks it, the cost-benefit analysis for NLP practitioners remains theoretical. The gap between a tight theoretical result and an experimentally validated one is a recurring pattern in this area. As attribution patching’s errors show when applied across non-linear boundaries, formal claims about attention behavior often require empirical correction before they translate to useful engineering guidance.

Third, the paper’s geometric-separation argument is distinct from the separate line of work on “softmax bottlenecks” in output vocabulary distributions (the Yang et al. 2018 observation that softmax limits the rank of the output probability matrix). These are different mechanisms with different fixes. Conflating them muddies both analyses.

The honest summary: softmax normalization constrains what attention can express, and the constraint is now formal rather than anecdotal. Off-ramps exist in vision. Whether they work in language, and whether the constraint is tight enough to justify retraining costs for existing stacks, are open empirical questions. Architecture teams building new models should at minimum consider normalization as a first-class design parameter. Teams maintaining deployed softmax stacks should treat the result as a known limitation to monitor, not an urgent reason to rewrite.

Frequently Asked Questions

Is this paper’s finding the same as the Yang et al. softmax bottleneck?

No. Yang et al. (2018) showed that softmax limits the rank of the output probability matrix over the vocabulary, restricting what the model can predict. Mudarisov et al.’s constraint operates on the attention weight distribution over the input sequence, restricting what the model can selectively focus on. They affect opposite ends of the network: one constrains the output projection, the other constrains attention pattern formation. The proposed fixes also diverge. Yang et al. suggest mixture-of-softmaxes to increase output rank; this paper’s framework points toward replacing the normalization step entirely, as SOFT does with its Gaussian kernel.
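The rank limit in the Yang et al. setting can be reproduced on random data. The sketch below is an illustration with hypothetical sizes, not their experiment: with hidden size d, the logit matrix has rank at most d, and subtracting a per-row log-normalizer adds at most one more, so the log-probability matrix has rank at most d + 1 no matter how large the vocabulary is.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, computed stably."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
n_contexts, vocab, d = 200, 100, 8             # hypothetical sizes for illustration

hidden = rng.standard_normal((n_contexts, d))  # one hidden state per context
out_embed = rng.standard_normal((vocab, d))    # output embedding matrix

log_probs = log_softmax(hidden @ out_embed.T)  # shape (n_contexts, vocab)
rank = int(np.linalg.matrix_rank(log_probs, tol=1e-8))

print(f"log-prob matrix shape: {log_probs.shape}, rank: {rank}, d + 1 = {d + 1}")
```

Nothing in this computation involves attention weights over an input sequence, which is why the two limitations call for different fixes.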

What experiment would confirm or refute this at frontier-model scale?

The concrete test would measure the entropy of attention weight distributions across layers as context length increases in a 70B+ parameter model with modern architectural features like RoPE, SwiGLU, and grouped-query attention. If the paper’s bound is tight, entropy should climb toward its maximum (log n for n attended tokens, which corresponds to uniform weighting) as sequence length grows, and the rate of increase should match the theoretical prediction. The GPT-2 validation showed this pattern, but GPT-2 lacks the positional encodings, activation functions, and head-sharing strategies used in current production models, any of which could alter how the normalization constraint manifests.
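The measurement itself is simple once attention weights have been extracted. The sketch below assumes a hypothetical array of shape (layers, heads, queries, keys) whose rows sum to 1, and uses synthetic data in place of a real model. It divides each row's entropy by log(number of keys) so that values are comparable across context lengths. Handling causal masks, where each query sees a different number of keys, is left out.

```python
import numpy as np

def normalized_attention_entropy(attn, eps=1e-12):
    """Entropy of each attention row divided by log(n_keys): 0 = one-hot, 1 = uniform."""
    attn = np.asarray(attn, dtype=float)
    n_keys = attn.shape[-1]
    safe = np.clip(attn, eps, 1.0)               # avoid log(0); zero weights contribute 0
    entropy = -(attn * np.log(safe)).sum(axis=-1)
    return entropy / np.log(n_keys)

def per_layer_summary(attn):
    """Mean normalized entropy per layer, averaged over heads and queries."""
    return normalized_attention_entropy(attn).mean(axis=(1, 2))

# Synthetic stand-in for extracted attention maps: 4 layers, 2 heads, 8 queries.
rng = np.random.default_rng(0)
for n_keys in (32, 512):
    logits = rng.standard_normal((4, 2, 8, n_keys))
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    print(f"n_keys={n_keys:4d}  per-layer normalized entropy:",
          per_layer_summary(attn).round(3))
```

In a real experiment the synthetic block would be replaced by attention maps captured from the model at several context lengths, and the per-layer curves compared against the predicted rate.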

Why has softmax-free attention not been tested on language models?

The bottleneck is pretraining compute, not theory. Swapping the attention kernel requires training from random initialization because learned representations at every layer are coupled to the normalization dynamics. In vision, ImageNet pretraining takes days on a modest GPU cluster, making the experiment tractable. For a frontier language model, pretraining costs run into millions of GPU-hours. No organization has committed that budget without first seeing evidence the approach works at smaller NLP scales, and the SOFT results cover image patches rather than discrete linguistic tokens, so they do not constitute that evidence.

Should inference temperature settings change based on the gradient sensitivity finding?

No. The paper’s gradient-sensitivity result is a training-time problem. At inference, temperature scaling still sharpens or flattens output distributions as intended. The concern is subtler: because models trained with softmax struggle to optimize in the low-temperature regime, the learned weights themselves may be suboptimal for tasks requiring sharp token discrimination. Setting a lower inference temperature does not fix weights that were trained poorly for that regime. The fix, if one is needed, would require changes to the training procedure or the normalization itself, not adjustments at serving time.