Softmax normalization, the operation that converts raw attention scores into a probability distribution, imposes a provable ceiling on what attention can represent. A new paper by Tatiana Petrova formalizes this: softmax’s normalization step constrains the geometric separation between token vectors, and those constraints tighten as the number of attended tokens grows. The implication is not subtle. Teams building on softmax attention stacks are working inside a function class whose boundaries are now known, and whose alternatives are further along than widely assumed.
Softmax’s geometric separation ceiling
The core argument in Petrova’s paper is structural, not empirical. She develops a theoretical framework showing that softmax normalization imposes explicit bounds on the distances and separation criteria that token vectors can maintain after the attention operation. The softmax function maps unbounded logits to a probability simplex, and that mapping, by construction, compresses the range of distinguishable inputs. This is not a conjecture about particular weight configurations; it is a property of the normalization step itself, independent of model width or depth.
The framing matters because softmax has historically been treated as fixed scaffolding in architecture design. Search spaces for neural architecture search, scaling-law studies, and most efficiency-focused variants (FlashAttention, grouped-query attention) retain softmax while optimizing around it; kernelized linear attention is the notable exception. Petrova’s result says the scaffolding is load-bearing in a way that limits the structure it supports.
The uniform-convergence problem
The paper’s most immediately actionable finding concerns what happens as the number of selected tokens increases. Petrova demonstrates that the attention mechanism’s ability to distinguish informative tokens declines with the size of the attended set, converging toward a uniform selection pattern. In the limit, softmax attention assigns roughly equal weight to all tokens in the context, regardless of relevance.
This connects to a known phenomenon called “over-smoothing”: after softmax normalization, weight distributions become relatively uniform, assigning non-trivial weight to irrelevant or secondary information and reducing the model’s ability to focus sharply on key tokens. What Petrova adds is a formal proof that this is not an implementation artifact or a training failure, but a mathematical consequence of the normalization choice. The more tokens you attend to, the less selective attention can be.
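The dilution effect can be illustrated with a minimal Python sketch, under one added assumption that does not come from the paper: every logit lies in a fixed range [-bound, bound]. Under that assumption, elementary algebra caps any single weight at 1 / (1 + (n - 1) * exp(-2 * bound)), which shrinks toward zero as the number of attended tokens n grows. The function names, the bound of 3.0, and the random logits are illustrative choices; this is not Petrova’s bound or experiment.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_weight_bound(n, bound):
    """Ceiling on any single softmax weight when all n logits lie in [-bound, bound]."""
    return 1.0 / (1.0 + (n - 1) * math.exp(-2.0 * bound))


def demo(bound=3.0, sizes=(8, 64, 512, 4096), seed=0):
    """Return (n, largest observed weight, theoretical ceiling) for each set size."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        # One "relevant" token at the top of the allowed range,
        # the remaining n - 1 logits drawn uniformly from the range.
        logits = [bound] + [rng.uniform(-bound, bound) for _ in range(n - 1)]
        weights = softmax(logits)
        rows.append((n, max(weights), max_weight_bound(n, bound)))
    return rows


if __name__ == "__main__":
    for n, observed, ceiling in demo():
        print(f"n={n:5d}  max weight={observed:.4f}  ceiling={ceiling:.4f}")
```

Even the best-scoring token cannot hold on to its share of attention once enough competitors are added, unless the logit range is allowed to grow with n.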
Gradient sensitivity at low temperature
Petrova identifies a second, compounding limitation. The paper shows that gradient sensitivity under softmax normalization presents challenges during training, particularly at low temperature settings. Temperature scaling is the standard knob for controlling the sharpness of softmax distributions: low temperature concentrates probability mass on the highest-scoring tokens, while high temperature flattens the distribution. The finding here is that the low-temperature regime, precisely where you would want sharper selection, is where gradient behavior becomes most problematic for optimization.
This is a two-hit problem. The representational ceiling limits what softmax attention can express at inference time; the gradient sensitivity limits how effectively models can be trained to approach even that ceiling in the sharp-selection regime. Together, they define a region of the design space (long contexts, many attended tokens, sharp focus desired) where softmax attention is structurally disadvantaged.
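One way to see why the low-temperature regime is awkward for optimization is the standard Jacobian of softmax(z / T) with respect to the logits, J_ij = p_i (δ_ij − p_j) / T. The sketch below evaluates its Frobenius norm for two made-up logit vectors. It is a textbook identity rather than the paper’s derivation, and the function names and example values are assumptions for illustration.

```python
import math


def softmax_t(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def jacobian_norm(logits, temperature):
    """Frobenius norm of d softmax(z / T) / dz, using J_ij = p_i (delta_ij - p_j) / T."""
    p = softmax_t(logits, temperature)
    total = 0.0
    for i, p_i in enumerate(p):
        for j, p_j in enumerate(p):
            delta = 1.0 if i == j else 0.0
            entry = p_i * (delta - p_j) / temperature
            total += entry * entry
    return math.sqrt(total)


# Illustrative logit vectors (not from the paper).
TIED = [1.0, 1.0, 0.0, -1.0]        # two tokens tie for the top score
SEPARATED = [4.0, 0.0, -1.0, -2.0]  # one token clearly dominates

if __name__ == "__main__":
    for t in (1.0, 0.5, 0.1):
        print(f"T={t:.1f}  tied={jacobian_norm(TIED, t):.3e}  "
              f"separated={jacobian_norm(SEPARATED, t):.3e}")
```

As the temperature drops, the norm grows for the tied case because of the 1 / T factor, and collapses toward zero for the separated case because the distribution saturates. Gradients that are either very large or nearly absent, depending on the logits, are one plausible reading of why training is delicate in this regime.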
Existing off-ramps: softmax-free architectures
Alternatives to softmax attention already exist, and some have strong benchmark numbers. The most developed is SOFT (Softmax-free Transformer with Linear Complexity), from Fudan University, initially published at NeurIPS 2021 and extended in IJCV 2024. SOFT replaces softmax with a Gaussian kernel, eliminating the normalization step entirely. The Huge-Norm variant achieves 83.4% Top-1 accuracy on ImageNet-1K, demonstrating that softmax-free attention is practically viable in at least one domain.
| Property | Standard Softmax Attention | SOFT (Gaussian kernel) |
|---|---|---|
| Normalization | Softmax over logits | None (kernel-based) |
| Complexity | Quadratic in sequence length | Linear |
| ImageNet-1K Top-1 (best variant) | ~83% (ViT-Huge, softmax) | 83.4% (Huge-Norm) |
| NLP validation | Ubiquitous | Not reported |
The caveat is that SOFT’s published results are vision-only: ImageNet, COCO, ADE20K. Claiming parity or superiority for softmax-free attention in NLP or LLM settings is not yet supported by evidence. The numbers prove the approach works; they do not prove it transfers.
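To make the Gaussian-kernel idea concrete, here is a minimal Python sketch of unnormalized kernel attention. It is not SOFT’s implementation: it computes every query-key pair directly, so it is quadratic, and it omits whatever machinery gives SOFT the linear complexity listed in the table. The function names and the fixed kernel width are assumptions made for illustration.

```python
import math


def gaussian_weights(query, keys):
    """Weight for each key: exp(-||q - k||^2 / 2). No normalization across keys."""
    return [
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(query, key)))
        for key in keys
    ]


def attend(query, keys, values):
    """Weighted sum of value vectors using the unnormalized Gaussian weights."""
    weights = gaussian_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


if __name__ == "__main__":
    query = [1.0, 0.0]
    match = [1.0, 0.0]        # key identical to the query
    distractor = [-3.0, 2.0]  # key far from the query
    few = gaussian_weights(query, [match, distractor])
    many = gaussian_weights(query, [match] + [distractor] * 100)
    print(few[0], many[0])  # weight on the matching key in each case
```

Because each weight depends only on the distance between one query and one key, adding distractor tokens does not reduce the weight on a matching key. The weights also no longer sum to one, so the output scale can drift with sequence length, a trade-off any real design has to handle.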
What this means for architecture decisions
The practical question for teams running softmax-based models is whether the representational ceiling Petrova identifies is binding for their workload. For short-context tasks, it probably is not. The uniform-convergence problem scales with the number of attended tokens, so workloads with short sequences may never encounter the ceiling in practice.
The calculus shifts for long-context models, retrieval-augmented generation, and any application where the attention matrix is expected to be sparse (a few highly relevant tokens among many irrelevant ones). In those regimes, the geometric-separation constraints and the over-smoothing tendency are more likely to be active bottlenecks. The question is not whether the ceiling exists (it does, provably), but whether current workloads are hitting it.
For architecture-search teams, Petrova’s result suggests that softmax should be treated as a modeling choice with a known representational cost, not a fixed default. Adding softmax alternatives to the search space (or at minimum, parameterizing the normalization strategy) is now defensible on theoretical grounds, not just empirical ones.
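One possible shape for a parameterized normalization strategy is sketched below in Python. The registry, the strategy names, and the three candidates (softmax, elementwise sigmoid, and ReLU scaled by sequence length) are hypothetical choices for illustration. They are not recommendations from the paper, and none has been validated here.

```python
import math
from typing import Callable, Dict, List

Normalizer = Callable[[List[float]], List[float]]


def softmax_norm(scores: List[float]) -> List[float]:
    """Standard softmax: weights are coupled and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_norm(scores: List[float]) -> List[float]:
    """Elementwise sigmoid: each weight depends only on its own score."""
    out = []
    for s in scores:
        if s >= 0:
            out.append(1.0 / (1.0 + math.exp(-s)))
        else:
            e = math.exp(s)  # avoids overflow for very negative scores
            out.append(e / (1.0 + e))
    return out


def scaled_relu_norm(scores: List[float]) -> List[float]:
    """ReLU divided by sequence length: negative scores get exactly zero weight."""
    n = len(scores)
    return [max(s, 0.0) / n for s in scores]


# Hypothetical registry that a search space could index by name.
NORMALIZERS: Dict[str, Normalizer] = {
    "softmax": softmax_norm,
    "sigmoid": sigmoid_norm,
    "scaled_relu": scaled_relu_norm,
}


def attention_weights(scores: List[float], strategy: str = "softmax") -> List[float]:
    """Turn one row of raw attention scores into weights using the named strategy."""
    if strategy not in NORMALIZERS:
        raise ValueError(f"unknown normalization strategy: {strategy!r}")
    return NORMALIZERS[strategy](scores)
```

Treating the normalizer as a named, swappable component keeps the rest of the attention layer unchanged, so a search or ablation can vary it like any other hyperparameter.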
What remains unproven
Three gaps are worth stating plainly.
First, the paper’s empirical validation uses GPT-2, which leaves the frontier-scale question open. Whether the uniform-convergence bound binds at 100B+ parameters, with the depth and residual structure of modern LLMs, is unknown. The theory says the bound holds at any scale; whether it is tight at that scale has not been tested.
Second, the SOFT benchmarks are vision-only. No published SOFT result demonstrates that its Gaussian-kernel attention preserves or improves downstream performance in a language model, even a small one. Until someone retrains a GPT-2- or LLaMA-scale model with that variant and benchmarks it, the cost-benefit analysis for NLP practitioners remains theoretical.
Third, Petrova’s geometric-separation argument is distinct from the separate line of work on “softmax bottlenecks” in output vocabulary distributions (the Yang et al. 2018 observation that softmax limits the rank of the output probability matrix). These are different mechanisms with different fixes. Conflating them muddies both analyses.
The honest summary: softmax normalization constrains what attention can express, and the constraint is now formal rather than anecdotal. Off-ramps exist in vision. Whether they work in language, and whether the constraint is tight enough to justify retraining costs for existing stacks, are open empirical questions. Architecture teams building new models should at minimum consider normalization as a first-class design parameter. Teams maintaining deployed softmax stacks should treat the result as a known limitation to monitor, not an urgent reason to rewrite.
Frequently Asked Questions
Is Petrova’s finding the same as the Yang et al. softmax bottleneck?
No. Yang et al. (2018) showed that softmax limits the rank of the output probability matrix over the vocabulary, restricting what the model can predict. Petrova’s constraint operates on the attention weight distribution over the input sequence, restricting what the model can selectively focus on. They affect opposite ends of the network: one constrains the output projection, the other constrains attention pattern formation. The proposed fixes also diverge. Yang et al. suggest mixture-of-softmaxes to increase output rank; Petrova’s framework points toward replacing the normalization step entirely, as SOFT does with its Gaussian kernel.
What experiment would confirm or refute this at frontier-model scale?
The concrete test would measure the entropy of attention weight distributions across layers as context length increases in a 70B+ parameter model with modern architectural features like RoPE, SwiGLU, and grouped-query attention. If Petrova’s bound is tight, entropy should climb toward maximum (uniform weighting) as sequence length grows, and the rate of increase should match the theoretical prediction. The GPT-2 validation showed this pattern, but GPT-2 lacks the positional encodings, activation functions, and head-sharing strategies used in current production models, any of which could alter how the normalization constraint manifests.
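The measurement itself is simple to sketch. The Python below computes normalized entropy for one row of attention weights, then sweeps context length using synthetic logits drawn from a fixed range. In a real experiment the weights would come from the model’s attention heads; the synthetic logits, the bound of 3.0, and the function names are assumptions made only so that the sketch runs on its own.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(weights):
    """Shannon entropy divided by log(n): 1.0 is exactly uniform, 0.0 is one-hot."""
    n = len(weights)
    if n < 2:
        return 0.0
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return entropy / math.log(n)


def sweep(context_lengths=(16, 128, 1024), bound=3.0, seed=0):
    """Normalized entropy of softmax over synthetic logits in [-bound, bound]."""
    rng = random.Random(seed)
    results = {}
    for n in context_lengths:
        logits = [rng.uniform(-bound, bound) for _ in range(n)]
        results[n] = normalized_entropy(softmax(logits))
    return results


if __name__ == "__main__":
    for n, h in sweep().items():
        print(f"context length {n:5d}: normalized entropy {h:.3f}")
```

Dividing by log(n) makes values comparable across context lengths, which matters because raw entropy grows with n even for a fixed degree of selectivity.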
Why has softmax-free attention not been tested on language models?
The bottleneck is pretraining compute, not theory. Swapping the attention kernel generally requires training from random initialization because learned representations at every layer are coupled to the normalization dynamics. In vision, ImageNet pretraining takes days on a modest GPU cluster, making the experiment tractable. For a frontier language model, pretraining costs run into millions of GPU-hours. That budget is hard to justify without first seeing evidence the approach works at smaller NLP scales, and the SOFT results cover image patches rather than discrete linguistic tokens, so they do not constitute that evidence.
Should inference temperature settings change based on the gradient sensitivity finding?
No. Petrova’s gradient-sensitivity result is a training-time problem. At inference, temperature scaling still sharpens or flattens output distributions as intended. The concern is subtler: because models trained with softmax struggle to optimize in the low-temperature regime, the learned weights themselves may be suboptimal for tasks requiring sharp token discrimination. Setting a lower inference temperature does not fix weights that were trained poorly for that regime. The fix, if one is needed, would require changes to the training procedure or the normalization itself, not adjustments at serving time.