What Flexformer changes: trainable spectral frequencies vs fixed random features
Flexformer’s contribution is to take the kernel that linear attention uses to approximate softmax and make its parameters learnable, by treating the spectral frequencies of a random Fourier feature expansion as trainable weights. Linear attention works by rewriting the attention operation so that the N-by-N interaction matrix never materializes: if you replace the softmax score with an inner product of feature maps, you can fold the keys and values together first and get compute that is linear in the sequence length. The difficulty is that the feature map has to approximate the softmax kernel exp(q·k), which is unbounded and not shift-invariant, so the standard random-feature machinery does not apply to it directly.
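The reordering described above can be checked in a few lines. This is a minimal NumPy sketch, not Flexformer’s implementation: the feature map `feature_map` (elu(x) + 1) is a stand-in chosen only because it is positive and simple, and the function names are hypothetical.

```python
import numpy as np

def feature_map(x):
    # Stand-in positive feature map, elu(x) + 1. An assumption for
    # illustration only, not the learned map the paper proposes.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(Q, K, V):
    # Materializes the N-by-N score matrix: O(N^2) time and memory.
    scores = feature_map(Q) @ feature_map(K).T            # (N, N)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V):
    # Folds keys and values together first: cost is linear in N.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V                                      # (m, d_v)
    z = phi_k.sum(axis=0)                                 # (m,)
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(np.allclose(quadratic_attention(Q, K, V), linear_attention(Q, K, V)))
```

Both functions compute the same output; only the order of the matrix products differs, which is the whole trick.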
Random Fourier features are the standard tool for approximating shift-invariant kernels. You sample frequencies from the kernel’s spectral distribution and build an unbiased finite-dimensional estimator. Performer applied a version of this to attention, with adjustments to keep the features positive. Approximation quality is bounded by how many features you draw and how faithfully the sampled frequencies represent the target kernel.
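As a concrete instance of the construction above, the sketch below approximates a unit-bandwidth Gaussian kernel with random Fourier features and reports how the worst-case error changes with the number of features. The kernel choice, sizes, and the helper name `rff_max_error` are assumptions for illustration.

```python
import numpy as np

def rff_features(X, W, b):
    # z(x) = sqrt(2/m) * cos(W x + b), so that E[z(x) . z(y)] = k(x - y).
    m = W.shape[0]
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

def gaussian_kernel(X, Y):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq)

d = 8
X = 0.5 * np.random.default_rng(0).standard_normal((50, d))
exact = gaussian_kernel(X, X)

def rff_max_error(m, seed=1):
    rng = np.random.default_rng(seed)
    # The spectral distribution of exp(-|x - y|^2 / 2) is N(0, I).
    W = rng.standard_normal((m, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    Z = rff_features(X, W, b)
    return np.abs(Z @ Z.T - exact).max()

for m in (64, 1024, 16384):
    print(m, rff_max_error(m))
```

The error falls roughly as one over the square root of the feature count, which is why the number of features is the main accuracy knob for a fixed sampling distribution.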
Flexformer’s move is to stop sampling those frequencies from a fixed distribution and learn them instead. The paper frames fixed or weakly learnable kernels as the expressiveness bottleneck that has kept linear attention behind softmax, and proposes that making the spectral frequencies trainable lets the model discover a better kernel for the data at hand. The idea is plausible on its face: a data-dependent feature map has strictly more freedom than a fixed one.
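What “learning the frequencies” means mechanically can be sketched as follows. The abstract does not state the training objective, so everything here is an assumption: the frequencies are fitted to a hypothetical target kernel by plain gradient descent with a hand-derived gradient, whereas a real model would presumably train them end to end through the task loss with automatic differentiation.

```python
import numpy as np

def kernel_from_freqs(W, D):
    # k_W(delta) = mean_i cos(w_i . delta): the kernel induced by the
    # features [cos(Wx), sin(Wx)] / sqrt(m).
    return np.cos(D @ W.T).mean(axis=1)

def loss_and_grad(W, D, target):
    m, P = W.shape[0], D.shape[0]
    proj = D @ W.T                                   # (P, m): w_i . delta_p
    resid = np.cos(proj).mean(axis=1) - target       # (P,)
    loss = np.mean(resid ** 2)
    # dL/dw_i = (2/P) * sum_p resid_p * (-sin(w_i . delta_p) / m) * delta_p
    grad = -(2.0 / (P * m)) * (np.sin(proj) * resid[:, None]).T @ D
    return loss, grad

def fit_frequencies(W0, D, target, steps=200):
    W = W0.copy()
    lr = 0.05 * W.shape[0]     # scaled by m: each gradient carries a 1/m factor
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, D, target)
        history.append(loss)
        W -= lr * grad
    history.append(loss_and_grad(W, D, target)[0])
    return W, history

rng = np.random.default_rng(0)
d, m, P = 2, 32, 400
D = rng.uniform(-2.0, 2.0, size=(P, d))              # input differences x - y
target = np.exp(-2.0 * (D ** 2).sum(axis=1))         # hypothetical target kernel
W0 = rng.standard_normal((m, d))                     # fixed random-feature start
W, history = fit_frequencies(W0, D, target)
print(f"loss before: {history[0]:.4f}  after: {history[-1]:.4f}")
```

The starting point `W0` is exactly what a fixed random-feature method would keep; the only change is that the same matrix is then updated by gradients.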
Stationary vs nonstationary kernels: what “strictly greater expressiveness” means
Flexformer ships in two variants, and the distinction matters because the second is a strict generalization of the first. A stationary kernel depends only on the difference between its two inputs, so its value is translation-invariant. Random Fourier features are the natural tool here because Bochner’s theorem ties stationary kernels to a spectral distribution you can sample from. The nonstationary variant drops translation-invariance, letting the kernel depend on the inputs themselves rather than just their difference.
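The distinction is easy to see numerically. The sketch below uses two textbook kernels, not Flexformer’s own constructions (which the abstract does not describe): the Gaussian kernel is stationary, while the exponential kernel exp(x·y) behind softmax is nonstationary and factors as a Gaussian kernel times norm-dependent terms.

```python
import numpy as np

def gaussian_kernel(x, y):
    # Stationary: depends only on x - y.
    return np.exp(-0.5 * np.sum((x - y) ** 2))

def softmax_kernel(x, y):
    # Nonstationary: exp(x.y) = exp(|x|^2/2) * gaussian_kernel(x, y) * exp(|y|^2/2)
    return np.exp(x @ y)

x = np.array([0.5, -0.2, 0.1, 0.3])
y = np.array([-0.1, 0.4, 0.2, -0.3])
shift = np.ones(4)

# Shifting both inputs leaves the stationary kernel unchanged...
print(gaussian_kernel(x, y), gaussian_kernel(x + shift, y + shift))
# ...but changes the nonstationary one.
print(softmax_kernel(x, y), softmax_kernel(x + shift, y + shift))
```

The factorization in the comment is the reason a stationary random-feature approximation can be reused for softmax at all: the nonstationary part is carried by the two norm terms.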
The paper claims the nonstationary kernel offers strictly greater expressiveness than the stationary one. That is a statement about the function class the kernel can represent, not a measured accuracy result. A larger function class is usually good for capacity, but it is also where overfitting and optimization difficulty live.
There is a theoretical wrinkle as well. Bochner’s theorem, which licenses the random-Fourier-feature construction, applies to stationary kernels. Extending learned frequencies to a nonstationary kernel means the clean spectral guarantee no longer applies directly, and the quality of the approximation depends on assumptions the abstract does not spell out.
What the abstract claims, and what it does not quantify
The abstract states that Flexformer “consistently outperforms baselines” on language modeling and sequence classification, and reaches “competitive performance on long-sequence tasks.” It does not report a perplexity number, an accuracy figure, or the names of the baselines it beats. For a reader deciding whether to swap a working softmax or FlashAttention stack for a linear-attention kernel, those are exactly the numbers that matter.
The framing in the abstract is also the framing most likely to survive review without specifics: a broad claim of superiority on standard tasks, hedged by “competitive” on the harder long-sequence setting. That is consistent with a method that improves on prior linear-attention approximations but has not closed the gap to exact attention. It is equally consistent with a method that has closed the gap, and the abstract alone cannot distinguish the two.
The honest reading is that Flexformer advances prior linear-attention baselines by an unquantified margin. Asserting parity with softmax, or superiority to named approximations such as Performer and Linformer, requires numbers the abstract does not provide. The submission record (arXiv:2606.27748, posted 26 June 2026 by Haoran Zhang and Feng Zhou, classified under cs.LG and cs.AI, DOI 10.48550/arXiv.2606.27748) is verified; the empirical deltas are not yet.
Distillation from a pretrained Transformer: recovering softmax at what cost
Flexformer can be distilled from a pretrained softmax Transformer to recover softmax attention, and the paper also reports “strong kernel transferability across domains”. Distillation is the most practically useful claim in the abstract, and it also quietly limits the efficiency story.
The appeal is obvious. Training a linear-attention model from scratch with an approximate kernel is the part of the field that has always been hard: the kernel introduces bias and variance that slow convergence and cap accuracy. Distillation sidesteps this by starting from a teacher that already learned good attention behavior and transferring it into the Flexformer student. The student then runs at linear cost at inference, while the expensive softmax training is paid once, in the teacher.
The cost is that the teacher has to exist and has to be paid for. If the teacher is a large pretrained Transformer, the headline inference efficiency of the Flexformer student is real, but it sits on top of a sunk training cost that linear attention was supposed to help avoid. “Recovering softmax” in this context means approximating one specific pretrained model’s attention, not matching softmax as a general capability.
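One way such a distillation could be set up is to match the student’s attention weights to the teacher’s, row by row. The abstract does not specify the loss, so this sketch is an assumption throughout: the KL objective, the positive exponential feature map, and the names `student_attention_weights` and `distill_loss` are all hypothetical.

```python
import numpy as np

def softmax_attention_weights(Q, K):
    # Teacher: exact softmax attention weights, each row sums to 1.
    s = Q @ K.T / np.sqrt(Q.shape[1])
    s -= s.max(axis=1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

def student_attention_weights(Q, K, W):
    # Student: normalized inner products of positive features whose
    # frequencies W would be the trainable parameters.
    scale = Q.shape[1] ** -0.25          # splits the 1/sqrt(d) between q and k

    def phi(X):
        X = X * scale
        return np.exp(X @ W.T - 0.5 * (X ** 2).sum(axis=1, keepdims=True))

    s = phi(Q) @ phi(K).T
    return s / s.sum(axis=1, keepdims=True)

def distill_loss(teacher, student, eps=1e-12):
    # Mean KL(teacher || student) over query positions.
    kl = teacher * (np.log(teacher + eps) - np.log(student + eps))
    return np.mean(kl.sum(axis=1))

rng = np.random.default_rng(0)
N, d, m = 32, 8, 256
Q, K = rng.standard_normal((N, d)), rng.standard_normal((N, d))
W = rng.standard_normal((m, d))          # initial frequencies, before any training
teacher = softmax_attention_weights(Q, K)
student = student_attention_weights(Q, K, W)
print("KL before training:", distill_loss(teacher, student))
```

Minimizing this loss over `W` fits the student to this particular teacher’s attention patterns, which is the sense in which the recovered softmax is specific to one pretrained model.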
The transferability claim is the one that would soften the overfitting worry introduced by the nonstationary variant: if a learned kernel ports between domains without relearning, then flexibility is not just overfitting. Like the accuracy claims, it is unquantified in the abstract, and it is the result most worth checking against the paper’s actual tables.
Does a learnable kernel close the gap, or just relocate the approximation error?
This is the question the abstract does not answer, and it is the one that decides whether Flexformer is a genuine step or a reshuffling of where linear attention loses information.
The accuracy gap in linear attention has a specific origin. The softmax kernel exp(q·k) is positive definite, but it is unbounded and not shift-invariant, and its exact feature map is infinite-dimensional, so any finite-dimensional feature expansion truncates it. A fixed kernel absorbs an approximation error that is set by the spectral sampling and does not adapt to the data. Performer and its peers live with that error, and the gap to exact attention is the price of linear time.
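The finite-feature error can be observed directly. The sketch below estimates exp(q·k) with positive random features in the style Performer popularized, using Gaussian draws; the vectors, feature counts, and function name are assumptions for illustration.

```python
import numpy as np

def positive_feature_estimate(q, k, m, seed=0):
    # exp(q.k) = E_w[ exp(w.q - |q|^2/2) * exp(w.k - |k|^2/2) ],  w ~ N(0, I)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, q.shape[0]))
    phi_q = np.exp(W @ q - 0.5 * q @ q)
    phi_k = np.exp(W @ k - 0.5 * k @ k)
    return np.mean(phi_q * phi_k)

q = np.array([0.3, -0.2, 0.1, 0.4])
k = np.array([0.2, 0.1, -0.3, 0.5])
exact = np.exp(q @ k)

for m in (16, 256, 4096, 100_000):
    est = positive_feature_estimate(q, k, m)
    print(m, abs(est - exact) / exact)
```

The estimator is unbiased, but its relative variance grows like exp(|q + k|^2) - 1, so large-norm queries and keys need many more features for the same accuracy. That is the data-independent error a fixed sampling distribution has to absorb.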
A learnable kernel changes the locus of the error rather than eliminating it. By learning the frequencies, Flexformer can fit a feature map that better matches the training distribution’s statistics, which plausibly reduces the bias a fixed kernel carries. The risk is symmetric. A kernel fit to training data may not generalize, in which case the error reappears as a train/test gap instead of a kernel-bias gap, and you have spent extra capacity to move the problem rather than remove it. The transferability result is the paper’s answer to this worry, but a strong transfer claim is itself the experiment that would need to be quoted.
The charitable read is that a data-dependent kernel is strictly more flexible than a fixed one and so can only help on the bias side, while paying for it with variance. The skeptical read is that flexibility purchased with variance is exactly what random features were designed to avoid, and the field moved toward FlashAttention and state-space models partly because tuning the linear-attention approximation was more trouble than it was worth. Both reads are consistent with the abstract. The full PDF’s generalization and transfer experiments are what would settle it.
Where Flexformer sits next to Performer, Linformer, FlashAttention, Mamba and RetNet
Flexformer enters a field that has largely moved on from kernel design, which is both why the idea is interesting and why adoption is uncertain.
FlashAttention is not an approximation. It computes exact softmax attention and wins on memory and IO through tiling and GPU kernel fusion (kernel in the CUDA sense, not the kernel-function sense used elsewhere here), which is why it reset the practical bar for the sequence lengths most practitioners actually use. For those lengths, the accuracy-for-speed tradeoff of linear-approximation methods is hard to justify when exact attention is fast enough and loses nothing. Linformer, in turn, cut the cost with a low-rank projection of the keys and values rather than a kernel approximation, attacking a different axis of the problem.
At the very-long-sequence end, Mamba and RetNet abandoned attention for state-space and retention formulations that are linear in sequence length and have reported strong long-context results. Those methods compete with linear attention on its home turf and have absorbed much of the long-context demand that kernel methods were originally targeting.
Performer is Flexformer’s closest relative: both build on random-feature expansions of the attention kernel, which makes Flexformer a natural extension of Performer’s feature construction, with the frequencies promoted to learned parameters. The honest framing is that Flexformer keeps the Transformer architecture and the attention mechanism, and tries to repair the approximation rather than replace the mechanism. Whether anyone still wants that repair when FlashAttention covers the practical range and state-space and retention models cover the very-long range is a question a kernel paper that does not report sequence-length-specific accuracy cannot answer. The mechanism is sound; the case for swapping it in is, for now, unproven.
Frequently Asked Questions
If Performer already uses random Fourier features, what does learning the frequencies actually change?
Performer draws its random projections from a fixed Gaussian distribution, uses orthogonal draws to reduce variance, and can redraw them periodically during training. Flexformer replaces that random sample with gradient-updated weights, gaining the freedom to fit data-dependent frequency content but forfeiting the variance reduction that re-sampling provides, since the frequencies are now fixed learned parameters rather than a stochastic draw.
What theoretical guarantee does the nonstationary variant give up?
Bochner’s theorem is what licenses random Fourier features for stationary kernels: it guarantees a spectral distribution to sample from, and sampling from it gives an unbiased estimator whose error shrinks predictably as features are added. The nonstationary variant drops translation-invariance, so that guarantee no longer applies directly, leaving approximation quality measurable only empirically on each dataset rather than derivable before training.
Where does Flexformer’s linear attention actually pay off over FlashAttention?
During autoregressive generation, softmax attention’s KV cache grows with every token produced, whether or not FlashAttention computes the attention, so its memory footprint climbs with sequence length. Linear attention folds keys and values into a fixed-size state, so Flexformer’s advantage is in long-context generation, where the KV cache would otherwise dominate memory, not in training, where FlashAttention’s fused exact attention is fast and loses nothing.
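The fixed-size state can be made concrete with a recurrent form of causal linear attention. This is a generic sketch, not Flexformer’s code: the feature map is the same illustrative elu(x) + 1 stand-in, and the function names are hypothetical.

```python
import numpy as np

def feature_map(x):
    # Stand-in positive feature map (elu(x) + 1), an assumption for illustration.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_recurrent(Q, K, V):
    m, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((m, d_v))     # running sum of phi(k_j) v_j^T, size independent of N
    z = np.zeros(m)            # running sum of phi(k_j)
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        phi_q, phi_k = feature_map(Q[t]), feature_map(K[t])
        S += np.outer(phi_k, V[t])
        z += phi_k
        out[t] = (phi_q @ S) / (phi_q @ z)
    return out

def causal_linear_attention_quadratic(Q, K, V):
    # Reference: same result via an explicit masked N-by-N score matrix.
    scores = np.tril(feature_map(Q) @ feature_map(K).T)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, d = 64, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(np.allclose(causal_linear_attention_recurrent(Q, K, V),
                  causal_linear_attention_quadratic(Q, K, V)))
```

Per head, the state `S` and `z` holds m * d_v + m numbers however many tokens have been generated, whereas a KV cache stores keys and values for every past token and so grows linearly with the sequence.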
What would swapping to Flexformer require in an existing Transformer codebase?
Only the attention scoring changes: the softmax over query-key products is replaced by an inner product of learned feature maps, while the QKV projections, positional encodings, feed-forward layers, and residuals stay. The new moving parts are the frequency parameters per head and the feature-map implementation, plus a decision on whether to initialize from scratch or distill from an existing pretrained checkpoint.