How Linear Is a Transformer Feed-Forward Block? A New Test Says It's Learned, Not Built In

Most quantization and pruning pipelines treat feed-forward blocks in transformers as approximately linear, a handy assumption that makes cheap approximations defensible. A June 2026 preprint from Stuart Whipp tests that assumption directly and finds that it holds in some layers, fails badly in others, and cannot be predicted from either architecture or activation function. Safe-to-compress blocks therefore have to be identified per checkpoint and per layer.

What does “linear recoverability” actually measure?

The paper defines R²_lin as the fraction of held-out output variance explained by the exact closed-form least-squares linear approximation of each FFN block, treating the block as a position-wise input-to-output map. There is no optimizer and no training: the closed-form solution gives the ceiling on what any linear approximation could achieve. A block with R²_lin near 1.0 is nearly linearly representable; one near 0.3 is not.
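A minimal sketch of that measurement, written in NumPy, may make the definition concrete. Everything here is an assumption for illustration rather than the paper's code: the function name r2_lin, the added bias column, the choice to sum variance over output dimensions, and the toy block (which uses tanh instead of GELU and a hypothetical gain parameter to dial the nonlinearity up or down).

```python
import numpy as np

def r2_lin(x_train, y_train, x_test, y_test):
    """Held-out R^2 of the closed-form least-squares fit y ~ x @ W + b."""
    # Assumption: a constant column is appended so the fit includes a bias.
    a_train = np.hstack([x_train, np.ones((x_train.shape[0], 1))])
    a_test = np.hstack([x_test, np.ones((x_test.shape[0], 1))])
    # Exact least-squares solve: no optimizer, no learning rate, no epochs.
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    residual = y_test - a_test @ coef
    # Total variance is taken around the held-out mean, summed over outputs.
    centered = y_test - y_test.mean(axis=0, keepdims=True)
    return 1.0 - (residual ** 2).sum() / (centered ** 2).sum()

def toy_ffn(x, w_in, w_out, gain):
    # Hypothetical stand-in block; tanh replaces GELU to keep the toy simple.
    return np.tanh(gain * (x @ w_in)) @ w_out

rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 16, 64, 4000
w_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
w_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
x = rng.normal(size=(n_tokens, d_model))
half = n_tokens // 2  # first half fits the map, second half is held out

scores = {}
for name, gain in [("nearly_linear", 0.01), ("saturated", 50.0)]:
    y = toy_ffn(x, w_in, w_out, gain)
    scores[name] = r2_lin(x[:half], y[:half], x[half:], y[half:])
print(scores)
```

With a tiny gain the toy block stays in the linear part of tanh and scores close to 1; with a large gain the activations saturate and the score drops well below that. On a real model, x and y would be the activations entering and leaving one FFN block.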

That choice of ceiling matters. Training a linear layer to approximate a target function on ill-conditioned transformer activations can severely under-converge, which means a trained linear baseline may score lower than the block’s true linear approximability. The paper argues that prior work using trained baselines may have systematically underestimated actual block linearity, in the wrong direction for the compression conclusions they drew.
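The under-convergence effect is easy to reproduce on synthetic data. The sketch below is an illustration under stated assumptions, not a reconstruction of any prior work's baseline: the target is exactly linear, the inputs are deliberately ill-conditioned, and plain full-batch gradient descent with a fixed step count stands in for a "trained linear layer".

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 8
# Ill-conditioned inputs: feature scales span three orders of magnitude.
scales = np.logspace(0, -3, d)
w_true = (1.0 / scales)[:, None]  # every feature contributes equally to y

def sample(n_rows):
    x = rng.normal(size=(n_rows, d)) * scales
    return x, x @ w_true  # the target is exactly linear in x

def r2(y_true, y_pred):
    centered = y_true - y_true.mean(axis=0, keepdims=True)
    return 1.0 - ((y_true - y_pred) ** 2).sum() / (centered ** 2).sum()

x_train, y_train = sample(n)
x_test, y_test = sample(n)

# Closed-form least squares recovers the map essentially exactly.
w_exact, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)

# Full-batch gradient descent on the same squared loss, 500 steps.
w_gd = np.zeros((d, 1))
learning_rate = 0.5  # stable here because the largest curvature is about 1
for _ in range(500):
    grad = x_train.T @ (x_train @ w_gd - y_train) / n
    w_gd -= learning_rate * grad

r2_exact = r2(y_test, x_test @ w_exact)
r2_trained = r2(y_test, x_test @ w_gd)
print(f"closed form: {r2_exact:.4f}  trained: {r2_trained:.4f}")
```

The function being approximated is perfectly linear, so any gap between the two scores comes from the optimizer, not from the target. The low-variance input directions converge too slowly for 500 steps to fit them, which is the kind of shortfall a trained baseline can silently absorb into its score.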

What does the distribution of R²_lin look like across layers?

Across all twelve FFN blocks of GPT-2, Pythia-160m, and llama-160m, the paper reports R²_lin swinging from above 0.99 to below 0.3 between adjacent layers, with a pattern that is non-monotone with depth. Linear recoverability does not increase or decrease predictably from input layers to output layers; the profile is irregular and model-specific.

That heterogeneity is the central finding. A compression pipeline that applies the same linear approximation strategy uniformly across all FFN blocks will take sharply different quality hits on different layers, and the layers that degrade silently cannot be identified without probing them first.

Why doesn’t the activation function explain the pattern?

GPT-2 and Pythia-160m both use GELU activations and are the same width, yet their R²_lin profiles are sharply different. If activation nonlinearity were the determining factor for block linearity, models sharing an activation function and width would show similar patterns.

The implication is that linear recoverability is a property of the trained checkpoint, not the architecture diagram. The training trajectory (data distribution, initialization, optimization path) shapes how nonlinear each block’s learned function ends up being. Architecture constrains the space; training determines where in that space you land.

What is the practical compression signal?

For blocks with high R²_lin, the paper demonstrates that GPT-2’s early FFN can be replaced with a single linear layer at 8x fewer parameters for a perplexity cost of +0.77. That figure is paper-reported and not independently replicated. Whether it generalizes beyond small models is an open question.
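The 8x figure is consistent with simple parameter counting. The arithmetic below assumes the standard 12-layer GPT-2 dimensions (hidden width 768, FFN inner width 3072) and a replacement that is a single 768-by-768 map with a bias; the article does not spell out the exact replacement configuration, so treat both as assumptions.

```python
# Assumed dimensions for the 12-layer GPT-2 checkpoint.
d_model, d_ff = 768, 3072

# Two-layer FFN: up-projection and down-projection, each with a bias.
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

# Assumed replacement: one square linear map with a bias.
linear_params = d_model * d_model + d_model

ratio = ffn_params / linear_params
print(ffn_params, linear_params, round(ratio, 2))
```

Under those assumptions the block shrinks from about 4.7M parameters to about 0.59M, a ratio just under 8.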

For blocks with low R²_lin, single-layer linear replacement is explicitly flagged as unsafe. The paper does not claim these blocks cannot be compressed by other means; it claims the standard linear approximation fails them in ways that high-R²_lin blocks tolerate.
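In tooling terms, this amounts to gating the linear-replacement step on a per-layer profile. The helper below is a hypothetical sketch: the function name, the cutoff of 0.95, and the example scores are all invented for illustration, and none of them are figures from the paper.

```python
def split_by_recoverability(profile, threshold):
    """Partition layer indices by whether R2_lin meets the threshold."""
    candidates = [i for i, score in enumerate(profile) if score >= threshold]
    keep_intact = [i for i, score in enumerate(profile) if score < threshold]
    return candidates, keep_intact

# Made-up values for illustration only; these are NOT figures from the paper.
example_profile = [0.995, 0.41, 0.97, 0.28]
# Hypothetical cutoff; the article does not report a threshold.
candidates, keep_intact = split_by_recoverability(example_profile, threshold=0.95)
print(candidates, keep_intact)
```

A real pipeline would still need to validate each candidate replacement against perplexity or a downstream metric, since a high score on one calibration set is only a screening signal.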

What are the limits of this result?

The paper studies GPT-2, Pythia-160m, and llama-160m, all around 160M parameters or smaller. Whether heterogeneous R²_lin patterns persist at 7B, 70B, or larger scales is unverified. It is plausible that longer training runs on larger models converge to different linearity profiles; it is equally plausible the heterogeneity grows. Neither direction is established by this preprint.

The paper’s contribution is a measurement method with clear theoretical grounding and an empirical finding that should push compression tooling toward per-block profiling. Whether the heterogeneity pattern is universal or a small-model artifact is the obvious next experiment, and the obvious gap in what this preprint can claim.

Frequently Asked Questions

Does R²_lin profiling apply to mixture-of-experts FFN blocks?

The paper tests only dense position-wise FFN blocks. In MoE architectures, each token routes dynamically to one or a subset of expert FFNs, so a single closed-form linear approximation would need to cover the full routing mixture rather than one static function. R²_lin could be measured per expert in isolation, but the result would not capture compression safety for the block as a whole, so the paper’s findings do not transfer directly to MoE compression pipelines.

How does R²_lin differ from quantization error as a compression signal?

Quantization methods like GPTQ reduce weight precision while keeping the original function intact; R²_lin asks whether a fundamentally different function (a single linear map) can replace the block entirely. A block can be easy to quantize while remaining strongly nonlinear in behavior, or vice versa. The two signals address orthogonal properties of an FFN block and cannot substitute for each other in a compression audit.
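A toy example can show the two signals pulling apart. The sketch below is not GPTQ: it uses plain round-to-nearest symmetric 8-bit quantization with one scale per weight matrix, a ReLU block instead of GELU, and random weights, all of which are simplifying assumptions. It compares how well the quantized block and a least-squares linear map each reproduce the original outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 16, 64, 4000
w_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
w_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
x = rng.normal(size=(n_tokens, d_model))

def ffn(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out  # toy ReLU block

def quantize_int8(w):
    # Round-to-nearest onto a symmetric 8-bit grid, one scale per matrix.
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale) * scale

def r2(y_true, y_pred):
    centered = y_true - y_true.mean(axis=0, keepdims=True)
    return 1.0 - ((y_true - y_pred) ** 2).sum() / (centered ** 2).sum()

y = ffn(x, w_in, w_out)
y_quant = ffn(x, quantize_int8(w_in), quantize_int8(w_out))

# Least-squares linear map fit on the first half, scored on the second.
half = n_tokens // 2
a = np.hstack([x, np.ones((n_tokens, 1))])
coef, *_ = np.linalg.lstsq(a[:half], y[:half], rcond=None)

r2_quant = r2(y[half:], y_quant[half:])
r2_linear = r2(y[half:], a[half:] @ coef)
print(f"quantized block: {r2_quant:.4f}  linear map: {r2_linear:.4f}")
```

In this toy, the quantized block tracks the original almost perfectly while the linear map leaves a substantial share of variance unexplained, so a block can pass one check and fail the other.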

What does measuring R²_lin cost in a compression pipeline?

Because R²_lin uses the exact closed-form least-squares solution, there is no training loop or gradient computation involved. The cost per block is one forward pass to collect activations plus one matrix inversion, so a full 12-block profile costs little compared to any fine-tuning or distillation step that would follow compression decisions.
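One way to realize "one forward pass plus one solve" is to accumulate the normal-equation statistics batch by batch, so activations never need to be stored in full. The class below is a sketch under assumptions: the name StreamingLeastSquares is hypothetical, random data stands in for real activations, and the paper may compute its fit differently.

```python
import numpy as np

class StreamingLeastSquares:
    """Accumulates normal-equation statistics over batches of activations."""

    def __init__(self, d_in, d_out):
        # One extra row and column for the bias term.
        self.xtx = np.zeros((d_in + 1, d_in + 1))
        self.xty = np.zeros((d_in + 1, d_out))

    def update(self, x, y):
        a = np.hstack([x, np.ones((x.shape[0], 1))])
        self.xtx += a.T @ a
        self.xty += a.T @ y

    def solve(self):
        # A single linear solve once all batches have been seen.
        return np.linalg.solve(self.xtx, self.xty)

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
w_hidden = rng.normal(size=(d_in, d_out))
fitter = StreamingLeastSquares(d_in, d_out)
batches = []
for _ in range(3):
    x = rng.normal(size=(100, d_in))
    y = np.tanh(x @ w_hidden)  # stand-in for a block's outputs
    fitter.update(x, y)
    batches.append((x, y))
coef_streamed = fitter.solve()

# Reference: the same fit computed on all rows at once.
x_all = np.vstack([xb for xb, _ in batches])
y_all = np.vstack([yb for _, yb in batches])
a_all = np.hstack([x_all, np.ones((x_all.shape[0], 1))])
coef_direct, *_ = np.linalg.lstsq(a_all, y_all, rcond=None)
print(np.allclose(coef_streamed, coef_direct))
```

One caveat on the design choice: forming the normal equations squares the condition number of the input matrix, so on strongly ill-conditioned activations a QR- or SVD-based solver such as np.linalg.lstsq is the numerically safer route, at the cost of holding the activations in memory.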

If larger models show uniform high R²_lin, does per-block profiling become unnecessary?

Yes, but that outcome is unconfirmed. Some theoretical work on overparameterized networks suggests wider models may converge closer to linear regimes, which would push R²_lin uniformly toward 1.0 and make per-block probing redundant. The opposite is equally plausible: more training compute could sharpen task-specific nonlinearities in certain layers. Until the measurement is replicated on 7B or larger checkpoints, the conservative position is to profile regardless of model size.