Does Attribution Patching Lie? A Fix for a Common Interpretability Shortcut

Attribution patching, the gradient-based shortcut that made circuit discovery affordable for large language models, systematically misattributes importance in certain regimes. A June 2026 paper by Jialu Wang and collaborators pins the failure on downstream non-linearities the linear approximation ignores, and offers a fix that costs one extra backward pass.

What is attribution patching and why do teams use it?

Full activation patching requires a separate forward pass for every model component you want to test, which is prohibitively expensive on anything beyond toy-scale networks. Attribution patching, introduced by Neel Nanda in March 2023, approximates the result with two forward passes and one backward pass by treating the model’s response to a perturbation as locally linear. Nanda himself called it a “useful but flawed exploratory technique” and noted it works reasonably on small activations like individual attention-head outputs but poorly on large ones like the full residual stream.
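
As a minimal sketch of the difference between the two procedures, the toy below swaps the transformer for a small hand-written function so the gradient can be written out explicitly. The names `metric`, `metric_grad`, `W`, and `v` are illustrative assumptions, not anything from Nanda's write-up or the paper; in a real model the gradient would come from the single backward pass on the clean run.

```python
import math

# Toy stand-in for "everything downstream of the patched component":
# metric(a) = sum_j v[j] * tanh(sum_i W[j][i] * a[i]).
# W and v are made-up illustrative weights, not taken from any real model.
W = [[0.5, -0.3], [0.2, 0.8]]
v = [1.0, -0.5]


def metric(a):
    """Scalar output (e.g. a logit difference) as a function of activation a."""
    return sum(vj * math.tanh(sum(w * x for w, x in zip(row, a)))
               for vj, row in zip(v, W))


def metric_grad(a):
    """Analytic gradient of metric w.r.t. a (what a backward pass would return)."""
    grad = [0.0] * len(a)
    for vj, row in zip(v, W):
        z = sum(w * x for w, x in zip(row, a))
        sech2 = 1.0 - math.tanh(z) ** 2  # derivative of tanh at z
        for i, w in enumerate(row):
            grad[i] += vj * sech2 * w
    return grad


def activation_patch(a_clean, a_corrupt):
    """Exact effect: rerun the downstream function with the patched activation."""
    return metric(a_corrupt) - metric(a_clean)


def attribution_patch(a_clean, a_corrupt):
    """First-order estimate: (a_corrupt - a_clean) . grad metric(a_clean)."""
    g = metric_grad(a_clean)
    return sum((c - x) * gi for c, x, gi in zip(a_corrupt, a_clean, g))


a_clean = [0.1, 0.2]
small = [0.15, 0.25]  # small perturbation, like a single head's output
large = [2.0, -1.5]   # large perturbation, like a full residual-stream patch

err_small = abs(attribution_patch(a_clean, small) - activation_patch(a_clean, small))
err_large = abs(attribution_patch(a_clean, large) - activation_patch(a_clean, large))
print(err_small, err_large)
```

In this toy the linear estimate tracks the exact patch closely for the small perturbation and drifts badly for the large one, which mirrors the small-activation versus large-activation caveat above.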

The shortcut caught on quickly. “Attribution Patching Outperforms Automated Circuit Discovery,” by Aaquib Syed, Can Rager, and Arthur Conmy, presented at the NeurIPS 2023 ATTRIB Workshop, showed attribution patching achieved higher AUC on circuit recovery than ACDC averaged across tasks. For teams that needed to scan large models for circuits without burning compute budgets, the tradeoff was acceptable: fast and good enough, if not rigorous.

Where does the linear approximation break?

The implicit assumption behind attribution patching is that the error from the linear approximation is locally contained at the patched component. The new paper (arXiv:2606.09899) demonstrates this assumption is wrong. The dominant error source is non-linearity in the downstream network, not local curvature at the patched component itself.

The distinction matters. If the error were local, you could reason about it component by component. If it propagates and amplifies through downstream layers, the attribution map you get back may highlight the wrong components entirely. The paper characterizes this as the first-order approximation “lying” about which components matter, though the word carries more rhetorical weight than the underlying math requires. Nanda’s original caveat already acknowledged the method was unreliable on large activations; the contribution here is identifying why and where, not the existence of the problem.
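
One way to see why the error is governed by the downstream network is to write out the Taylor expansion that attribution patching truncates. The notation is an assumption made here for illustration and is not taken from the paper: m is the metric viewed as a function of the patched activation, a is the clean activation, and a' is the patched value.

```latex
\begin{aligned}
\underbrace{m(a') - m(a)}_{\text{activation patching}}
  &= \underbrace{\nabla m(a)^{\top}\delta}_{\text{attribution patching}}
   + \tfrac{1}{2}\,\delta^{\top} H(a)\,\delta
   + O\!\left(\lVert\delta\rVert^{3}\right),
  \qquad \delta = a' - a, \\
H(a) &= \nabla_{a}^{2}\, m(a).
\end{aligned}
```

Because m is the composition of every layer after the patched component, the Hessian H describes the curvature of the downstream network. The patched component enters only through the size and direction of the perturbation, which is consistent with the claim that the leading error term is a downstream property.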

What tools does the paper introduce?

The authors propose three practical additions to the attribution-patching pipeline:

A reliability score that flags components where the first-order estimate is likely untrustworthy, giving practitioners a diagnostic before they commit to acting on the results.
Error bounds on attribution mis-specifications, quantifying how far the linear estimate can diverge from the true activation-patching result.
A Hessian-vector-product (HVP) correction that eliminates the leading-order error term with a single additional backward pass.

The HVP correction is the headline tool. Computing the full Hessian is intractable for transformer-scale models, but a Hessian-vector product can be obtained with one additional backward pass through the gradient computation, without ever materializing the Hessian itself. The trick is well known in the optimization literature; the contribution here is applying it to the specific structure of attribution patching’s error.
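
The idea can be sketched on a toy function, under stated assumptions. The block below is a generic second-order Taylor correction, not the paper's implementation: the toy `metric`, its weights, and the helper names are hypothetical, and the Hessian-vector product is approximated by a finite difference of two gradients as a stand-in for the exact double-backward computation an autograd framework would provide. Either way, the extra cost is one additional gradient evaluation.

```python
import math

# Made-up toy weights standing in for the downstream network.
W = [[0.5, -0.3], [0.2, 0.8]]
v = [1.0, -0.5]


def metric(a):
    return sum(vj * math.tanh(sum(w * x for w, x in zip(row, a)))
               for vj, row in zip(v, W))


def metric_grad(a):
    """Analytic gradient; plays the role of one backward pass."""
    grad = [0.0] * len(a)
    for vj, row in zip(v, W):
        z = sum(w * x for w, x in zip(row, a))
        sech2 = 1.0 - math.tanh(z) ** 2
        for i, w in enumerate(row):
            grad[i] += vj * sech2 * w
    return grad


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def hvp_finite_difference(a, d, grad_a, eps=1e-4):
    """Approximate H(a) @ d as (grad(a + eps*d) - grad(a)) / eps.

    Reuses the gradient already computed at a, so only one extra
    gradient evaluation is needed. The Hessian is never formed.
    """
    shifted = metric_grad([ai + eps * di for ai, di in zip(a, d)])
    return [(s - g) / eps for s, g in zip(shifted, grad_a)]


def first_and_second_order(a_clean, a_corrupt):
    d = [c - x for c, x in zip(a_corrupt, a_clean)]
    g = metric_grad(a_clean)                   # the usual backward pass
    first = dot(g, d)                          # plain attribution patching
    hd = hvp_finite_difference(a_clean, d, g)  # the one extra pass
    second = first + 0.5 * dot(d, hd)          # add the leading error term
    return first, second


a_clean, a_corrupt = [0.1, 0.2], [0.5, 0.6]
true_effect = metric(a_corrupt) - metric(a_clean)
first, second = first_and_second_order(a_clean, a_corrupt)
err_first = abs(first - true_effect)
err_second = abs(second - true_effect)
print(err_first, err_second)
```

On this toy the corrected estimate lands closer to the exact patching result than the first-order one. The remaining gap is the third-order and higher terms, which a single correction does not remove.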

How does HVP compare to existing alternatives?

The paper reports that HVP is the only second-order correction feasible at larger model scales, where standard baselines like Integrated Gradients become computationally prohibitive. A multi-step variant of the HVP correction matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, according to the paper’s abstract.

This matters because the compute budget is the reason teams chose attribution patching in the first place. If the correction were expensive, it would undermine the original motivation. One extra backward pass is a modest overhead, so the cost-to-accuracy tradeoff appears favorable, though the accuracy figures are taken from the abstract and have not yet been checked against the full paper.

Separately, an April 2026 study on contrastive attribution (arXiv:2604.17761) found that token-level contrastive attribution produces informative signals in some LLM failure cases but is not universally applicable. The convergence is notable: the field is accumulating evidence that all attribution-based tools have regime-dependent reliability, and the HVP correction addresses one specific regime rather than solving the general problem.

What does the Screen-Flag-Fix workflow look like in practice?

The paper proposes a three-stage pipeline called Screen-Flag-Fix. Run cheap first-order attribution patching across all components. Use the reliability score to flag the ones where the linear estimate is suspect. Apply the HVP correction only to those flagged components.
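
The control flow of such a gated pipeline can be sketched as follows. Everything specific in this block is an assumption: the source does not define the paper's reliability score, so `placeholder_score` (the norm of the perturbation, where larger means less trustworthy) is a loudly labeled stand-in, and the component names, threshold, and toy `metric` are hypothetical. The fix step uses a finite-difference Hessian-vector product in place of an exact one.

```python
import math

# Made-up toy weights standing in for the downstream network.
W = [[0.5, -0.3], [0.2, 0.8]]
v = [1.0, -0.5]


def metric(a):
    return sum(vj * math.tanh(sum(w * x for w, x in zip(row, a)))
               for vj, row in zip(v, W))


def metric_grad(a):
    grad = [0.0] * len(a)
    for vj, row in zip(v, W):
        z = sum(w * x for w, x in zip(row, a))
        sech2 = 1.0 - math.tanh(z) ** 2
        for i, w in enumerate(row):
            grad[i] += vj * sech2 * w
    return grad


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def placeholder_score(d):
    """PLACEHOLDER, not the paper's reliability score: perturbation norm."""
    return math.sqrt(dot(d, d))


def screen_flag_fix(components, threshold, eps=1e-4):
    """Return {name: (estimate, flagged)} for each component."""
    results = {}
    for name, (a_clean, a_corrupt) in components.items():
        d = [c - x for c, x in zip(a_corrupt, a_clean)]
        g = metric_grad(a_clean)
        estimate = dot(g, d)                        # Screen: first-order pass
        flagged = placeholder_score(d) > threshold  # Flag: cheap diagnostic
        if flagged:                                 # Fix: only where flagged
            shifted = metric_grad([x + eps * di for x, di in zip(a_clean, d)])
            hd = [(s - gi) / eps for s, gi in zip(shifted, g)]
            estimate += 0.5 * dot(d, hd)
        results[name] = (estimate, flagged)
    return results


# Hypothetical components: name -> (clean activation, corrupted activation).
components = {
    "head_0": ([0.1, 0.2], [0.15, 0.25]),
    "head_1": ([0.1, 0.2], [0.05, 0.30]),
    "resid_stream": ([0.1, 0.2], [0.5, 0.6]),
}
results = screen_flag_fix(components, threshold=0.3)
for name, (estimate, flagged) in results.items():
    print(name, round(estimate, 4), "corrected" if flagged else "first-order")
```

The design point is that the extra gradient evaluation is paid only for flagged components. The quality of the whole pipeline therefore rests on the score, which is the part this sketch does not attempt to reproduce.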

This is a straightforward gated-correction pattern: spend compute only where the cheap method has been diagnosed as unreliable. The Hybrid Attribution and Pruning (HAP) framework, presented at NeurIPS 2025, used a similar logic, treating attribution patching as a fast pre-filter before expensive edge pruning. HAP explicitly acknowledged attribution patching’s low faithfulness to the full model, which means the field was already aware the shortcut was unreliable on its own. The difference here is the diagnostic layer: rather than applying the expensive correction uniformly, Screen-Flag-Fix tries to target it.

One gap worth noting: the paper proposes Screen-Flag-Fix but does not benchmark it against the simpler alternative of running full activation patching on whatever components the reliability score flags. Whether the HVP correction is more efficient than just falling back to full activation patching on flagged components remains an open question.

What should practitioners do today?

Three concrete steps for teams running attribution patching in production circuit-discovery pipelines:

Audit existing attribution maps for the failure regime. If your pipeline patches large activations like the full residual stream, the non-linearity error is likely material. Nanda flagged this in 2023; this paper quantifies the mechanism.

Implement the reliability score as a gate. Before trusting an attribution estimate, check whether the component falls in the regime where first-order approximation is reliable. This is low-cost diagnostic work that requires no changes to the core patching logic.

Evaluate the HVP correction on flagged components. The marginal cost is one backward pass per flagged component. Against that, you get a second-order-corrected estimate that, per the paper’s abstract, eliminates the leading-order error. Verify the improvement on your own model and task before adopting it as a default.

The broader lesson is not specific to attribution patching. Any interpretability method that relies on a linear approximation of a non-linear model inherits some version of this problem. The paper gives a concrete fix for one widely used method, and the gated-correction pattern, where a cheap diagnostic decides whether to pay for an expensive correction, is portable to other approximation shortcuts in the field.

Frequently Asked Questions

Does the HVP correction hold up on frontier-scale models beyond 9B parameters?

The paper evaluates five model families capped at 9B parameters using both random-token and naturalistic name-swap perturbations. Behavior at 70B or trillion-parameter scale is unverified. Non-linearity effects could compound across more transformer layers at greater depth, so the favorable one-extra-backward-pass cost ratio might not hold when the downstream path spans over a hundred layers.

How many passes does Integrated Gradients actually require compared to multi-step HVP?

Integrated Gradients numerically integrates gradients along an interpolation path between the input and a baseline, which typically takes 20 to 300 gradient evaluations (each a forward and a backward pass) depending on the chosen step count. Multi-step HVP adds one backward pass per iteration, so a 10-step variant costs roughly 10 extra backward passes. By raw pass count that is somewhere between 2x and 30x cheaper than IG on the same model, depending on how many steps IG needs to converge.
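
To make the per-step cost concrete, here is a minimal sketch of Integrated Gradients applied along the straight path from a clean to a corrupted activation, using a midpoint Riemann sum. The toy `metric`, its weights, and the function names are illustrative assumptions; the point is only that each step of the sum needs its own gradient evaluation.

```python
import math

# Made-up toy weights standing in for the downstream network.
W = [[0.5, -0.3], [0.2, 0.8]]
v = [1.0, -0.5]


def metric(a):
    return sum(vj * math.tanh(sum(w * x for w, x in zip(row, a)))
               for vj, row in zip(v, W))


def metric_grad(a):
    """Analytic gradient; plays the role of one forward plus backward pass."""
    grad = [0.0] * len(a)
    for vj, row in zip(v, W):
        z = sum(w * x for w, x in zip(row, a))
        sech2 = 1.0 - math.tanh(z) ** 2
        for i, w in enumerate(row):
            grad[i] += vj * sech2 * w
    return grad


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def integrated_gradients_effect(a_clean, a_corrupt, n_steps):
    """Total IG attribution for the component, summed over its dimensions.

    Averages grad . d at n_steps midpoints along the path, so the cost is
    n_steps gradient evaluations.
    """
    d = [c - x for c, x in zip(a_corrupt, a_clean)]
    total = 0.0
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [x + alpha * di for x, di in zip(a_clean, d)]
        total += dot(metric_grad(point), d)  # one gradient evaluation per step
    return total / n_steps


a_clean, a_corrupt = [0.1, 0.2], [2.0, -1.5]
true_effect = metric(a_corrupt) - metric(a_clean)
d = [c - x for c, x in zip(a_corrupt, a_clean)]

first_order_err = abs(dot(metric_grad(a_clean), d) - true_effect)  # 1 evaluation
ig_err = abs(integrated_gradients_effect(a_clean, a_corrupt, 20) - true_effect)  # 20
print(first_order_err, ig_err)
```

On this toy, 20 steps are enough for the path integral to recover the exact effect closely even for a large perturbation, at 20 times the gradient cost of the first-order estimate. That ratio is the budget any cheaper correction is competing against.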

What happens if the reliability score misses a bad component?

A false negative on the score means the practitioner trusts an uncorrected first-order estimate that is actually wrong. That outcome is arguably worse than the pre-paper baseline, where teams using frameworks like HAP (NeurIPS 2025) at least treated attribution patching as a known-unreliable pre-filter. Introducing a diagnostic that occasionally fails can create misplaced confidence in results that would otherwise be viewed skeptically.

Which model components are most likely to trigger the reliability score?

Nanda’s original 2023 observations suggest the error scales with activation magnitude: individual attention-head outputs are small enough for the linear approximation to hold, while full residual-stream positions and later-layer outputs are where downstream non-linearity compounds most. Residual-stream patches and late-layer MLP outputs are the strongest candidates for flagging, though the paper does not publish a component-type breakdown in its abstract.