Anthropic Scaled Sparse Autoencoders to Claude 3 Sonnet. Interpretability Now Costs Compute

Anthropic researchers have extracted 34 million interpretable features from Claude 3 Sonnet using sparse autoencoders, the largest dictionary-learning decomposition ever applied to a production language model. The work, published as arXiv:2605.29358 on May 28, answers a question that has hovered over mechanistic interpretability for years: whether these methods work beyond toy models. They do. The cost is that training the autoencoders is itself a compute-intensive operation governed by scaling laws, which moves interpretability from a cheap post-hoc analysis step to a first-class budget line.

34 million features from a production model

The paper, “Scaling Monosemanticity,” trains sparse autoencoders (SAEs) on the middle-layer residual stream of Claude 3 Sonnet. The largest autoencoder configuration extracts 34 million features, each corresponding to a human-interpretable pattern in the model’s internal activations: specific entities, abstract concepts, code errors, sarcasm, and more. According to the paper, these features are multilingual and multimodal, responding to images despite the autoencoders being trained exclusively on text data.
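To make the mechanism concrete, here is a minimal sketch of a sparse autoencoder's forward pass in plain Python. It is a toy under stated assumptions, not the paper's implementation: the class name `TinySAE`, the dimensions, the random initialization, and the `l1_coeff` value are all hypothetical, the loss is a simplified reconstruction-plus-L1 objective, and the training loop is omitted, so the features it produces are not yet sparse or interpretable.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Toy sparse autoencoder: d_model inputs, n_features dictionary entries."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_model)
        # Encoder is n_features x d_model; decoder is d_model x n_features.
        self.w_enc = [[rng.gauss(0, scale) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0, scale) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map an activation vector to non-negative feature activations (ReLU)."""
        pre = [p + b for p, b in zip(matvec(self.w_enc, x), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, features):
        """Reconstruct the activation vector from feature activations."""
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Squared reconstruction error plus an L1 sparsity penalty."""
        features = self.encode(x)
        x_hat = self.decode(features)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(features)  # features are >= 0, so this is the L1 norm
        return recon + l1_coeff * sparsity


# Hypothetical 8-dimensional "residual stream" vector and a 32-entry dictionary.
sae = TinySAE(d_model=8, n_features=32)
activation = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 0.75, -0.25]
features = sae.encode(activation)
reconstruction = sae.decode(features)
```

The dictionary is deliberately wider than the input (32 features for 8 dimensions here). The sparsity penalty is what pushes training toward solutions in which only a few features fire for any given input, and those few are the ones a person can then try to name.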

The 26-author team includes Chris Olah, an Anthropic co-founder who leads the company’s interpretability research, and Tom Henighan. As of May 31, 2026, the paper is a preprint on arXiv, categorized under cs.AI; arXiv is a moderated repository, not a peer-reviewed venue.

Dictionary learning, past the research-demo tier

Mechanistic interpretability has spent years in a frustrating state: promising results on small transformers, no clear evidence the methods survive contact with a production-scale model. The “Scaling Monosemanticity” paper directly addresses this gap. Dictionary learning, the family of techniques that decompose a model’s activations into sparse, interpretable directions, had previously been demonstrated on models small enough that researchers could manually verify every feature. Scaling to a model like Claude 3 Sonnet (whose parameter count Anthropic has not publicly disclosed) required systematic hyperparameter sweeps guided by scaling laws, rather than the ad-hoc tuning that worked at smaller sizes.

That scaling-law framing is the structural claim embedded in the paper. If SAE performance is predictable enough to be governed by scaling laws, then extracting interpretable features from a larger model is not a research gamble but an engineering problem with a known cost curve. The paper does not disclose training compute cost, wall-clock time, or the number of autoencoder sizes swept, so the exact price tag remains unknown. The implication is still clear: the compute cost scales with model size, and for a frontier model, it will be substantial.
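What "a known cost curve" means in practice can be sketched as a power-law fit. The example below fits loss = a * compute^(-b) by linear regression in log-log space and extrapolates to a larger budget. The functional form, the function name `fit_power_law`, and every number are assumptions for illustration: the data points are synthetic, generated from a chosen power law, and are not measurements from the paper.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from loss = 5 * C**-0.1; NOT data from the paper.
compute_budgets = [1e15, 1e16, 1e17, 1e18]
losses = [5.0 * c ** -0.1 for c in compute_budgets]

a, b = fit_power_law(compute_budgets, losses)

# Extrapolate to a budget ten times larger than any point used in the fit.
predicted = a * (1e19) ** (-b)
```

Because the synthetic data lies exactly on a power law, the fit recovers the generating parameters. Real sweeps are noisy, and the practical value of a scaling law depends on how well the fitted curve keeps holding outside the range that was measured.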

Safety features as mechanistic audit levers

Among the 34 million extracted features, the authors identify clusters tied to safety-relevant behaviors: deception, power-seeking, sycophancy, and bias. The paper reports that manipulating these features causally influences model outputs in directions consistent with their interpretations. A feature identified as deception-related, when activated, steers the model toward deceptive outputs; the same holds for the other safety categories.
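A simplified sketch of what "manipulating a feature" can look like: shift the activation vector along the feature's decoder direction until the feature reaches a chosen value. This is an assumption-laden illustration rather than the paper's exact intervention; the function name `clamp_feature`, the three-dimensional vectors, and the activation values are all hypothetical.

```python
def clamp_feature(residual, current_activation, target_activation,
                  decoder_direction):
    """Shift a residual vector along one feature's decoder direction.

    The shift is proportional to the gap between the feature's current
    activation and the value it is being clamped to.
    """
    delta = target_activation - current_activation
    return [r + delta * d for r, d in zip(residual, decoder_direction)]


# Hypothetical values: a 3-dimensional residual vector and a unit-length
# decoder direction for a single feature.
residual = [0.2, -0.1, 0.4]
direction = [1.0, 0.0, 0.0]

steered = clamp_feature(
    residual,
    current_activation=0.5,
    target_activation=5.0,
    decoder_direction=direction,
)
```

The steered vector would then replace the original one in the model's forward pass, and the output would be compared against the unmodified run. That comparison is what supports a causal claim, as opposed to a merely correlational one.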

This is the section that will attract the most policy attention. If a lab can extract and name the specific internal directions corresponding to dangerous behaviors, those directions can function as audit checkpoints: activate the feature, measure the effect, and decide whether the model’s propensity for that behavior is within acceptable bounds. Anthropic, which operates as a public benefit corporation focused on AI safety, has a structural incentive to develop this capability.

But “can function as” is not the same as “reliably functions as.” The paper itself identifies two hard limitations.

Incomplete coverage, no faithfulness guarantee

Both constraints keep the results short of full model transparency.

First, the feature suite is incomplete. The 34 million features do not cover the entirety of the model’s computations; there are internal behaviors the SAE decomposition does not capture. Second, the paper lacks rigorous methods for evaluating whether individual features faithfully represent what the model is actually computing, as opposed to providing a post-hoc interpretation that happens to be plausible.

These limitations are not footnote caveats. Incomplete coverage means a safety audit based on extracted features could miss entire categories of concerning behavior. Lack of faithfulness evaluation means a feature that looks like “deception” might be capturing a correlated but causally distinct pattern. Any deployment of these features as safety guarantees would need to contend with both gaps.
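One common way to put a number on the coverage gap is the fraction of activation variance that the autoencoder's reconstruction fails to explain. The sketch below computes that quantity for a small batch. It is offered as an illustration of the idea, not as the evaluation the paper uses; the function name and the toy data are hypothetical.

```python
def fraction_variance_unexplained(activations, reconstructions):
    """Residual sum of squares divided by total variance around the mean.

    0.0 means the reconstruction is perfect; larger values mean more of the
    model's activation variance falls outside the learned dictionary.
    """
    n = len(activations)
    d = len(activations[0])
    mean = [sum(row[j] for row in activations) / n for j in range(d)]
    residual_ss = sum(
        (a - r) ** 2
        for row_a, row_r in zip(activations, reconstructions)
        for a, r in zip(row_a, row_r)
    )
    total_ss = sum(
        (a - m) ** 2 for row in activations for a, m in zip(row, mean)
    )
    return residual_ss / total_ss


# Toy batch of two 2-dimensional activation vectors and their reconstructions.
activations = [[0.0, 0.0], [2.0, 2.0]]
reconstructions = [[0.0, 0.0], [2.0, 1.0]]

fvu = fraction_variance_unexplained(activations, reconstructions)
perfect = fraction_variance_unexplained(activations, activations)
```

A metric like this only addresses coverage. A low unexplained fraction says the dictionary accounts for most of the activation variance, but says nothing about whether a feature's label matches what the model is computing, which is the separate faithfulness problem.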

Interpretability as a release gate

The strategic question the paper raises is not whether SAEs work at production scale (the evidence suggests they do). It is who pays for the compute.

If interpretability is a prerequisite for shipping a model, a release gate in practice, then every frontier model now carries two compute budgets: one for training the model, and one for training the autoencoders that make it auditable. The scaling-law results in the paper suggest the second budget grows with the first. For a lab shipping multiple model generations per year, this is a recurring cost.

If interpretability is a research curiosity instead, the cost still exists but can be amortized across publications rather than absorbed per release. The distinction is organizational: does the safety team have a seat at the release table with veto power backed by compute-budget allocation, or does it publish papers that the product team acknowledges and then ignores?

The paper does not take a position on this. By demonstrating that the compute cost is real and predictable, it forces the question onto the agenda of every lab that claims to care about model transparency. You can have auditable models. The bill arrives with the training run.

Frequently Asked Questions

Would the extracted features transfer to a different model like Claude 3 Opus or Llama?

No. Sparse autoencoders are trained on a specific model’s activation patterns, in this case Claude 3 Sonnet’s middle-layer residual stream. Each architecture and weight configuration produces distinct internal representations, so every new model requires retraining autoencoders from scratch. Labs shipping multiple model variants per year would incur the interpretability compute cost per variant, not once for the model family.

What does it mean that features trained on text also respond to images?

Individual features respond consistently to semantically equivalent content whether the input is text or images, despite the autoencoders seeing only text during training. This suggests the model’s representations are modality-agnostic at the probed layer. A feature catalog built from text probing could flag the same conceptual behavior triggered through image inputs, extending audit coverage beyond the modality used to train the feature detectors.

Can an external auditor reproduce this analysis without the model creator’s cooperation?

No. Training sparse autoencoders requires reading the model’s intermediate activations, which only the organization controlling the weights and inference infrastructure can expose. External auditors or regulators would need the lab to provide either raw activation access or pre-trained autoencoders with their feature catalogs. This places the audited party in control of the audit tooling, a structural conflict analogous to a company performing its own compliance review.

How do the two stated limitations compound each other in a safety audit?

Incomplete coverage means entire categories of concerning behavior may exist in pathways the SAE decomposition did not capture. Lack of faithfulness evaluation means even the extracted features might reflect correlated patterns rather than causal mechanisms. Together they produce a specific failure mode: an audit that clears all 34 million extracted features could still miss both whole categories of risk and individual features whose labels do not match what the model is actually computing.