Do LLMs Know What Not to Say? Causal Evidence for Statistical Preemption

Large language models suppress continuations they have learned are wrong, and they acquire that suppression during pretraining, not because alignment training taught them to refuse. arXiv:2605.23039, accepted at CoNLL 2026, provides the first causal evidence for statistical preemption in LLMs: a mechanism in which exposure to one grammatical form (say, “donated X to Y”) actively suppresses an unattested alternative (“donated Y X”). The suppression operates inside the model’s representations, not at the output layer where most safety interventions are applied.

What statistical preemption is

Statistical preemption is the hypothesis that language learners avoid producing unattested constructions because they have encountered a competing form often enough that the alternative gets implicitly blocked. You hear “he explained the problem to me” hundreds of times; you never hear “he explained me the problem.” The frequency of the first form preempts the second.

The concept comes from construction grammar and has been debated in psycholinguistics for decades. What Guo, Wu, and Yiu have done is test whether LLMs exhibit the same pattern, and crucially, whether the pattern is causal rather than correlational.

The four-experiment design

The paper runs four experiments, each building on the last.

Experiment 1: Correlation. The authors measure how well LLM surprisal (the negative log of the probability the model assigns to a token, so less expected tokens score higher) correlates with human acceptability judgments across 120 English verb-construction pairings, validated against three independent behavioral datasets. The correlation: r=0.79. Constructions that the model finds surprising line up with the ones humans find unacceptable.
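To make the measurement concrete, here is a minimal sketch of surprisal and a Pearson correlation. The probabilities and human ratings are invented for illustration and are not data from the paper. Scoring the human side as "unacceptability", so that higher means worse, is also an assumption about sign convention.

```python
import math

def surprisal_bits(prob: float) -> float:
    """Surprisal of an outcome with probability `prob`, in bits."""
    return -math.log2(prob)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical model probabilities for four verb-construction pairings.
probs = [0.20, 0.05, 0.01, 0.001]
# Hypothetical human unacceptability ratings for the same items (higher = worse).
unacceptability = [1.0, 2.5, 4.0, 6.5]

surprisals = [surprisal_bits(p) for p in probs]
r = pearson_r(surprisals, unacceptability)
print(f"r = {r:.2f}")
```

In a real replication the probabilities would come from a language model's per-token log-probabilities, typically summed over the tokens of the construction, rather than from a hand-written list.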

Experiment 2: Dissociation. A correlation could be explained by simple frequency: common verbs get learned better. The authors use non-circular partial correlations to separate competing-form frequency from overall verb frequency. The result (r_partial=0.58) confirms that preemption is driven by how often the model has seen the competing form, not by how common the verb is generally.
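The idea of partialling out verb frequency can be sketched with the textbook first-order partial correlation. This is not necessarily the authors' exact "non-circular" procedure, which the summary above does not spell out. All variable names and numbers below are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed from both.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical per-verb measurements (illustrative only, not from the paper).
competing_logfreq = [2.0, 3.1, 1.2, 4.0, 2.7, 3.6]  # log frequency of the competing form
verb_logfreq = [5.0, 5.5, 4.1, 6.2, 6.0, 5.1]       # log frequency of the verb overall
alt_surprisal = [6.1, 8.0, 4.9, 9.7, 7.2, 8.8]      # surprisal of the unattested form

r_partial = partial_corr(competing_logfreq, alt_surprisal, verb_logfreq)
print(f"r_partial = {r_partial:.2f}")
```

If the partial correlation stays well above zero after controlling for overall verb frequency, the competing form is carrying signal that general frequency does not explain.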

Experiment 3: Scaling. Across 14 tested models, preemption sensitivity scales as a power law with model size. Larger models develop stronger suppression of unattested alternatives. The paper does not name which specific models were tested, making it difficult to assess whether the scaling holds across architectural families or only within a single lineage.
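A power law of the form sensitivity = a * N^b becomes a straight line in log-log space, so a common way to estimate the exponent is ordinary least squares on the logs. The sketch below assumes that approach and uses synthetic values. The paper's fitting procedure, model sizes, and exponent are not given in this summary.

```python
import math

def fit_power_law(sizes, values):
    """Fit values ~ a * sizes**b by least squares in log-log space; return (a, b)."""
    log_x = [math.log(s) for s in sizes]
    log_y = [math.log(v) for v in values]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
        / sum((x - mean_x) ** 2 for x in log_x)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope

# Synthetic example: parameter counts and a made-up sensitivity score per model.
param_counts = [1e8, 1e9, 1e10, 1e11]
sensitivity = [0.11, 0.19, 0.35, 0.60]

a, b = fit_power_law(param_counts, sensitivity)
print(f"sensitivity ~ {a:.4f} * N^{b:.3f}")
```

A fit like this says nothing about architecture: if all 14 points come from one model family, the exponent describes that family only.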

Experiment 4: Causation. A controlled fine-tuning experiment manipulates competing-form frequencies directly and observes whether preemption behavior shifts in the predicted direction. It does. Reverse-direction controls rule out the confound that models are simply more sensitive to frequency in general.

The study covers three construction types (dative, causative, and locative), with the dative set comprising 80 items. This is broader than prior work, which largely focused on dative alternation alone.

What the fine-tuning experiment actually proves

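This is the paper’s strongest claim and deserves precision. The authors fine-tune models to change the relative frequencies of competing constructions and measure whether surprisal patterns shift accordingly. When competing-form frequency increases, the model suppresses the unattested alternative more aggressively, which shows up as higher surprisal on that alternative. Because the frequencies are manipulated directly rather than merely observed, the shift can be attributed to the competing form and not to some correlated property of the verb. The reverse-direction control shows this is not just general frequency sensitivity.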
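The predicted direction of the effect can be illustrated with a deliberately simple toy: a count-based model with add-alpha smoothing over two competing forms of one verb. This is not the authors' fine-tuning setup, and the counts and the `alpha` value are assumptions chosen for illustration. The toy is frequency-sensitive by construction, so it shows only which way surprisal should move and does not model the paper's reverse-direction control.

```python
import math

def alt_surprisal_bits(competing_count: int, alt_count: int = 0, alpha: float = 1.0) -> float:
    """Surprisal (bits) of the unattested alternative under add-alpha smoothing.

    The toy model chooses between exactly two forms for a given verb:
    the competing (attested) form and the alternative.
    """
    p_alt = (alt_count + alpha) / (competing_count + alt_count + 2 * alpha)
    return -math.log2(p_alt)

# Hypothetical exposure counts for the competing form, e.g. "donated X to Y".
baseline = alt_surprisal_bits(competing_count=50)
boosted = alt_surprisal_bits(competing_count=500)  # more exposure to the competitor
reduced = alt_surprisal_bits(competing_count=5)    # less exposure to the competitor

shift_up = boosted - baseline    # change in surprisal after boosting the competitor
shift_down = reduced - baseline  # change in surprisal after reducing it

print(f"boosting the competitor shifts surprisal by {shift_up:+.2f} bits")
print(f"reducing the competitor shifts surprisal by {shift_down:+.2f} bits")
```

In the real experiment the same before-and-after comparison would be made on a fine-tuned LLM's surprisal, with the manipulated corpus standing in for the counts used here.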

Why output-layer interventions may be insufficient

Most production hallucination mitigations operate at or near the output: refusal classifiers, output filters, decoding constraints, and RLHF-style reward models that shape what the model emits. If statistical preemption operates deeper in the model’s representations during the forward pass, then interventions that only modify the final logits or apply a post-hoc refusal head are addressing a different layer of the problem.

The power-law scaling result matters here. If larger models develop stronger internal suppression of alternatives they have learned are wrong, a safety intervention that only retrains the output head is working against, or at least independently of, a mechanism that strengthens on its own as model capacity grows. If that is right, post-hoc fine-tuning has more to overcome as models scale, which raises its cost.

This is where a companion paper from the same submission window becomes relevant. arXiv:2605.22873 demonstrates that LLM reasoning is a dynamic decoding state detectable via early entropy dynamics, achieving 41-55% token reduction with accuracy gains. Both papers treat model behavior as something that unfolds across layers during the forward pass, not something that only exists at the output token. Evaluating models by what they emit on the surface misses where the decision actually gets made.

What this means for evaluation

Refusal-rate benchmarks tell you what the model outputs, not why. A model that refuses because of an output-layer classifier and a model that refuses because of internal suppression of the harmful continuation look identical on standard benchmarks.

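Causal probing methods, which intervene on activations and measure the effect on downstream behavior, can distinguish these cases. The fine-tuning experiment in this paper intervenes on training data rather than activations, but it follows the same logic (manipulate one factor, then measure the behavioral shift) and is a template for that kind of evaluation.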
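As a rough illustration of an activation-level intervention, the sketch below zero-ablates one hidden unit at a time in a tiny hand-built network and records how the output logit changes. The network, its weights, and the choice of zero-ablation are all assumptions made for the example. Real probes operate on a trained model's activations and often use more careful interventions than setting a unit to zero.

```python
def forward(x, w_in, w_out, ablate=None):
    """One hidden ReLU layer followed by a single output logit.

    If `ablate` is an index, that hidden unit is zeroed before the output
    layer. This zeroing is the intervention.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if ablate is not None:
        hidden[ablate] = 0.0
    return sum(w * h for w, h in zip(w_out, hidden))

def ablation_effects(x, w_in, w_out):
    """Change in the output logit caused by zero-ablating each hidden unit."""
    base = forward(x, w_in, w_out)
    return [forward(x, w_in, w_out, ablate=i) - base for i in range(len(w_in))]

# Hypothetical toy weights: hidden unit 1 pushes the logit down strongly.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [0.5, -2.0, 0.1]
x = [1.0, 1.0]

effects = ablation_effects(x, w_in, w_out)
strongest = max(range(len(effects)), key=lambda i: abs(effects[i]))
print(f"unit {strongest} has the largest causal effect on the logit")
```

A unit whose removal raises the logit was suppressing that continuation internally. An output-only benchmark cannot see that distinction, because it observes the final logit and nothing upstream of it.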

Power-law scaling of preemption suggests larger models may have stronger internal guardrails that are invisible to output-only evaluations. Teams decommissioning safety filters as models grow may be removing redundant surface measures while the underlying mechanism strengthens on its own.

The deeper pattern

Both arXiv:2605.23039 and arXiv:2605.22873 point toward the same conclusion: LLMs develop structured internal representations during pretraining that go beyond next-token prediction. Preemption is not a learned refusal behavior. It is a statistical consequence of exposure to distributional competition in the training data. The model does not need to be told what not to say. It needs to have seen enough of what is said instead.

For teams building on top of LLMs, the takeaway is methodological. If suppression lives in the representations rather than the output head, the evaluation stack needs causal probes and layer-level analysis, not just surface benchmarks. The CoNLL 2026 acceptance of this work signals that computational linguistics is already moving in that direction. Engineering teams building safety and hallucination mitigations should follow.

Frequently Asked Questions

How does preemption differ from entrenchment, and why does that distinction matter?

Entrenchment predicts that frequent exposure to a form makes it easier to produce. Preemption makes the stronger claim that exposure to one form actively suppresses a competing alternative. The paper’s partial-correlation design (r_partial=0.58) dissociates the two, indicating that models are not just better at producing what they have seen often but specifically worse at producing forms for which they have seen a competitor. Psycholinguistics debated these mechanisms theoretically for decades; the partial correlations separate them observationally, and the fine-tuning experiment supplies the first causal evidence separating them in LLMs.

Does statistical preemption generalize beyond English?

The paper tests only English constructions (dative, causative, locative). Languages with freer word order, richer morphology, or different construction inventories may exhibit different preemption dynamics. The power-law scaling across 14 models reflects English training data distributions. Cross-linguistic validation would be needed before assuming the mechanism operates identically in, say, agglutinative languages where grammatical roles are marked morphologically rather than through construction alternation.

Which specific models were tested, and does the power law hold across architectures?

The paper does not name the 14 models used for the scaling experiment. If all tested models are transformer-family architectures trained on similar English corpora, the power law could reflect shared inductive biases rather than a universal property of language modeling. Whether state-space models (e.g., Mamba), mixture-of-experts designs, or models with different tokenization schemes show the same scaling remains an open question the authors do not address.

What would a causal safety probe look like in practice?

The fine-tuning intervention works because grammatical alternations provide clean competing pairs with measurable frequency ratios. Safety domains lack this structure: a hallucinated statistic or harmful response does not have a single canonical rival whose frequency can be cleanly manipulated. Building a causal probe for safety would require synthetic datasets where harmful and harmless continuations compete for the same prompt context, then measuring activation-level shifts as the competition ratio changes. This is feasible but requires safety-specific alternation taxonomies that do not currently exist.