Chain-of-thought corruption benchmarks test whether a model’s final answer survives injecting errors into its reasoning chain. The standard interpretation is that high survival rates mean the model relies on its own reasoning, while low rates mean it is easily misled by the corrupted text. Gabriel Garcia’s arXiv
.107991 shows the standard interpretation is wrong: the benchmark largely measures whether the final answer appears at the end of the chain, not whether the reasoning caused it.
What CoT Corruption Studies Claim to Measure
A typical corruption study works like this. Take a correct chain-of-thought (CoT) from GSM8K or MATH, splice in a false statement near the end of the reasoning, and check whether the model still outputs the correct final answer. If the answer flips, the corrupted step is labeled “causally load-bearing”; if it survives, the step is deemed non-critical. The aggregate suffix-survival score (how often the correct answer holds when the corruption is at the end) is taken as a proxy for reasoning faithfulness.
The assumption is that the model reads the chain, weighs the reasoning steps, and either follows the corrupted suffix or overrides it based on the prior correct reasoning. Garcia shows this assumption collapses on contact with a trivial control: remove only the final answer statement while preserving every reasoning step, and see what happens.
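The two manipulations described above, suffix corruption and the answer-stripping control, can be sketched as plain string operations. This is a minimal illustration, not Garcia's implementation: the function names, the answer-line pattern, and the choice to place the false statement directly before the answer line are all assumptions made for the example.

```python
import re
from typing import List, Optional, Tuple

# Assumed pattern for a GSM8K-style final answer line (hypothetical).
FINAL_ANSWER_RE = re.compile(r"^\s*(the (final )?answer is|####)\s*", re.IGNORECASE)


def split_chain(chain: str) -> Tuple[List[str], Optional[str]]:
    """Split a chain into reasoning steps and an optional final answer line."""
    lines = [ln for ln in chain.strip().splitlines() if ln.strip()]
    if lines and FINAL_ANSWER_RE.match(lines[-1]):
        return lines[:-1], lines[-1]
    return lines, None


def corrupt_suffix(chain: str, false_statement: str) -> str:
    """Insert a false statement as the last reasoning step, before any answer line."""
    steps, answer = split_chain(chain)
    parts = steps + [false_statement] + ([answer] if answer is not None else [])
    return "\n".join(parts)


def strip_final_answer(chain: str) -> str:
    """Remove only the final answer statement, keeping every reasoning step."""
    steps, _ = split_chain(chain)
    return "\n".join(steps)
```

Running the same corruption on both the original chain and the stripped chain is what separates sensitivity to the reasoning from sensitivity to the answer text.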
The Format Confound: Answer-Text Readout Dominance
The confound is positional. In standard CoT benchmarks, the final answer appears in the last line or final sentence. Corruption probes that inject errors into the suffix therefore also move the answer text away from the end of the context. Garcia’s key experiment asks: does the drop in accuracy reflect the model rejecting bad reasoning, or does it reflect the model defaulting to the last explicit answer it sees?
Stripping just the final answer statement from GSM8K chains (leaving all reasoning intact) reduced suffix-corruption sensitivity by roughly 19× for Qwen 2.5-3B¹, from a pronounced drop under corruption to a nearly flat line. A within-stable replication at 7B showed 9.3× attenuation (N=76, p=7.8×10⁻³)¹. The model did not need the final answer to reason correctly, but it needed the final answer to be at the end in order for the corruption benchmark to register any effect at all.
On MATH with DeepSeek-R1-7B¹, the same manipulation produced a 10.9× recovery in suffix-survival¹, confirming the effect crosses benchmarks and model families. And on suffix-free chains, Phi-3-mini¹ flipped the pattern entirely: prefix corruption became load-bearing (Δ=−0.77, p<10⁻¹²)¹, inverting the usual suffix-sensitivity story. The position of the answer text, not the causal structure of the reasoning, is what drives the score.
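One way to read an "N× attenuation" figure is as a ratio of two accuracy drops: the drop under suffix corruption when the answer line is present, divided by the drop when it has been stripped. The sketch below assumes that definition, which the article does not spell out, and uses placeholder accuracies that are not taken from the paper.

```python
def corruption_sensitivity(acc_clean: float, acc_corrupted: float) -> float:
    """Accuracy lost when the suffix is corrupted (assumed definition)."""
    return acc_clean - acc_corrupted


def attenuation_ratio(sens_with_answer: float, sens_without_answer: float) -> float:
    """How many times smaller the sensitivity becomes once the answer line is removed."""
    if sens_without_answer <= 0:
        raise ValueError("sensitivity without the answer line must be positive")
    return sens_with_answer / sens_without_answer


# Placeholder numbers for illustration only, not results from the paper.
with_answer = corruption_sensitivity(acc_clean=0.90, acc_corrupted=0.50)
without_answer = corruption_sensitivity(acc_clean=0.90, acc_corrupted=0.85)
ratio = attenuation_ratio(with_answer, without_answer)
print(f"attenuation: {ratio:.1f}x")
```

A large ratio means most of the measured sensitivity disappears with the answer line, which is the signature of a format effect rather than a reasoning effect.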
The Experiments: 19× Collapse and the Conflicting-Answer Test
Garcia’s second test is blunter. Take a correct reasoning chain and append an explicit wrong final answer. The reasoning says 42; the final line says 17. Across five open-weight model families at 7B, causal-consumption accuracy under this conflicting-answer prompt fell to ≤0.021. The models followed the explicitly stated wrong answer almost without exception.
The wrong-answer following rate ranged from 0.63 to 1.00 at 3B to 7B¹ and attenuated sharply at larger scales, dropping to roughly 0.01 at Qwen 2.5-32B¹. Small models read the last answer and echo it. Larger models appear to do something closer to actual reasoning, or at least are less mechanically anchored to the final line.
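The conflicting-answer test and its following rate can be sketched as below. The `ConflictItem` type, the prompt template, and the `echo_last_answer` stub are hypothetical stand-ins for illustration; a real run would replace the stub with a call to the model under test and a proper answer parser.

```python
from typing import Callable, Iterable, NamedTuple


class ConflictItem(NamedTuple):
    question: str
    reasoning: str  # correct chain, without its final answer line
    correct: str    # answer the reasoning supports
    wrong: str      # contradictory answer appended as the last line


def build_conflicting_prompt(item: ConflictItem) -> str:
    """Correct reasoning followed by an explicit wrong final answer."""
    return f"{item.question}\n{item.reasoning}\nThe answer is {item.wrong}."


def wrong_answer_following_rate(
    items: Iterable[ConflictItem], model: Callable[[str], str]
) -> float:
    """Fraction of items where the model outputs the appended wrong answer."""
    items = list(items)
    if not items:
        raise ValueError("need at least one item")
    followed = sum(
        1 for it in items if model(build_conflicting_prompt(it)).strip() == it.wrong
    )
    return followed / len(items)


def echo_last_answer(prompt: str) -> str:
    """Stub 'model' that repeats the last token of the prompt (pure readout)."""
    return prompt.rstrip().rstrip(".").rsplit(" ", 1)[-1]
```

A model that behaves like `echo_last_answer` scores 1.0 on this metric regardless of what the reasoning says, which is the behavior reported for the smaller models.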
Generation-time probes add the crucial mechanism. Garcia checked whether models pre-commit to a final answer before generating the reasoning that justifies it. They do not: pre-commitment occurred in fewer than 5% of cases¹. The confound is a consumption-time readout effect. The model is not generating reasoning to support a pre-selected answer; it is selecting the answer at readout time, and the last explicit candidate wins.
Scale Matters: Why 3B to 7B Is Where the Confound Hurts Most
The scale curve is the most policy-relevant finding. At 3B to 7B, which is where most open-weight deployments live, the confound is severe¹. Wrong-answer following rates are high, suffix-sensitivity scores are format artifacts, and published faithfulness numbers systematically overstate how much the model is doing with the reasoning text.
At 32B, the confound attenuates. The model is less likely to follow a contradictory final answer, and corruption benchmarks may start to measure something closer to genuine reasoning sensitivity. But “attenuates” is not “disappears,” and the literature’s headlined numbers are not cleanly partitioned by scale. A 7B result gets cited as representative of “LLM reasoning” without the qualifier that the effect is scale-dependent and shrinks sharply at larger model sizes.
For teams choosing between 7B and 70B for a reasoning pipeline, the implication is concrete: a corruption-test score at 7B tells you about answer-placement formatting, not about whether your orchestration layer can trust the model to override a bad intermediate step. Do not calibrate safety thresholds on these numbers without the question-only control.
What This Means for Process Reward Models
Garcia extends the critique to process reward models (PRMs), which assign per-step credit during training. If positional sensitivity tracks answer-text placement, PRM step-level scores may reward consistency with the final answer text rather than the causal contribution of a reasoning step. A step that happens to align with the last-line answer gets positive credit; a step that is actually load-bearing but diverges from the final answer text gets penalized.
This is not a theoretical concern. PRM training pipelines already struggle with credit assignment in long chains. Adding a format confound that correlates step position with answer proximity makes the problem worse in a direction that is hard to detect without the explicit controls Garcia proposes. A PRM trained on corrupted chains where the answer is always at the end may learn to attend to answer-adjacent steps rather than reasoning-adjacent ones.
The New Minimum Bar: Three-Prerequisite Protocol
Garcia’s LastWordWinsCoT repository² bundles the paper with reference implementations of the three-prerequisite protocol he argues should precede any corruption-based faithfulness claim:
- Question-only control: Test whether the model can answer correctly from the question alone, without any CoT. If it can, the chain may be decorative and the corruption benchmark is measuring something other than reasoning dependence.
- Format characterization: Establish the format effect size by stripping, moving, or replacing the final answer statement. Quantify how much of the reported suffix sensitivity is answer-placement sensitivity.
- All-position sweep: Corrupt at every position, not just the suffix. If only suffix corruption produces an effect, the benchmark is likely reading out answer position, not reasoning structure.
These are not additions to the existing protocol. They replace its interpretive framework. A paper that reports suffix-survival without the question-only control and the all-position sweep is reporting a format effect and calling it reasoning faithfulness.
The paper’s title is dryly precise. “The Last Word Often Wins” is not a metaphor. It is a description of what the benchmark measures when the last word is an answer candidate and the model has no stronger signal to override it. Eval teams should treat every published CoT corruption score from the past two years as potentially contaminated by this confound until the three-prerequisite protocol has been applied.
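The three prerequisites can be combined into a single summary once the underlying measurements exist. The sketch below is not the reference implementation from the LastWordWinsCoT repository: the `PrerequisiteReport` fields, the `format_share` formula, and the 0.05 effect threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PrerequisiteReport:
    question_only_accuracy: float  # accuracy with no chain in the prompt
    format_share: float            # share of suffix sensitivity tied to the answer line
    suffix_only: bool              # True if only the last position shows an effect


def run_prerequisites(
    question_only_correct: Sequence[bool],
    suffix_drop_with_answer: float,
    suffix_drop_without_answer: float,
    drop_by_position: Sequence[float],
    effect_threshold: float = 0.05,  # assumed cutoff, not from the paper
) -> PrerequisiteReport:
    if not question_only_correct or not drop_by_position:
        raise ValueError("inputs must be non-empty")
    if suffix_drop_with_answer <= 0:
        raise ValueError("suffix drop with the answer line must be positive")
    # 1. Question-only control: can the model answer without any chain?
    q_acc = sum(question_only_correct) / len(question_only_correct)
    # 2. Format characterization: how much sensitivity vanishes with the answer line?
    format_share = 1.0 - suffix_drop_without_answer / suffix_drop_with_answer
    # 3. All-position sweep: does any position other than the last one matter?
    *earlier, suffix = drop_by_position
    suffix_only = suffix >= effect_threshold and all(
        d < effect_threshold for d in earlier
    )
    return PrerequisiteReport(q_acc, format_share, suffix_only)
```

A report with high `question_only_accuracy`, a `format_share` near 1, and `suffix_only` set to `True` would indicate that a suffix-survival score on that benchmark is mostly reading out answer position.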
Frequently Asked Questions
Does the format confound affect all chain-of-thought faithfulness methods, or only corruption benchmarks?
The critique is narrowly scoped to corruption and erasure benchmarks, which rely on position-specific perturbation and output-flip readouts. Other faithfulness probes—contrastive consistency checks, attention-head attribution, token-level saliency mapping—do not depend on where the answer text sits in the chain, so they are not directly vulnerable to this confound. However, none of those alternative methods have been validated against Garcia’s three-prerequisite protocol either, so their immunity is theoretical rather than empirically confirmed.
How much compute does the all-position sweep add compared to a suffix-only corruption run?
Corrupting at every token position multiplies eval compute roughly by chain length. For GSM8K, where chains average around 100 tokens, that is approximately 100× the cost of a suffix-only test. The question-only control, by contrast, adds negligible overhead since it requires no chain at all—just the bare prompt. In practice, the sweep dominates the protocol’s compute budget and may be infeasible for very long reasoning traces without sampling approximations.
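The cost arithmetic can be made explicit by counting model evaluations, assuming one evaluation per corrupted chain and roughly equal cost per evaluation. The `stride` parameter is a hypothetical example of a sampling approximation (corrupt every k-th position), and the problem count is a placeholder.

```python
def sweep_evaluations(num_problems: int, chain_length: int, stride: int = 1) -> int:
    """Model evaluations for a sweep that corrupts every `stride`-th position."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    positions = len(range(0, chain_length, stride))
    return num_problems * positions


def suffix_only_evaluations(num_problems: int) -> int:
    """Model evaluations for a suffix-only run: one corruption per problem."""
    return num_problems


# Placeholder workload: 1,000 problems with 100-token chains.
full = sweep_evaluations(1000, 100)
sampled = sweep_evaluations(1000, 100, stride=10)
baseline = suffix_only_evaluations(1000)
print(full // baseline, sampled // baseline)
```

Under these assumptions the full sweep costs 100 times the suffix-only run, and a stride of 10 brings that down to 10 times at the price of coarser positional resolution.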
Has anyone confirmed whether the confound fully disappears at 70B or larger model scales?
Garcia’s experiments top out at Qwen 2.5-32B, where wrong-answer following dropped to roughly 0.01 but did not reach zero. No published data exists for 70B+ open-weight models, so any claim that the effect vanishes at production scale is extrapolation. Teams running corruption benchmarks on 70B+ deployments should still apply the three-prerequisite protocol rather than assuming scale alone eliminates the confound.
Could PRMs trained on format-contaminated chains make unfaithfulness harder to detect in future models?
Yes. A PRM that systematically rewards answer-adjacent reasoning steps would produce models whose chains look faithful—they agree with the final answer—but whose causal reasoning structure remains shallow. The eval tool and the training signal would share the same positional blind spot, making the confound self-reinforcing across model generations. Detecting this compounding effect would require auditing the PRM’s step-level credit assignments against the all-position sweep, not just re-running the corruption benchmark on the trained model.
What does the question-only control reveal that a standard accuracy baseline does not?
A standard accuracy baseline tells you whether the model gets the right answer with CoT. The question-only control tells you whether the model needs the chain at all. If a 7B model solves a GSM8K problem correctly from the prompt alone, the entire reasoning chain may be post-hoc rationalization, and any corruption-test result on that chain is uninformative. This diagnostic predates Garcia’s paper but was never systematically incorporated into corruption-study protocols, which is why the confound went undetected across multiple published benchmarks.