The Last Word Often Wins: A Format Confound Inflates Chain-of-Thought Corruption Robustness Scores

Chain-of-thought corruption benchmarks test whether a model’s final answer survives injecting errors into its reasoning chain. The standard interpretation is that high survival rates mean the model relies on its own reasoning, while low rates mean it is easily misled by the corrupted text. Gabriel Garcia’s arXiv:2605.10799¹ shows the standard interpretation is wrong: the benchmark largely measures whether the final answer appears at the end of the chain, not whether the reasoning caused it.

What CoT Corruption Studies Claim to Measure

A typical corruption study works like this. Take a correct chain-of-thought (CoT) from GSM8K or MATH, splice in a false statement near the end of the reasoning, and check whether the model still outputs the correct final answer. If the answer flips, the corrupted step is labeled “causally load-bearing”; if it survives, the step is deemed non-critical. The aggregate suffix-survival score (how often the correct answer holds when the corruption is at the end) is taken as a proxy for reasoning faithfulness.
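The procedure above can be sketched in a few lines. This is a minimal illustration, not the harness from any cited study: the `Model` type (prompt in, final answer string out), the item layout, and the choice to replace the last step rather than insert a new one are all assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical interface: takes a full prompt, returns the final answer string.
Model = Callable[[str], str]

def corrupt_suffix(steps: Sequence[str], false_step: str) -> list[str]:
    """Replace the last reasoning step with a false statement (assumed corruption style)."""
    return [*steps[:-1], false_step]

def suffix_survival(model: Model, items: Sequence[tuple[str, list[str], str, str]]) -> float:
    """Fraction of items whose correct answer survives a corrupted suffix.

    Each item is (question, reasoning_steps, false_step, gold_answer).
    """
    survived = 0
    for question, steps, false_step, gold in items:
        prompt = "\n".join([question, *corrupt_suffix(steps, false_step)])
        survived += model(prompt).strip() == gold
    return survived / len(items)
```

A score near 1.0 is what the standard reading treats as evidence of reasoning faithfulness, which is exactly the inference the rest of this piece questions.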

The assumption is that the model reads the chain, weighs the reasoning steps, and either follows the corrupted suffix or overrides it based on the prior correct reasoning. Garcia shows this assumption collapses on contact with a trivial control: remove only the final answer statement while preserving every reasoning step, and see what happens.

The Format Confound: Answer-Text Readout Dominance

The confound is positional. In standard CoT benchmarks, the final answer appears in the last line or final sentence. Corruption probes that inject errors into the suffix therefore also move the answer text away from the end of the context. Garcia’s key experiment asks: does the drop in accuracy reflect the model rejecting bad reasoning, or does it reflect the model defaulting to the last explicit answer it sees?

Stripping just the final answer statement from GSM8K chains (leaving all reasoning intact) reduced suffix-corruption sensitivity by roughly 19× for Qwen 2.5-3B¹, from a pronounced drop under corruption to a nearly flat line. A within-stable replication at 7B showed 9.3× attenuation (N=76, p=7.8×10⁻³)¹. The model did not need the final answer to reason correctly, but it needed the final answer to be at the end in order for the corruption benchmark to register any effect at all.

On MATH with DeepSeek-R1-7B¹, the same manipulation produced a 10.9× recovery in suffix-survival¹, confirming the effect crosses benchmarks and model families. And on suffix-free chains, Phi-3-mini¹ flipped the pattern entirely: prefix corruption became load-bearing (Δ=−0.77, p<10⁻¹²)¹, inverting the usual suffix-sensitivity story. The position of the answer text, not the causal structure of the reasoning, is what drives the score.
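To make the attenuation figures concrete, here is a rough sketch of the strip-and-compare arithmetic. The answer-line pattern, the function names, and the accuracy values in the usage note are illustrative assumptions, not the paper's code or data; sensitivity is taken here to mean the accuracy drop under suffix corruption.

```python
import re

# Assumed answer-line markers; real chains may state the answer differently.
ANSWER_LINE = re.compile(r"^\s*(####|final answer\b|the answer is\b)", re.IGNORECASE)

def strip_final_answer(chain: str) -> str:
    """Drop the last line if it looks like an explicit answer statement."""
    lines = chain.rstrip().splitlines()
    if lines and ANSWER_LINE.match(lines[-1]):
        lines = lines[:-1]
    return "\n".join(lines)

def sensitivity(acc_clean: float, acc_corrupted: float) -> float:
    """Accuracy drop caused by suffix corruption."""
    return acc_clean - acc_corrupted

def attenuation(sens_with_answer: float, sens_stripped: float) -> float:
    """How many times smaller the corruption effect is once the answer line is gone."""
    if sens_stripped == 0:
        return float("inf")
    return sens_with_answer / sens_stripped
```

With made-up accuracies of 0.90 clean, 0.52 corrupted with the answer line, and 0.88 corrupted without it, `attenuation(sensitivity(0.90, 0.52), sensitivity(0.90, 0.88))` comes out at about 19, the same order as the Qwen 2.5-3B figure reported above.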

The Experiments: 19× Collapse and the Conflicting-Answer Test

Garcia’s second test is blunter. Take a correct reasoning chain and append an explicit wrong final answer. The reasoning says 42; the final line says 17. Across five open-weight model families at 7B, causal-consumption accuracy under this conflicting-answer prompt fell to ≤0.02¹. The models followed the explicitly stated wrong answer almost without exception.

The wrong-answer following rate ranged from 0.63 to 1.00 at 3B to 7B¹ and attenuated sharply at larger scales, dropping to roughly 0.01 at Qwen 2.5-32B¹. Small models read the last answer and echo it. Larger models appear to do something closer to actual reasoning, or at least are less mechanically anchored to the final line.
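A conflicting-answer probe of this kind might look like the sketch below. The `Final answer:` line format, the `Model` interface, and the item layout are assumptions for illustration; the paper's exact prompt template is not reproduced here.

```python
from typing import Callable, Sequence

# Hypothetical interface: takes a full prompt, returns the final answer string.
Model = Callable[[str], str]

def conflicting_prompt(question: str, steps: Sequence[str], wrong_answer: str) -> str:
    """Correct reasoning steps followed by an explicit wrong final answer line."""
    return "\n".join([question, *steps, f"Final answer: {wrong_answer}"])

def wrong_answer_following(model: Model, items: Sequence[tuple[str, list[str], str]]) -> float:
    """Fraction of items where the model echoes the appended wrong answer.

    Each item is (question, correct_reasoning_steps, wrong_answer).
    """
    followed = sum(
        model(conflicting_prompt(question, steps, wrong)).strip() == wrong
        for question, steps, wrong in items
    )
    return followed / len(items)
```

A rate near 1.0 means the model is reading the last answer line rather than the steps above it.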

Generation-time probes add the crucial mechanism. Garcia checked whether models pre-commit to a final answer before generating the reasoning that justifies it. They do not: pre-commitment occurred in fewer than 5% of cases¹. The confound is a consumption-time readout effect. The model is not generating reasoning to support a pre-selected answer; it is selecting the answer at readout time, and the last explicit candidate wins.

Scale Matters: Why 3B to 7B Is Where the Confound Hurts Most

The scale curve is the most policy-relevant finding. At 3B to 7B, which is where most open-weight deployments live, the confound is severe¹. Wrong-answer following rates are high, suffix-sensitivity scores are format artifacts, and published faithfulness numbers systematically overstate how much the model is doing with the reasoning text.

At 32B, the confound attenuates. The model is less likely to follow a contradictory final answer, and corruption benchmarks may start to measure something closer to genuine reasoning sensitivity. But “attenuates” is not “disappears,” and the literature’s headlined numbers are not cleanly partitioned by scale. A 7B result gets cited as representative of “LLM reasoning” without the qualifier that the effect is scale-dependent and shrinks by more than an order of magnitude between 7B and 32B.

For teams choosing between 7B and 70B for a reasoning pipeline, the implication is concrete: a corruption-test score at 7B tells you about answer-placement formatting, not about whether your orchestration layer can trust the model to override a bad intermediate step. Do not calibrate safety thresholds on these numbers without the question-only control.

What This Means for Process Reward Models

Garcia extends the critique to process reward models (PRMs), which assign per-step credit during training. If positional sensitivity tracks answer-text placement, PRM step-level scores may reward consistency with the final answer text rather than the causal contribution of a reasoning step. A step that happens to align with the last-line answer gets positive credit; a step that is actually load-bearing but diverges from the final answer text gets penalized.

This is not a theoretical concern. PRM training pipelines already struggle with credit assignment in long chains. Adding a format confound that correlates step position with answer proximity makes the problem worse in a direction that is hard to detect without the explicit controls Garcia proposes. A PRM trained on corrupted chains where the answer is always at the end may learn to attend to answer-adjacent steps rather than reasoning-adjacent ones.

The New Minimum Bar: Three-Prerequisite Protocol

Garcia’s LastWordWinsCoT repository² bundles the paper with reference implementations of the three-prerequisite protocol he argues should precede any corruption-based faithfulness claim:

Question-only control: Test whether the model can answer correctly from the question alone, without any CoT. If it can, the chain may be decorative and the corruption benchmark is measuring something other than reasoning dependence.
Format characterization: Establish the format effect size by stripping, moving, or replacing the final answer statement. Quantify how much of the reported suffix sensitivity is answer-placement sensitivity.
All-position sweep: Corrupt at every position, not just the suffix. If only suffix corruption produces an effect, the benchmark is likely reading out answer position, not reasoning structure.
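The third prerequisite is the easiest to sketch. The version below corrupts one reasoning step at a time rather than one token at a time, which is a simplification; the `Model` interface, the `corrupt` callback, and both function names are hypothetical and are not taken from the LastWordWinsCoT repository.

```python
from typing import Callable, Sequence

# Hypothetical interface: takes a full prompt, returns the final answer string.
Model = Callable[[str], str]

def position_sweep(
    model: Model,
    question: str,
    steps: Sequence[str],
    gold: str,
    corrupt: Callable[[str], str],
) -> list[bool]:
    """For each step position, report whether corrupting only that step flips the answer."""
    flips = []
    for i in range(len(steps)):
        perturbed = [*steps[:i], corrupt(steps[i]), *steps[i + 1:]]
        prompt = "\n".join([question, *perturbed])
        flips.append(model(prompt).strip() != gold)
    return flips

def only_suffix_matters(flips: Sequence[bool]) -> bool:
    """True when the last position is the only one whose corruption flips the answer."""
    return bool(flips) and flips[-1] and not any(flips[:-1])
```

If `only_suffix_matters` holds across most of a benchmark, that is the pattern the protocol treats as a sign of answer-position readout rather than reasoning structure.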

These are not additions to the existing protocol. They are a replacement of the existing protocol’s interpretive framework. A paper that reports suffix-survival without the question-only control and the all-position sweep is reporting a format effect and calling it reasoning faithfulness.

The paper’s title is dryly precise. “The Last Word Often Wins” is not a metaphor. It is a description of what the benchmark measures when the last word is an answer candidate and the model has no stronger signal to override it. Eval teams should treat every published CoT corruption score from the past two years as potentially contaminated by this confound until the three-prerequisite protocol has been applied.

What Landed Since: The Measurement Keeps Failing the Model

[Updated June 2026] Garcia’s paper now reads as one of three 2026 results that arrive at the same uncomfortable conclusion: the faithfulness number you publish depends more on how you measure than on what the model does. Richard Young’s classifier-sensitivity study ran identical chains through different faithfulness classifiers and watched the measured rate swing from 69.7% to 82.6%⁴, a spread wide enough to reverse model rankings, and concluded that published faithfulness numbers cannot be compared across studies that use different classifiers. C2-Faith, from Avni Mittal and Rauno Arike, splits the question another way, separating whether a judge can detect a causal error from whether it can localize one, and finds the two abilities come apart⁵. Garcia’s format confound, Young’s classifier variance, and the C2-Faith detection-versus-localization gap are three different demonstrations that the instrument is noisier than the signal it claims to read.

The mechanism is worth stating plainly, because it is what makes the confound portable across benchmarks. At readout, a small model treats the final explicit answer candidate as the highest-salience string to copy. It is not retrofitting reasoning to a pre-selected answer; the generation-time probes put pre-commitment below 5%¹. The selection happens when the chain is consumed, and the last answer-shaped token wins by default. Move that string, strip it, or contradict it, and the score moves with it, whether or not a single reasoning step changed. That is why stripping the final line cut suffix sensitivity by roughly 19×¹ instead of nudging it: the benchmark was reading the string, not the steps.

Why This Should Worry Safety Evals

CoT monitoring as a safety strategy rests on one assumption, that the visible chain is load-bearing, so reading it tells you what the model is actually doing. Garcia’s result undercuts that assumption exactly where it is cheapest to deploy. At 3B and 7B, the range where a monitor model might run inline on every request, the chain that a corruption benchmark “validates” can be a readout artifact rather than a causal trace, and a monitor that trusts the chain inherits the blind spot. This is the same failure mode Groundy covered when an LLM narrates a solver and the explanation drifts from the math: the prose tracks the answer, not the computation that produced it. It also sharpens the question raised by work on whether reasoning tokens actually make models safer, because a safety case built on visible reasoning is only as strong as the evidence that the reasoning is consumed rather than narrated after the fact.

The caveat cuts both ways. The confound is scoped to corruption and erasure benchmarks, and it weakens as models grow, so a 70B monitor reading a 70B policy model’s chain stands on firmer ground than the 7B numbers imply. But firmer is not validated, and nobody has run the three-prerequisite protocol on a 70B-plus monitor. Until someone does, a CoT-based safety eval that skips the question-only control is reporting how often the answer sits at the end of the chain, dressed up as how often the model means what it writes.

Frequently Asked Questions

Does the format confound affect all chain-of-thought faithfulness methods, or only corruption benchmarks?

The critique is narrowly scoped to corruption and erasure benchmarks, which rely on position-specific perturbation and output-flip readouts. Other faithfulness probes (contrastive consistency checks, attention-head attribution, token-level saliency mapping) do not depend on where the answer text sits in the chain, so they are not directly vulnerable to this confound. However, none of those alternative methods has been validated against Garcia’s three-prerequisite protocol either, so their immunity is theoretical rather than empirically confirmed.

How much compute does the all-position sweep add compared to a suffix-only corruption run?

Corrupting at every token position multiplies eval compute roughly by chain length. For GSM8K, where chains average around 100 tokens, that is approximately 100× the cost of a suffix-only test. The question-only control, by contrast, adds negligible overhead since it requires no chain at all, just the bare prompt. In practice, the sweep dominates the protocol’s compute budget and may be infeasible for very long reasoning traces without sampling approximations.
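The cost arithmetic is simple enough to write down. This sketch counts one model call per corrupted position and treats a suffix-only run as one call per problem; the `stride` parameter is an assumed form of sampling approximation, not something the paper prescribes.

```python
def sweep_multiplier(chain_len: int, stride: int = 1) -> int:
    """Model calls per problem for a position sweep, relative to one suffix-only call.

    stride=1 corrupts every token position; stride=k corrupts every k-th position.
    """
    if chain_len < 0 or stride < 1:
        raise ValueError("chain_len must be >= 0 and stride must be >= 1")
    return len(range(0, chain_len, stride))

def sweep_calls(n_problems: int, chain_len: int, stride: int = 1) -> int:
    """Total model calls for a sweep over a benchmark with uniform chain length."""
    return n_problems * sweep_multiplier(chain_len, stride)
```

For a 100-token chain, `sweep_multiplier(100)` gives the 100× figure quoted above, and `sweep_multiplier(100, stride=10)` cuts it to 10× at the price of coarser positional resolution.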

Has anyone confirmed whether the confound fully disappears at 70B or larger model scales?

Garcia’s experiments top out at Qwen 2.5-32B, where wrong-answer following dropped to roughly 0.01 but did not reach zero. No published data exists for 70B+ open-weight models, so any claim that the effect vanishes at production scale is extrapolation. Teams running corruption benchmarks on 70B+ deployments should still apply the three-prerequisite protocol rather than assuming scale alone eliminates the confound.

Could PRMs trained on format-contaminated chains make unfaithfulness harder to detect in future models?

Yes. A PRM that systematically rewards answer-adjacent reasoning steps would produce models whose chains look faithful, in that they agree with the final answer, but whose causal reasoning structure remains shallow. The eval tool and the training signal would share the same positional blind spot, making the confound self-reinforcing across model generations. Detecting this compounding effect would require auditing the PRM’s step-level credit assignments against the all-position sweep, not just re-running the corruption benchmark on the trained model.
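One way such an audit could be set up, offered as a sketch rather than a method from the paper: correlate per-step PRM credit with the per-step flip rates from an all-position sweep, and separately with raw step position. The function names and the use of Pearson correlation are assumptions chosen for simplicity.

```python
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def audit_prm(credits: Sequence[float], flip_rates: Sequence[float]) -> tuple[float, float]:
    """Return (correlation with causal effect, correlation with step position).

    credits[i] is the PRM score for step i; flip_rates[i] is how often
    corrupting step i flipped the answer in the sweep.
    """
    positions = list(range(len(credits)))
    return pearson(credits, flip_rates), pearson(credits, positions)
```

If the second number is high and the first is low or negative, the PRM's credit is tracking proximity to the answer line more than measured causal effect, which is the failure described above.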

What does the question-only control reveal that a standard accuracy baseline does not?

A standard accuracy baseline tells you whether the model gets the right answer with CoT. The question-only control tells you whether the model needs the chain at all. If a 7B model solves a GSM8K problem correctly from the prompt alone, the entire reasoning chain may be post-hoc rationalization, and any corruption-test result on that chain is uninformative. This diagnostic predates Garcia’s paper but was never systematically incorporated into corruption-study protocols, which is why the confound went undetected across multiple published benchmarks.
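In code, the control amounts to a filter applied before any corruption test is run. This is a minimal sketch with a hypothetical `Model` interface and item layout; exact-match answer comparison is an assumption.

```python
from typing import Callable, Sequence

# Hypothetical interface: takes a prompt, returns the final answer string.
Model = Callable[[str], str]

def needs_chain(model: Model, question: str, gold: str) -> bool:
    """True when the model fails from the bare question, so the chain may be doing real work."""
    return model(question).strip() != gold

def chain_dependent(model: Model, items: Sequence[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only (question, gold) pairs the model cannot solve without a chain."""
    return [(q, gold) for q, gold in items if needs_chain(model, q, gold)]
```

Corruption results would then be reported only on the `chain_dependent` subset, since on the excluded items a surviving answer says nothing about whether the chain was used.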