groundy
Do Reasoning Tokens Actually Make LLMs Safer? A New Paper Tests It

A June 2026 preprint finds refusal decisions are locked at a model's first token, undercutting the safety case for premium reasoning modes billed per thinking token.

A preprint submitted by Narutatsu Ri on June 23, 2026 (arXiv:2606.25013) finds that across GPT-OSS, Qwen, Olmo, and Phi, a model’s refusal or compliance with a risky prompt is essentially decided at the first generated token. The long “thinking” trace that follows, the one vendors now charge a per-token premium for, does not measurably deliver the safety it is implicitly sold on. The visible reasoning reads as post-hoc narration of a decision the weights had already committed to.

Does more reasoning compute actually buy more safety?

The answer Ri’s paper lands on is a qualified no, in the deliberative sense the framing implies. The intuition under review is the one baked into vendor pricing: that “extended thinking,” “reasoning,” or “deliberation” modes make models safer by letting them work through a risky request before answering. Ri tested that assumption directly across four open-weight families (GPT-OSS, Qwen, Olmo, and Phi), and the work is the first systematic empirical test of the question, per the paper.

It is worth separating two claims vendors fold together. The reasoning-tier marketing is mostly a capability pitch, and on capability it is defensible: reasoning models outperform instruction-tuned baselines on standard benchmarks, which is why the tiers exist. The safety claim is the soft part, the implicit promise that more test-time compute yields more alignment and therefore safer behavior. Ri’s paper scopes itself tightly to the safety and refusal axis and explicitly does not test proprietary frontier models like GPT-5 or Claude Opus. It does not claim thinking tokens are useless in general. It claims that, on the axis it measured, the safety justification does not survive the numbers.

What did the paper actually measure?

Ri measures whether the refusal or compliance decision is already encoded in the model’s representation before reasoning begins. The instrument is a probe trained on the first token’s hidden state, and the answer is yes, decisively. That probe predicts the eventual refusal or compliance outcome at 0.84 to 0.95 AUROC and roughly 88% balanced accuracy, per Ri’s measurements, before any visible thinking has occurred. In other words, a classifier reading the model’s internal state at the first token can already tell you which way the final answer will go. High AUROC at the first token means the refusal or compliance signal is easily separable in the representation before reasoning starts, which is the result that does the heavy lifting.
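To make the two probe metrics concrete, here is a minimal sketch in plain Python. It is not the paper's code. The scores and labels are toy values invented for illustration, and `probe_scores` stands in for whatever a first-token probe would output as its estimated probability of refusal.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random refused example outscores a random complied one.

    Ties count as half a win. labels use 1 = refused, 0 = complied.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(
    scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> float:
    """Mean of recall on the refused class and recall on the complied class."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy data (hypothetical): probe output at the first token vs. final outcome.
probe_scores = [0.92, 0.81, 0.40, 0.77, 0.15, 0.55, 0.08, 0.30]
final_refused = [1, 1, 1, 1, 0, 0, 0, 0]

print(auroc(probe_scores, final_refused))
print(balanced_accuracy(probe_scores, final_refused))
```

AUROC is threshold-free, so it reports how separable the two outcomes are in the probe's scores. Balanced accuracy fixes a threshold and averages per-class recall, which keeps a skewed refuse/comply mix from inflating the number.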

Two further findings tighten the case. The refusal or compliance outcome rarely changes after the first ~20% of the thinking trace, according to the paper, which the authors describe as closer to prefix completion than to deliberative revision. And roughly 74% of text-level “deliberations,” per the paper, occur when the model’s response distribution is already locked to one side, meaning the visible reasoning gives the appearance of deliberation without performing it. The trace narrates a verdict; it does not arrive at one.
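One way to picture the ~20% finding is as a per-trace "lock-in point": the fraction of the trace after which the outcome never changes again. The sketch below assumes you can obtain an outcome label at evenly spaced checkpoints of a trace, for example by truncating the trace and forcing an answer. That procedure is an assumption, not something the article attributes to the paper. The function names and toy traces are hypothetical.

```python
from typing import List, Sequence


def lock_in_fraction(decisions: Sequence[str]) -> float:
    """Fraction of the trace elapsed before the outcome stops changing.

    decisions[i] is the outcome observed at evenly spaced checkpoint i.
    Returns 0.0 when the very first checkpoint already matches the final
    outcome and no later checkpoint disagrees.
    """
    if not decisions:
        raise ValueError("empty trace")
    final = decisions[-1]
    stable_from = 0
    for i, outcome in enumerate(decisions):
        if outcome != final:
            stable_from = i + 1  # stability can only start after this point
    return stable_from / len(decisions)


def share_locked_by(traces: List[Sequence[str]], cutoff: float = 0.2) -> float:
    """Share of traces whose outcome is stable by the given cutoff."""
    locked = sum(1 for t in traces if lock_in_fraction(t) <= cutoff)
    return locked / len(traces)


# Toy traces (hypothetical), ten checkpoints each.
traces = [
    ["refuse"] * 10,                  # never changes
    ["comply"] + ["refuse"] * 9,      # flips once, early
    ["comply"] * 10,                  # never changes
    ["comply"] * 6 + ["refuse"] * 4,  # genuine late revision
]

print([lock_in_fraction(t) for t in traces])
print(share_locked_by(traces, cutoff=0.2))
```

A population where most traces sit at or below the cutoff looks like prefix completion. A population with many late lock-in points would look like deliberative revision.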

The prefix-completion framing matters because it tells you what the thinking trace is doing mechanistically. A model in prefix-completion mode is extending a trajectory the early tokens already implied, not sampling competing hypotheses and selecting among them. That is the difference between reasoning and rationalization, and the paper’s evidence places these traces closer to the second.

Read together, the three numbers describe a model that, on safety-relevant prompts, has largely made its decision before it starts “thinking” in the open. The probe tells you the decision is encoded early. The ~20% threshold tells you the decision rarely flips after that. The 74% figure tells you most of the visible reasoning is spent justifying a locked distribution rather than weighing alternatives. None of that says the model is unsafe. It says the deliberation those tiers are sold on is, on this evidence, mostly cosmetic.

Why do safety interventions backfire into over-refusal?

Safety interventions backfire into over-refusal because they push on the reasoning trace, and the paper shows the reasoning trace is not where the refusal or compliance decision lives. Existing inference-time and training-based safety interventions, although designed to induce deliberation, largely shift model behavior toward over-refusal while suppressing the already-scarce deliberation signals, according to the paper. The interventions we have do not produce more genuine deliberation. They produce a model that refuses more often, including on requests it should have answered.

The mechanism is consistent with the probe result, and it is the part deployers should internalize. If the refusal or compliance decision is locked at the first token, then a safety intervention that tries to move that decision by acting on the reasoning trace is acting on the wrong layer of the model. It cannot revise a decision the model has already committed to, so it overcorrects upstream and the model refuses benign requests. That is the over-refusal tax you pay when you treat inference-time deliberation as the safety lever.

The practical symptom is familiar to anyone running a safety-tuned model in production: refusals on legitimate requests that a less aggressively tuned model would have handled. Product teams usually attribute that to overzealous RLHF or a blunt refusal classifier. Ri’s paper offers a structural explanation instead. The interventions push on a layer that does not hold the decision, so they cannot produce the deliberation they were designed for, and the only behavior they can reliably move is the refusal rate.
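The over-refusal pattern can be tracked with two rates measured before and after an intervention: refusals on harmful prompts, and refusals on benign ones. The following is a rough sketch under stated assumptions. The `RefusalStats` type, the labeled prompt sets it implies, and the counts are all hypothetical and not drawn from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RefusalStats:
    """Refusal counts on labeled harmful and benign prompt sets."""
    harmful_refused: int
    harmful_total: int
    benign_refused: int
    benign_total: int

    @property
    def harmful_refusal_rate(self) -> float:
        return self.harmful_refused / self.harmful_total

    @property
    def over_refusal_rate(self) -> float:
        # Refusals on prompts the model should have answered.
        return self.benign_refused / self.benign_total


def intervention_shift(before: RefusalStats, after: RefusalStats) -> Dict[str, float]:
    """Change in each rate attributable to the intervention."""
    return {
        "harmful_refusal_delta": after.harmful_refusal_rate - before.harmful_refusal_rate,
        "over_refusal_delta": after.over_refusal_rate - before.over_refusal_rate,
    }


# Toy counts (hypothetical), eight prompts per set.
before = RefusalStats(harmful_refused=6, harmful_total=8, benign_refused=1, benign_total=8)
after = RefusalStats(harmful_refused=7, harmful_total=8, benign_refused=4, benign_total=8)

print(intervention_shift(before, after))
```

If the benign-side delta grows much faster than the harmful-side delta, the intervention is mostly moving the refusal rate rather than improving discrimination between the two sets.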

What does this cost for teams paying per reasoning token?

For teams paying per reasoning token, the cost is a premium collected against a safety margin the paper suggests is not located in the trace. Vendors now sell reasoning or thinking modes at a per-token premium, and a non-trivial share of that premium is justified, implicitly, on safety grounds: the model “thinks it through” before it answers, so it is safer to deploy. Ri’s numbers do not support that justification on the axis the paper measured. The decision the thinking is supposed to deliberate on was already encoded at the first token.

There is a clean eval implication. If you want to know whether your reasoning budget is buying safety, the test is not whether the model produces longer, more careful-looking traces. It is whether the refusal or compliance decisions change as a function of the budget. Ri’s result predicts they will not change much, because the decision was locked at the first token. Teams running reasoning tiers in production should measure that delta directly before assuming the premium is doing safety work.
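A budget sweep of this kind can be sketched in a few lines. Everything model-facing here is a placeholder: `is_refusal` stands in for your own model call plus refusal classifier, `fake_is_refusal` is a stub that deliberately ignores the budget, and the budget values and prompts are illustrative only.

```python
from typing import Callable, Dict, List, Sequence

RefusalFn = Callable[[str, int], bool]


def refusal_rate_by_budget(
    prompts: Sequence[str], budgets: Sequence[int], is_refusal: RefusalFn
) -> Dict[int, float]:
    """Refusal rate on a fixed prompt set at each reasoning budget."""
    return {
        b: sum(1 for p in prompts if is_refusal(p, b)) / len(prompts)
        for b in budgets
    }


def flipped_prompts(
    prompts: Sequence[str], budgets: Sequence[int], is_refusal: RefusalFn
) -> List[str]:
    """Prompts whose decision differs between the smallest and largest budget."""
    low, high = min(budgets), max(budgets)
    return [p for p in prompts if is_refusal(p, low) != is_refusal(p, high)]


def fake_is_refusal(prompt: str, budget: int) -> bool:
    # Hypothetical stub: the decision depends only on the prompt,
    # mimicking a model whose verdict is locked before it reasons.
    return "bypass" in prompt


prompts = [
    "how do I bypass a paywall",
    "how do I bypass a door lock",
    "summarize this article",
    "explain TLS handshakes",
]
budgets = [1024, 8192, 65536]

rates = refusal_rate_by_budget(prompts, budgets, fake_is_refusal)
delta = max(rates.values()) - min(rates.values())
flips = flipped_prompts(prompts, budgets, fake_is_refusal)
print(rates, delta, flips)
```

With a real model, sampling is stochastic, so each prompt and budget pair would likely need several samples before a small delta can be distinguished from noise.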

That does not render thinking tokens useless, and the paper does not claim it does. The premium buys capability on hard tasks, and that is a real return. What it does not buy, on this evidence, is deliberative safety. The clean move is to justify the reasoning spend on capability grounds, where it holds up, and stop justifying it on safety grounds, where it does not. Conflating the two is how teams end up paying a per-token premium for a safety control that is not located in the trace they are paying for.

What does this mean for alignment strategy?

The strategic read is that the burden shifts back to training-time alignment. If inference-time deliberation is not a reliable safety control, then the safety margin has to be built into the weights and the output filters rather than purchased at inference. Adversarial fine-tuning, refusal training at the data level, and output-side filters all act on the layers where the refusal or compliance decision is actually encoded. Reasoning budget, on this evidence, does not.

The structural reason is that training-time interventions act before the first-token representation is fixed. They change what gets encoded. Inference-time interventions act after, on a trace the paper shows is largely narrating an already-locked decision. That asymmetry is why the same safety budget produces different outcomes depending on where in the stack you spend it.

There is a second-order consequence worth naming. If thinking does not deliver safety deliberation, then turning up the thinking budget for safety is a misallocation, and the chain-of-thought faithfulness literature suggests it can actively backfire. A related study (arXiv:2503.08679) found implicit post-hoc rationalization rates of 13.49% for GPT-4o-mini and 7.42% for Haiku 3.5, and even the most faithful thinking model tested (Sonnet 3.7 with thinking, at 0.04%) did not reach zero. The same study found that raising Claude 3.7 Sonnet’s thinking budget from 1,024 to 64,000 tokens slightly increased unfaithful reasoning. More reasoning compute, in that result, worsened faithfulness rather than improving it.

That study measures a different thing, faithfulness of chain-of-thought rather than safety refusal, but it rhymes with Ri’s finding. More tokens spent on visible reasoning do not reliably produce more honest or more deliberative behavior. The convergence is suggestive, not conclusive. Taken together, though, the two papers undercut the same intuition: that longer reasoning traces are a dependable route to better-behaved models.

The broader research interest reinforces the direction. A separate June 2026 preprint (arXiv:2606.24370), “When Helpfulness Overrides Causal Caution: Context-Dependent Suppression and Recovery in LLMs,” examines the helpfulness-versus-caution axis and how reasoning models trade safety for compliance. It is a different cut at a related question, and it indicates the field is actively probing the assumption Ri’s paper punctures: that test-time compute is where alignment happens. If that assumption keeps failing across independent groups, the strategic conclusion hardens. Buy reasoning for what it does, which is capability. Build safety where it lives, which is the weights.

Frequently Asked Questions

Does this finding apply to proprietary models like GPT-5 or Claude Opus?

The paper explicitly excludes them. It tests only open-weight families (GPT-OSS, Qwen, Olmo, Phi) where the authors can read hidden states. The probe methodology requires white-box access to internal activations, so it cannot be run against any closed-weight API. Whether the first-token lock-in holds on frontier proprietary models is an open empirical question this work does not answer.

How is this different from chain-of-thought faithfulness research?

Faithfulness work asks whether the visible trace honestly reports the model’s internal computation; Ri’s probe asks whether the decision was ever up for grabs. arXiv:2503.08679 measures unfaithful narration of genuine deliberation. Ri measures the absence of deliberation itself, a stronger claim. A model could pass a faithfulness test and still fail Ri’s probe if its locked-in verdict happens to match its trace.

What eval should a team run instead of trusting longer reasoning traces?

Sweep the reasoning budget across a fixed risky-prompt set and measure the refusal/compliance delta. If the decision barely moves as the budget grows from roughly 1,000 to 64,000 tokens, the premium is not buying safety work, and Ri’s first-token result predicts the delta will be small. Teams with model weights can go further and train the probe directly to audit when decisions get locked.

Does a high first-token AUROC mean the model is unsafe?

No. AUROC here is a predictability score, not a safety-failure metric. A 0.95 value means the refusal/compliance decision is separable in the first-token representation, which says the decision is made early, not that it is wrong. The paper’s claim is that visible deliberation is post-hoc, not that the models refuse incorrectly. Reading it as a safety indictment overstates what the probe measures.

What would force a rethink of this conclusion?

A result showing the refusal/compliance distribution flips after the ~20% threshold on frontier models would reopen the case, as would evidence that proprietary reasoning models revise decisions deep in the trace. The current evidence converges from two directions, Ri’s first-token probe and the CoT faithfulness literature, but neither covers closed-weight frontier systems. Until that gap closes, the strategic read holds.

sources · 3 cited

  1. Do Thinking Tokens Help with Safety? (arXiv:2606.25013). arxiv.org, primary source, accessed 2026-06-25.