LLM Reasoning Traces Leak the Private Data They're Told to Hide

Reasoning models write sensitive data into their chain-of-thought traces even when they omit it from the final answer, creating a privacy gap that output-level safety training cannot address.

What “Leaky Thoughts” Are and Why Output-Level Safety Training Misses Them

Large reasoning models produce an intermediate chain-of-thought before generating their final answer. That reasoning trace (RT) is supposed to be a private scratchpad: the model works through the problem, then delivers a clean response. The paper arXiv:2602.24210 by Haritz Puerto, updated May 29, 2026, shows that the scratchpad is not private at all. The RT frequently contains sensitive user information the model was explicitly instructed to suppress. The paper calls these “leaky thoughts.” The final answer may correctly omit the data; the reasoning trace does not. Any system that logs the trace, surfaces it, or can be coerced into returning it exposes information that output-level safety training was never designed to protect.

The distinction matters. Current safety and alignment work targets what the model says aloud: the completed response. Reasoning tokens sit in a different layer. They are generated by the same model, under the same weights, but they are not subject to the same output filters. The model was never trained to treat its own intermediate reasoning as something an adversary might read.

The Attack Surface: Prompt Injection and Trace Exfiltration

The leak is not a passive logging concern. arXiv:2602.24210 demonstrates that reasoning traces can be exfiltrated through prompt injection attacks: a malicious input prompts the model to reproduce or paraphrase the contents of its RT in the final answer. No attack is needed at all when an application displays the trace to the user, as several commercial reasoning models do by default as of early 2026; the application simply renders the leak visible.

The practical consequence: if you assume that redacting the final answer is sufficient, you are assuming the adversary never sees the trace. That assumption is false in any system where the trace is stored, transmitted, or displayed.

Staged Decoding: Separate LoRA Adapters for Reasoning vs. Answer

Puerto’s proposed fix is architectural. Rather than relying on the model to self-censor during reasoning, arXiv:2602.24210 introduces Staged Decoding, which splits generation into two components: RT generation and answer generation. Each component gets its own LoRA adapter, separately fine-tuned for instruction-following (IF) within its scope.

The approach is paired with a supervised fine-tuning (SFT) dataset designed to teach the model to follow privacy directives throughout its reasoning process, not just at the point where it formulates its final response. The authors reframe the problem as a controllability issue: privacy instructions are instructions, so improving IF inside the RT directly reduces leaks.

This is a structural intervention rather than a prompt-engineering one. A single model generating both the trace and the answer under shared weights has no architectural reason to treat them differently. Decoupling them creates the option to optimize each for different constraints: the trace adapter for compliance, the answer adapter for utility.
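The two-phase control flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `StagedDecoder` class, the `GenerateFn` type, and the `<think>` delimiters are all hypothetical, and the two toy functions stand in for a base model running under a trace adapter and an answer adapter respectively.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: maps a prompt string to generated text under one adapter.
# In a real system each function would wrap a model call with a LoRA adapter active.
GenerateFn = Callable[[str], str]


@dataclass
class StagedDecoder:
    """Two-phase generation: one adapter for the trace, another for the answer."""

    generate_trace: GenerateFn   # stands in for base model + trace adapter
    generate_answer: GenerateFn  # stands in for base model + answer adapter

    def run(self, prompt: str) -> dict:
        # Phase 1: produce the reasoning trace with the trace adapter.
        trace = self.generate_trace(prompt)
        # Phase 2: switch adapters and produce the answer, conditioned on the
        # prompt plus the trace. The <think> delimiters are an assumption.
        answer = self.generate_answer(prompt + "\n<think>" + trace + "</think>\n")
        return {"trace": trace, "answer": answer}


# Toy stand-ins so the sketch runs without a model.
def toy_trace(prompt: str) -> str:
    return "The user wants a summary; attendee names are out of scope."


def toy_answer(prompt_with_trace: str) -> str:
    return "Here is the summary you asked for."


decoder = StagedDecoder(generate_trace=toy_trace, generate_answer=toy_answer)
result = decoder.run("Summarize the meeting notes. Do not mention attendee names.")
print(result["answer"])
```

Keeping the two phases behind separate callables is what lets each one be tuned, swapped, or filtered independently.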

Privacy Benchmarks: Up to 51.9 Percentage Points of Improvement

Evaluated across six models from two families (1.7B to 14B parameters), two instruction-following benchmarks, and two privacy benchmarks, Staged Decoding produces improvements of up to 20.9 points on IF metrics and up to 51.9 percentage points on privacy benchmarks, according to the results reported in arXiv:2602.24210.

Those are the headline figures from the best conditions. The “up to” qualifier is doing real work here. The paper evaluates multiple model sizes and multiple benchmark configurations, and the gains are not uniform across all conditions. The 51.9pp privacy improvement and the 20.9 IF-point gain represent the strongest results; smaller models and different benchmark pairings show more modest improvements.
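Because the headline numbers are stated in percentage points, it helps to keep absolute and relative change apart. The sketch below uses hypothetical scores that are not taken from the paper; the function names are illustrative.

```python
def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting value, expressed as a percent."""
    return (after_pct - before_pct) / before_pct * 100.0


# Hypothetical privacy scores, NOT figures from arXiv:2602.24210.
before, after = 30.0, 60.0
print(percentage_point_gain(before, after))  # 30.0 percentage points
print(relative_gain(before, after))          # 100.0 percent relative improvement
```

The same percentage-point gain can correspond to very different relative improvements depending on the baseline, which is one reason an "up to" figure says little about any single configuration.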

The IF/Reasoning Trade-off

The privacy gains come at a cost. arXiv:2602.24210 explicitly notes that task utility can decrease when stronger instruction-following is enforced within the reasoning trace. The mechanism is straightforward: unconstrained reasoning allows the model to work through a problem freely, even if that means reasoning about data it was told to ignore. Constraining the trace to comply with privacy directives narrows the space of reasoning paths available to the model.

This is the core tension. Privacy compliance inside the trace competes with reasoning performance. The paper does not resolve the trade-off; it measures it. Deployment teams will need to decide, for each use case, how much reasoning accuracy they are willing to sacrifice for trace-level privacy. There is no free parameter tweak that delivers both simultaneously.

Complementary Evidence: Indirect Attacks Bypass Reasoning Models

A separate study, CoPriva (arXiv:2505.15805), tested ten state-of-the-art LLMs, including the reasoning-capable models QwQ-32B, DeepSeek-R1, and o4-mini, against indirect-query attacks designed to extract private information. Indirect attacks raised leak rates by 40 or more percentage points over direct attacks, according to a separate analysis of the benchmark results (source 3). Reasoning-capable models showed no statistically significant advantage over non-reasoning models at resisting indirect exfiltration.

This finding is from a different paper, a different experimental setup, and a different benchmark suite than arXiv:2602.24210. The two results are complementary but not directly comparable. What they converge on is the same structural observation: the reasoning trace is an exposed channel, and the model’s ability to “think carefully” does not translate into an ability to keep secrets during that thinking.

What Infrastructure Teams Should Do Now

Until model-level solutions like Staged Decoding are integrated into shipping models, the burden falls on serving infrastructure.

First, treat the reasoning trace as an untrusted output. Any pipeline that logs, caches, or forwards the trace needs to apply the same redaction and filtering that would be applied to the final answer. This is not the default behavior in most serving frameworks.
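A minimal sketch of that first recommendation, assuming a simple regex-based redactor: the patterns, placeholder labels, and the `build_log_record` helper are illustrative only, and a production system would need a vetted PII detector rather than two regular expressions.

```python
import re

# Illustrative patterns only; these are assumptions, not a complete PII filter.
REDACTION_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for pattern, placeholder in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def build_log_record(trace: str, answer: str) -> dict:
    # The trace passes through exactly the same filter as the answer.
    return {"trace": redact(trace), "answer": redact(answer)}


record = build_log_record(
    trace="User email is jane@example.com, SSN 123-45-6789; omit both.",
    answer="Your request has been processed.",
)
print(record["trace"])  # User email is [EMAIL], SSN [SSN]; omit both.
```

The point of the design is that no code path writes the raw trace to storage: redaction happens at the single place where the log record is built.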

Second, restrict trace visibility. APIs that return reasoning tokens to the caller should default to omitting them, with an explicit opt-in that documents the privacy implications. Consumer-facing applications that display chain-of-thought to end users should assume the trace may contain information the user did not consent to share with whoever is reading the conversation.
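One way to express the omit-by-default behavior is a response builder where the trace is only attached on explicit request. The `ModelOutput` type, the `build_response` function, and the `include_trace` flag are hypothetical names for this sketch and do not correspond to any particular serving framework.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    trace: str
    answer: str


def build_response(output: ModelOutput, include_trace: bool = False) -> dict:
    """Return the answer; attach the trace only when the caller opts in."""
    response = {"answer": output.answer}
    if include_trace:
        # Opt-in path: the caller has explicitly asked for reasoning tokens.
        response["trace"] = output.trace
    return response


output = ModelOutput(trace="internal reasoning", answer="final answer")
print(build_response(output))                      # {'answer': 'final answer'}
print(build_response(output, include_trace=True))
```

Making the default `False` means a caller that never thinks about the trace never receives it.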

Third, evaluate prompt-injection defenses specifically for trace exfiltration. Standard input sanitization targets the final answer; the trace may be readable through injection patterns that do not affect the answer at all. arXiv:2602.24210 demonstrates this gap explicitly.
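A simple way to start such an evaluation is a canary test: plant a unique string in the protected context, run the injection prompt, and check each output channel separately. The sketch below assumes the trace and answer are already available as strings; `find_canary_leaks` and the canary value are hypothetical, and exact string matching will miss paraphrased leaks.

```python
def find_canary_leaks(canary: str, trace: str, answer: str) -> dict:
    """Report which output channels contain the planted canary string."""
    return {
        "trace_leak": canary in trace,
        "answer_leak": canary in answer,
    }


# Hypothetical outputs from a single test run.
canary = "CANARY-7f3a"
report = find_canary_leaks(
    canary,
    trace="The record includes CANARY-7f3a, which I must not reveal.",
    answer="I can't share that detail.",
)
print(report)  # {'trace_leak': True, 'answer_leak': False}
```

A run like this one, where the answer is clean but the trace is not, is exactly the case that answer-only checks would mark as a pass.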

The research makes one thing clear: output-level redaction is insufficient for reasoning models. The trace is a separate output channel, and it requires its own security treatment. The models will not police it themselves.

Frequently Asked Questions

Do these results apply to closed frontier models like GPT-5 or o3?

Staged Decoding's gains have not been measured on closed frontier models. The 51.9pp privacy improvement was measured on models no larger than 14B parameters from two open families. However, CoPriva’s separate evaluation of QwQ-32B and DeepSeek-R1 (well above 14B) found that reasoning-capable models at larger scales still leaked private data at high rates under indirect prompts, with no advantage over non-reasoning models. The leak phenomenon appears to transfer to larger architectures even though Staged Decoding’s specific gains have not been validated there.

What does deploying Staged Decoding require in practice?

The approach applies two separate LoRA adapters to the same base model, one for trace generation and one for answer generation. Serving infrastructure must handle adapter switching between the reasoning and answer phases of a single request. The authors released their supervised fine-tuning dataset and code publicly, so teams can retrain adapters for domain-specific privacy directives without building training data from scratch.

Can indirect-query attacks bypass a trace adapter trained for compliance?

Puerto’s benchmarks measure direct instruction-following and privacy adherence. CoPriva’s parallel finding that indirect-query attacks raise leak rates by 40+ percentage points across all ten tested models suggests a trace adapter optimized for direct compliance may not hold under adversarial phrasing. The two studies target different threat surfaces: Puerto tests whether the model obeys explicit privacy instructions; CoPriva tests whether obfuscated extraction prompts can pull the same data out anyway.

How much reasoning accuracy do teams sacrifice for the privacy improvement?

The paper reports the trade-off qualitatively but does not publish a single accuracy-cost figure. The 51.9pp privacy gain and 20.9 IF-point gain come from the strongest configurations tested. Smaller models in the 1.7B range showed more modest improvements, so teams deploying small models should not assume the headline gains carry over, and should measure both privacy and task accuracy on their own workloads.

Has Staged Decoding been independently replicated?

The paper was first submitted to arXiv on February 27, 2026 and received its v2 update on May 29, 2026. Results come from a single group evaluating models between 1.7B and 14B parameters. No independent replication has been published as of early June 2026. The publicly released code lowers the barrier for third-party validation, but until that happens the headline gains remain preliminary.

Sources

  1. From Leaky Thoughts to Private Reasoning (arXiv:2602.24210). Primary source, accessed 2026-06-02.
  2. CoPriva: LLM Security Policy Preservation Against Indirect Attacks (arXiv:2505.15805). Primary source, accessed 2026-06-02.
  3. CoPriva Benchmark Analysis: Indirect Attacks Raise LLM Leak Rates by 40+ Points. Analysis, accessed 2026-06-02.