groundy
ethics, policy & safety

When an LLM Narrates a Solver, the Explanation Drifts From the Math

A June 2026 arXiv paper isolates the narration gap in LLM-solver loops: prompt injection can invert a verified verdict at the prose stage, breaking reasoning-log audits.


A paper posted to arXiv on 17 June 2026 (arXiv:2606.19588) isolates a flaw in LLM-solver agent loops that most production designs assume away. The verdict a SAT or SMT solver returns can be formally sound, and the prose the LLM hands the user can still be wrong. The authors call the gap narration and conclude that robustness does not reach the answer the user finally reads.

What is the LLM-solver loop, and where does soundness leak?

The paper models the loop as three stages: formalize the question, decide it with a SAT or SMT solver, and narrate the result. It locates the soundness loss in the third.

A SAT or SMT solver is the opposite of a neural network in one important respect. Its answer can come with a certificate the caller checks independently (a satisfying assignment for a sat verdict, a proof for an unsat one), so it is a sound, verifiable verdict rather than a sample from a distribution. Chain-of-thought reasoning, by contrast, is generated token by token from the model’s learned distribution and carries no formal guarantee (Wikipedia, “Large language model”). That asymmetry is the premise of neuro-symbolic agent designs: attach a trustworthy backend to a fluent front end and inherit the backend’s soundness.
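To make "a certificate the caller can check independently" concrete, here is a minimal sketch of checking a claimed satisfying assignment against a formula in conjunctive normal form. The clause encoding (DIMACS-style signed integers) and the function name `check_assignment` are assumptions chosen for illustration, not anything taken from the paper. Unsat verdicts need a proof checker instead, which this sketch does not cover.

```python
from typing import Dict, List

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def check_assignment(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Return True if every clause contains at least one literal made true.

    A variable missing from the assignment never satisfies a literal.
    """
    for clause in clauses:
        if not any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            return False
    return True


# (x1 OR NOT x2) AND (x2 OR x3)
formula = [[1, -2], [2, 3]]
claimed = {1: True, 2: False, 3: True}

print(check_assignment(formula, claimed))
```

The check needs only the formula and the assignment. It does not need to trust, or even know, which solver produced the assignment, which is what makes the verdict verifiable in a way a sampled sentence is not.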

The contribution is to show where that inheritance breaks. Stages one and two can be tight. The solver returns a verdict the system can verify. Stage three is where the guarantee evaporates, because the verdict now has to pass through the language model to reach the user, and the authors identify narration as the previously unstudied stage where soundness is lost.

What does ‘narration’ mean when a solver is doing the math?

Narration is the step that converts a solver’s formal output into the user-facing answer. It is the only stage of the loop the model performs alone, with no certificate attached, and it is the stage prior work on LLM-solver hybrids has not studied directly.

Mechanically, the solver returns something compact and structured: a satisfying assignment, an unsat core, a model. The user asked a question in English. Something has to bridge the two, and in current designs that something is an LLM summarising the solver output. The paper’s central claim is that this translation is not neutral. It is a decision procedure in its own right, performed by a system whose behaviour is shaped by fine-tuning and sampling, and it is the place where a sound backend meets an unsound sentence.

This reframes what “the answer” even is in an LLM-solver loop. There are two of them: the one the solver computed and the one the model wrote down. The paper treats only the second as the user-facing answer, then asks whether the first guarantees the second. Its finding is that it does not.

Does certificate gating make the answer trustworthy?

No, not all the way through. Certificate gating keeps the solver verdict sound, but the authors report that an adversary can still invert a verified conclusion across phrasings and channels.

The gating stage is the part of the loop that checks the solver’s certificate before the result is allowed to proceed. Done well, it makes the verdict itself hard to fake; the model cannot simply assert the solver said something it did not, because the certificate will not match. This is why the paper still treats the solver verdict as sound even under attack.
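A gate of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: `SolverResult`, `CertificateMismatch`, `satisfies`, and `gate` are hypothetical names, the formula uses DIMACS-style signed integers, and only sat verdicts are handled.

```python
from dataclasses import dataclass
from typing import Dict, List

Clause = List[int]  # DIMACS-style: 2 means x2, -2 means NOT x2


@dataclass(frozen=True)
class SolverResult:
    verdict: str                 # this sketch handles "sat" only
    assignment: Dict[int, bool]  # the certificate for a "sat" verdict


class CertificateMismatch(Exception):
    """Raised when a claimed verdict is not backed by its certificate."""


def satisfies(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    # Every clause must contain at least one literal the assignment makes true.
    return all(
        any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)
        for clause in clauses
    )


def gate(clauses: List[Clause], result: SolverResult) -> SolverResult:
    """Pass the result downstream only if its certificate checks out."""
    if result.verdict != "sat":
        raise NotImplementedError("unsat proofs need a separate checker")
    if not satisfies(clauses, result.assignment):
        raise CertificateMismatch("assignment does not satisfy the formula")
    return result


formula = [[1, -2], [2, 3]]
honest = SolverResult("sat", {1: True, 2: False, 3: True})
forged = SolverResult("sat", {1: False, 2: True, 3: False})

gate(formula, honest)  # accepted: the certificate matches the formula
try:
    gate(formula, forged)
except CertificateMismatch as err:
    print("rejected:", err)
```

A claimed "sat" with an assignment that does not satisfy the formula is rejected before anything downstream sees it, which is the sense in which the verdict itself is hard to fake.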

The problem is what happens after the gate. The verified verdict has to become a sentence, and the sentence is where the soundness runs out. Two answers that differ only in phrasing can both be consistent with a certificate while communicating opposite conclusions to a reader. The certificate certifies the solver’s output. It does not certify the narration’s fidelity to that output.
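The point can be shown in a few lines. In the sketch below, which uses hypothetical names (`UserFacingAnswer`, `certificate_holds`, `build_answer`) and a toy formula, the gate's inputs are the formula and the certificate only. The narration string is carried along but never inspected, so two narrations with opposite meanings pass the same gate.

```python
from dataclasses import dataclass
from typing import Dict, List

Clause = List[int]  # DIMACS-style: 1 means x1, -1 means NOT x1


@dataclass(frozen=True)
class UserFacingAnswer:
    verdict: str       # what the solver decided
    gate_passed: bool  # outcome of checking the certificate
    narration: str     # prose written by the model; nothing below checks it


def certificate_holds(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    return all(
        any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)
        for clause in clauses
    )


def build_answer(
    clauses: List[Clause], assignment: Dict[int, bool], narration: str
) -> UserFacingAnswer:
    # The gate looks only at the formula and the certificate.
    ok = certificate_holds(clauses, assignment)
    return UserFacingAnswer(verdict="sat", gate_passed=ok, narration=narration)


formula = [[1, -2], [2, 3]]
certificate = {1: True, 2: False, 3: True}

faithful = build_answer(formula, certificate, "The constraints can all be met.")
inverted = build_answer(formula, certificate, "The constraints cannot all be met.")

# Both answers carry a verified verdict, yet they tell the reader opposite things.
print(faithful.gate_passed, inverted.gate_passed)
```

Nothing in `build_answer` could reject the second narration, because the certificate says nothing about the sentence.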

How does prompt injection invert a verified conclusion?

Under prompt injection, the paper evaluates five open-sourced models and finds the adversary does not need to defeat the solver at all. The attack surface is the narration stage, reachable through how the verdict is phrased and through which channel delivers it.

This is the result that should concern anyone shipping a solver-backed agent into a setting where inputs are untrusted. The backend that was supposed to be the trustworthy component is intact. The solver returned exactly what it should have returned. The certificate checked out. The model then narrated that verdict into a sentence that inverts its meaning, because the adversarial input shaped the narration rather than the verdict. The formal guarantee protected the wrong layer.

The phrase “across phrasings and channels” is doing real work in the finding. It is not that one particular sentence was vulnerable. The inversion survives paraphrase, and it survives being routed through different output paths. Hardening one wording or one channel is not enough, because the adversary is exploiting the model’s tendency to narrate a fixed verdict in a way the input can bend.

Why don’t hardened prompts close the gap?

A hardened-prompt mitigation reduces injection significantly, the paper reports, but cannot eliminate it and still suffers under adaptive attack. The authors’ headline conclusion is that robustness does not reach the answer the user finally reads.

Hardening the prompt is not useless. The paper measures a demonstrable reduction in the attack rate. What it also measures is a residual that survives hardening, and a further residual that survives once the adversary is allowed to adapt. The authors treat that residual as load-bearing rather than as a rounding error, because the entire point of pairing an LLM with a solver was to obtain a guarantee that prompt-level defences cannot provide on their own. If the narration stage reopens the door the solver was meant to close, the architecture has traded a probabilistic defence for a probabilistic defence plus a formal component whose guarantee stops one layer short of the user.

The design implication the paper draws is uncomfortable. You cannot harden-prompt your way out of the narration gap, because the gap is structural to having a language model translate a verdict into prose. Hardening reduces the rate. It does not change the fact that the sentence the user reads is produced by a model and certified by nothing.

What does this cost compliance, debugging, and agent oversight?

Any workflow that treats the LLM’s running narration as an audit trail inherits the narration gap. Compliance reviews, debugging sessions, and oversight logs all rest on the assumption that the prose faithfully represents what the backend decided.

This lands hardest on agent systems that pair an LLM with a symbolic or numeric backend precisely because the backend is supposed to be the checkable part. If the audit log is the model’s narration of solver outputs, and the narration can invert a verified conclusion, then the log is not an audit of the solver. It is an audit of the model’s summary of the solver, which is a different and weaker artefact. A reviewer reading that log has no way, from the log alone, to tell whether the sentence in front of them is the verdict the solver returned or its inverse.

A parallel paper posted the same month supplies the architectural context. Decoupled Search Grounding (arXiv:2606.18947) argues that real-time grounding in production LLM agents should be an inspectable interface boundary moved outside the reasoning model, through an MCP-compatible gateway. That is the class of agent-system design whose prose-facing stage the narration-gap finding complicates: if the grounding interface is the thing you audit, the narration that summarises it is where the audit can lie.

For teams shipping solver-backed agents, the paper sets a floor on what such a system can honestly promise. The backend is sound. The narration is not. Treating the model’s prose as untrusted by default is the conservative reading of the result; the guarantee, as the authors frame it, stops one sentence short of the reader.

Frequently Asked Questions

Did the evaluation include closed-source models, or only open-sourced ones?

Only open-sourced models were evaluated. The five can be inspected and run locally, so the reported inversion results come from systems that can be reproduced end to end rather than probed through a black-box API. Closed proprietary systems were not tested, so the gap’s size on those remains unmeasured, though the structural claim holds for any architecture with a narration stage.

What should an oversight log capture to avoid inheriting the narration gap?

The log should record the solver’s raw output and certificate, not the model’s prose summary of them. A log that stores only the narration is auditing the summariser rather than the solver, and a reviewer cannot tell from it whether the sentence matches the verdict or inverts it. Capturing the formal artefact alongside the prompt and input that produced the narration is what makes the backend’s verifiability usable in review.
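One possible shape for such a record is sketched below. The field names, the `AuditRecord` class, and the `to_log_line` helper are all assumptions made for illustration; the paper does not prescribe a log format. The SHA-256 digest is an extra, hypothetical integrity check against accidental corruption and is not a tamper-proof seal.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    solver_input: str      # the formal problem handed to the solver
    solver_output: str     # the raw verdict, e.g. "sat" or "unsat"
    certificate: str       # the artefact a reviewer can re-check
    narration_prompt: str  # the prompt and input that produced the prose
    narration: str         # the model's prose, stored as untrusted text


def to_log_line(record: AuditRecord) -> str:
    """Serialise the record as one JSON line with a digest of its contents."""
    payload = asdict(record)
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"record": payload, "sha256": digest}, sort_keys=True)


record = AuditRecord(
    solver_input="p cnf 3 2\n1 -2 0\n2 3 0\n",
    solver_output="sat",
    certificate="1 -2 3",
    narration_prompt="Summarise the solver verdict for the user.",
    narration="The constraints can all be met.",
)
line = to_log_line(record)
print(line)
```

With the formal artefact stored next to the narration, a reviewer can re-run the certificate check and compare the result against the sentence, rather than taking the sentence on trust.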

Does moving the grounding interface outside the model, as the DSG paper proposes, close the gap?

It relocates the problem rather than removing it. An MCP-compatible gateway makes the grounding interface inspectable, which helps auditability, but some component still has to translate the structured solver output into the sentence the user reads, and that translation carries no certificate. The narration gap migrates to whichever boundary performs that last-mile summarisation.

Is this framed as an interpretability finding or a security finding?

It is filed as a security finding. The arXiv listing classifies the paper under cs.CR (Cryptography and Security) alongside cs.AI and cs.LO, a bucket unfaithful-chain-of-thought work rarely lands in. The structure follows a threat model, with an adaptive adversary, prompt injection as the vector, and residual attack rates under hardening, which positions narration as an attack surface rather than a faithfulness curiosity.

Does the narration gap apply to a pure-LLM agent with no solver backend?

No, and the narrowness is the point. The gap presupposes a formally sound verdict whose guarantee can fail to carry through, so without a solver in stage two there is no guarantee to invert, and the unfaithful-CoT literature already covers that setting. The finding only bites once a team has bolted a verification layer onto the loop expecting it to guarantee the user-facing answer.

sources · 3 cited

  1. “Large language model.” Wikipedia, en.wikipedia.org. Accessed 2026-06-21.