Cascading Hallucination in Agentic RAG: When One Bad Retrieval Poisons the Chain

Most RAG hallucination detectors check the final answer against retrieved context. That works for single-hop queries. For multi-step agentic pipelines, where one bad retrieval feeds into the next reasoning step and the next, CHARM (arXiv:2606.04435), short for Cascading Hallucination Aware Resolution and Mitigation, shows that approach catches fewer than one in five cascaded errors. The single-author preprint from Saroj Mishra, submitted June 3, 2026, formally defines a failure mode where a fabricated retrieval at hop one becomes established fact by hop three, and per-step grounding checks never notice.

Why per-step grounding is not enough

The standard playbook for RAG reliability runs something like this: retrieve chunks, check relevance, generate an answer, then run a fact-checking pass on the output. Tools like SelfCheckGPT and RAGAS operationalize this by comparing the final response against the retrieved context or by sampling multiple outputs and measuring consistency. The assumption is that errors are local: if each step is grounded, the chain is sound.

That assumption holds for single-hop retrieval. It breaks when an agent reasons across multiple retrieval steps, because a fabrication at step i becomes part of the context window for step i+1. The downstream step retrieves additional information, reasons over the combined context (which now contains a fabricated claim), and produces output that is internally consistent but factually contaminated. Standard detectors see a well-grounded step; they do not see that the ground shifted underneath it.

CHARM’s evaluation quantifies the gap: output-level detectors caught only 18.5% of cascaded errors across the paper’s test pipelines and reduced error propagation by the same 18.5%, compared to an 82.1% reduction using mid-chain intervention. That is a 4.4× improvement, and the number is worth sitting with. If your monitoring is output-only, you are missing over 80% of the class of error most likely to produce plausible-sounding falsehoods.

Four conditions that define a cascading hallucination

CHARM formalizes cascading hallucination via four conditions that must all hold simultaneously (arXiv:2606.04435, §3):

1. Factual error at stage i. A retrieval or reasoning step introduces a claim that is false or unsupported by the source data.
2. Corrupted context propagated as valid. The error is passed to stage i+1 not as a candidate claim but as established context, indistinguishable from verified facts.
3. Local coherence under global falsity. Downstream output is conditionally coherent given the corrupted input. The reasoning is valid; the premises are not.
4. Monotonically non-decreasing error magnitude. Errors do not self-correct across hops. They accumulate or amplify.

The third condition is the critical one. It is what makes cascading hallucination invisible to per-step detectors: each stage, examined in isolation, looks correct. SelfCheckGPT sees a response that follows from its context. RAGAS sees high faithfulness scores. The problem is that the context itself was poisoned two hops ago, and no single-step check traces back that far.
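A toy illustration of that blind spot, under loud assumptions: exact string matching stands in for a real entailment or faithfulness model, and the company facts, function names, and hop layout are all hypothetical. The point is only that a check against the step's own context passes while a check against the original sources fails.

```python
# Original source data available to the pipeline (hypothetical facts).
SOURCES = frozenset({
    "Acme was founded in 1999.",
    "Acme is headquartered in Oslo.",
})


def faithful_to_context(claims, context):
    """Per-step check: every claim appears in the context this step was given."""
    return all(claim in context for claim in claims)


def traceable_to_sources(claims, sources):
    """Chain-level check: every claim appears in the original source data."""
    return all(claim in sources for claim in claims)


# Hop 1 emits one grounded claim and one fabrication.
hop1_output = [
    "Acme was founded in 1999.",
    "Acme was acquired by Globex in 2005.",  # not in SOURCES
]

# By hop 3 the fabrication is part of the context, carried as established fact.
hop3_context = set(SOURCES) | set(hop1_output)
hop3_output = ["Acme was acquired by Globex in 2005."]

output_level_ok = faithful_to_context(hop3_output, hop3_context)  # True
chain_level_ok = traceable_to_sources(hop3_output, SOURCES)       # False
```

The hop-3 answer is faithful to its context and still untraceable to any source, which is the gap the third condition describes.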

Why existing detection tools fail on cascaded errors

CHARM tested against three categories of existing approaches, all of which are widely deployed in production RAG systems:

SelfCheckGPT and self-consistency methods compare multiple sampled outputs or check response consistency. These catch contradictions within a single step but cannot detect errors that are consistently propagated across steps. If stage 3 confidently repeats the fabrication from stage 1, self-consistency flags nothing.
RAGAS-style faithfulness metrics measure whether a response is supported by its retrieved context. In a cascade, it is. The context was corrupted upstream, but the response is faithful to that corrupted context.
LLM self-correction, where the model reviews its own output, fails for a different reason: confirmation bias. As the CHARM paper notes, the agent reinforces the cascade because downstream reasoning appears logically sound relative to the corrupted intermediate context. The model is checking its work against the same poisoned premises that produced the work.

A comprehensive hallucination survey (arXiv:2510.06265) categorizes detection into five approaches (retrieval-, uncertainty-, embedding-, learning-, and self-consistency-based) and mitigation into four (prompt, retrieval, reasoning, model-centric training), concluding that no single approach suffices. CHARM’s chain-level framing explains why: these approaches are designed for per-step or per-output checks, not for tracking error propagation across a multi-hop DAG.

CHARM’s detection architecture

CHARM models the multi-step reasoning pipeline as a weighted directed acyclic graph where nodes are pipeline stages and edge weights represent the error propagation probability P(ε_{i+1}|ε_i). Cascade detection identifies paths where the cumulative product of propagation probabilities exceeds a safety threshold θ (arXiv:2606.04435).
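A minimal sketch of that path check, assuming the edge weights have already been estimated somehow (the paper's estimation procedure is not reproduced here). The function name `risky_paths`, the stage names, and every number are hypothetical.

```python
def risky_paths(edges, source, theta):
    """Return (path, probability) for each path from `source` whose cumulative
    propagation probability exceeds `theta`.

    `edges` maps (from_stage, to_stage) to P(error at to_stage | error at from_stage).
    The graph is assumed to be acyclic.
    """
    graph = {}
    for (u, v), p in edges.items():
        graph.setdefault(u, []).append((v, p))

    flagged = []
    stack = [([source], 1.0)]
    while stack:
        path, prob = stack.pop()
        for nxt, weight in graph.get(path[-1], []):
            new_prob = prob * weight
            # Weights are probabilities (<= 1), so the product never grows:
            # once a path drops to theta or below, its extensions cannot recover.
            if new_prob > theta:
                new_path = path + [nxt]
                flagged.append((new_path, new_prob))
                stack.append((new_path, new_prob))
    return flagged


edges = {
    ("retrieve_1", "reason_1"): 0.9,
    ("reason_1", "retrieve_2"): 0.8,
    ("retrieve_2", "answer"): 0.5,
    ("reason_1", "answer"): 0.3,
}
flagged = risky_paths(edges, "retrieve_1", theta=0.5)
```

With these made-up weights, the paths ending at `reason_1` (0.9) and `retrieve_2` (about 0.72) are flagged, while both routes to `answer` fall below 0.5 and are not.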

The framework has four components:

Stage-level fact verification. Each retrieval and reasoning step is checked independently against source data, similar to existing grounding checks but applied at every node in the DAG rather than only at the output.
Cross-stage consistency tracking. Claims from earlier stages are traced through downstream reasoning. If stage 4 cites a “fact” that stage 1 fabricated, the consistency tracker flags the provenance gap.
Confidence propagation monitoring. Rather than treating each stage’s confidence score in isolation, CHARM models how confidence degrades (or, tellingly, does not degrade) as errors propagate. A stage that produces high-confidence output from low-confidence input is a cascade signal.
Cascade resolution triggering. When the cumulative propagation probability exceeds θ, a resolution step intervenes: re-retrieving, re-verifying, or halting the chain.

Component ablations in the paper confirm that each module contributes to overall cascade coverage; removing any one degrades detection (arXiv:2606.04435).

Benchmark results

CHARM was evaluated on three established multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA) and a custom adversarial dataset, across LangChain agentic pipeline configurations:

Metric	CHARM (mid-chain)	Output-level detectors
Cascade detection rate	89.4%	18.5%
Error propagation reduction	82.1%	18.5%
False positive rate	5.3%	not reported
Per-stage latency overhead	215 ms ± 18 ms	baseline

The 4.4× gap in error-propagation reduction is the headline number, but the false positive rate matters for production viability. At 5.3%, roughly one in twenty pipeline stages will trigger a false cascade alert. Whether that is acceptable depends on the cost of your resolution step: if resolution means re-retrieving a document and re-running a reasoning pass, a 5.3% false-positive rate on a 10-hop pipeline means roughly one unnecessary intervention every two runs. That is manageable. If resolution means flagging for human review, it is almost certainly too noisy.
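The arithmetic behind that estimate can be checked directly. This sketch assumes the 5.3% rate applies independently to each stage, which is a simplification the paper does not confirm; the function name is hypothetical.

```python
def false_alert_stats(hops, false_positive_rate):
    """Expected false alerts per run, and the chance of at least one,
    assuming each stage triggers a false alert independently."""
    expected = hops * false_positive_rate
    p_at_least_one = 1 - (1 - false_positive_rate) ** hops
    return expected, p_at_least_one


expected, p_any = false_alert_stats(hops=10, false_positive_rate=0.053)
# expected is about 0.53 false alerts per run; p_any is about 0.42
```

Roughly 0.53 expected alerts per run is where "one unnecessary intervention every two runs" comes from; the chance that a given 10-hop run sees at least one is a little lower, at about 42%.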

The detection rate of 89.4% also means roughly one in ten cascades slips through. This is not a solved problem; it is a substantially better tool for a problem that was previously nearly invisible.

What mid-chain intervention costs

The 215 ms per-stage overhead is the number that should make engineering teams pause. In a 5-hop pipeline, that is roughly one second of added latency before accounting for any re-retrieval or re-reasoning triggered by cascade resolution. In a 10-hop pipeline, it is roughly two seconds. For real-time conversational agents, that is a serious cost. For batch analytical workflows, it is likely negligible.
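A quick budget calculation, using only the reported 215 ms mean and ignoring the ±18 ms spread and any resolution cost. The 1,500 ms budget is a hypothetical figure for a conversational agent, not a number from the paper.

```python
PER_STAGE_OVERHEAD_MS = 215  # mean per-stage overhead reported for CHARM


def added_latency_ms(hops):
    """Verification overhead for a chain of `hops` stages, before any resolution."""
    return hops * PER_STAGE_OVERHEAD_MS


def max_hops_within(budget_ms):
    """Largest hop count whose verification overhead fits the latency budget."""
    return budget_ms // PER_STAGE_OVERHEAD_MS


overheads = {hops: added_latency_ms(hops) for hops in (5, 10, 20)}
# {5: 1075, 10: 2150, 20: 4300}

affordable_hops = max_hops_within(1500)  # 6
```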

The larger cost is architectural rather than latency. Mid-chain intervention requires:

Provenance metadata at every stage. Each node in the DAG needs to carry structured claims, their source references, and their confidence scores. LangChain and LlamaIndex do not surface this by default. Adding it means wrapping or replacing framework-provided chain primitives.
A cross-stage consistency store. Something needs to hold the state that tracks which claims from stage 1 are being cited at stage 4. That is either a shared memory layer or an external state store, and it is not something most RAG deployments currently include.
A decision policy for the threshold θ. Set it too low and every pipeline triggers resolution. Set it too high and you miss the cascades you are trying to catch. The paper does not prescribe a default; teams will need to calibrate against their own error distributions.
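One way to approach that calibration, sketched under assumptions: you have a set of past runs labeled as cascade or clean, each with its cumulative propagation probability, and you want the highest θ that still meets a target detection rate. The function name, the labeled runs, and the candidate thresholds are all hypothetical.

```python
def calibrate_theta(runs, candidates, min_detection):
    """Pick the highest candidate theta whose detection rate meets the target.

    `runs` is a list of (cumulative_probability, is_cascade) pairs.
    Returns (theta, detection_rate, false_positive_rate), or None if no
    candidate reaches `min_detection`.
    """
    positives = sum(1 for _, is_cascade in runs if is_cascade)
    negatives = len(runs) - positives
    best = None
    for theta in sorted(candidates):
        hits = sum(1 for p, is_cascade in runs if is_cascade and p > theta)
        false_alerts = sum(1 for p, is_cascade in runs if not is_cascade and p > theta)
        detection = hits / positives if positives else 0.0
        fpr = false_alerts / negatives if negatives else 0.0
        # Raising theta can only lower both rates, so the last qualifying
        # candidate is the one with the fewest false alerts.
        if detection >= min_detection:
            best = (theta, detection, fpr)
    return best


labeled_runs = [
    (0.90, True), (0.70, True), (0.55, True),
    (0.60, False), (0.30, False), (0.20, False), (0.10, False),
]
choice = calibrate_theta(labeled_runs, candidates=[0.2, 0.4, 0.5, 0.65], min_detection=1.0)
# (0.5, 1.0, 0.25)
```

On this toy data, θ = 0.5 keeps every labeled cascade above the threshold while flagging one of the four clean runs; θ = 0.65 would miss a cascade.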

Adding provenance tracking to existing pipelines

The paper does not detail a step-by-step integration path for production LangChain deployments (it describes CHARM as operating “alongside” existing pipelines), but the DAG model suggests a practical approach:

Step 1: Instrument claim extraction. After each retrieval and reasoning step, extract the specific factual claims the step produces. This does not require CHARM; it is structured logging. Each claim gets a unique ID, a source reference, and a confidence score.

Step 2: Add cross-hop claim tracking. Before a downstream step runs, check whether its input context contains claims from prior steps. Flag any claim that was not verified against source data at the point of origin. This is the provenance gap that cascading hallucination exploits.

Step 3: Propagate confidence scores. If hop 2 produces high-confidence output but its input contains a low-confidence claim from hop 1, surface that mismatch. The downstream hop’s confidence should be conditioned on the upstream claim’s reliability, not computed independently.

Step 4: Set a cumulative propagation threshold. When the product of per-hop propagation probabilities exceeds your tolerance, trigger re-retrieval or halt. The threshold is domain-specific: a medical diagnosis agent should use a lower threshold than a news summarizer.
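The four steps can be sketched together in plain Python. This is not CHARM's implementation and uses no LangChain or LlamaIndex APIs. Every name here (`Claim`, `provenance_gaps`, `conditioned_confidence`, `should_intervene`) is hypothetical, claim extraction itself is assumed to happen elsewhere, and two rules are assumptions of this sketch rather than the paper's: "verified at origin" is approximated by having a source reference, and confidence is conditioned by multiplying with the weakest upstream claim.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


@dataclass(frozen=True)
class Claim:
    """Step 1: one extracted claim with its provenance metadata."""
    text: str
    hop: int
    source_ref: Optional[str]  # None means no source document backs the claim
    confidence: float
    claim_id: str = field(default_factory=lambda: uuid4().hex)

    @property
    def verified(self) -> bool:
        # Assumption: a claim counts as verified only if it cites a source.
        return self.source_ref is not None


def provenance_gaps(context):
    """Step 2: upstream claims in the input context that were never verified."""
    return [claim for claim in context if not claim.verified]


def conditioned_confidence(raw, context):
    """Step 3: scale a hop's raw confidence by its weakest upstream claim."""
    weakest = min((claim.confidence for claim in context), default=1.0)
    return raw * weakest


def should_intervene(propagation_probs, theta):
    """Step 4: intervene when the cumulative propagation probability exceeds theta."""
    cumulative = 1.0
    for p in propagation_probs:
        cumulative *= p
    return cumulative > theta


# Hop 1 produced one sourced claim and one unsourced, low-confidence claim.
hop1_claims = [
    Claim("Acme was founded in 1999.", hop=1, source_ref="doc-17", confidence=0.9),
    Claim("Acme was acquired by Globex in 2005.", hop=1, source_ref=None, confidence=0.4),
]

gaps = provenance_gaps(hop1_claims)                    # the unsourced acquisition claim
adjusted = conditioned_confidence(0.95, hop1_claims)   # 0.95 * 0.4, far below the raw 0.95
halt = should_intervene([0.9, 0.8], theta=0.5)         # product is about 0.72, so True
```

Multiplying by the weakest upstream confidence is deliberately pessimistic. A real deployment would pick its own conditioning rule and a real verification step in place of the source-reference shortcut.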

None of this requires adopting CHARM as a library. The value of the paper is the formal model and the empirical evidence that output-level checks miss more than 80% of cascaded errors. The implementation is an engineering problem, and it is one that teams running multi-hop agentic RAG should start thinking about now, because the alternative is deploying agents that confidently propagate fabrications across reasoning chains while every per-step monitor reports green.

A separate line of work takes a different approach to the same problem. A January 2025 study (arXiv:2501.13946) demonstrated that orchestrating multiple specialized AI agents in sequence, each using a distinct LLM, with a final agent evaluating hallucination KPIs, can reduce fabrication rates through structured JSON inter-agent communication via the OVON framework. That approach stacks independent agents as reviewers rather than tracking provenance within a single chain. Whether stacking or tracking proves more practical in production likely depends on pipeline length and latency budget: stacking adds entire model calls, while provenance tracking adds overhead per hop but keeps the chain intact. The two approaches are complementary, not competing, and neither is a complete solution on its own.

A counterpoint: cascades that do not always amplify

CHARM’s fourth condition assumes error magnitude is monotonically non-decreasing across hops. A second June 2026 preprint complicates that assumption. Hallucination Cascade (arXiv:2606.07937), from Jamshidi and colleagues, tracked factual inconsistency as responses passed between separate agents in multi-agent chains rather than within a single retrieval pipeline. Across GPT-5.3, DeepSeek-V3, and LLaMA-3-70B-Instruct, the normalized hallucination score fell from 0.422 at the first agent to 0.272 at the final agent in three-agent chains. Deeper chains suppressed fabrication instead of compounding it. [Updated June 2026]

That is a difference in mechanism between the two papers, not a contradiction. CHARM models one agent carrying its own corrupted context forward, where nothing re-examines the poisoned premise. A multi-agent chain inserts a fresh model at each hop, and a downstream agent that did not generate the upstream claim can challenge it. The later agents behave as implicit reviewers, which is the same intuition behind stacked-reviewer designs like OVON.

The suppression is not free. Jamshidi’s group reports factual accuracy slipping from 0.789 to 0.769 as chains deepened, because an agent that overwrites a predecessor’s hallucination sometimes overwrites a correct answer too. They also found the effect is domain-sensitive, with lower hallucination on scientific prompts and higher rates on abstract or open-ended ones. A defense tuned on one corpus will not carry its numbers to another, which echoes CHARM’s own caveat about transfer beyond multi-hop QA.

Read side by side, the two results sharpen the guidance. If your architecture is a single agent looping over retrievals, assume cascades amplify and instrument the chain the way CHARM prescribes. If it is several distinct agents passing messages, some self-correction comes for free, and the problem shifts toward preserving the correct answers the reviewers might trample. Groundy’s coverage of council-mode multi-agent voting found the same tension: adding reviewers cut hallucination at a measurable token cost, with diminishing returns past a point.

The bad retrieval is sometimes planted

Both papers treat the stage-one error as accidental: a noisy retriever, a misranked chunk, a model that guessed. In adversarial settings it is not. A poisoned corpus can guarantee the hop-one fabrication that CHARM’s first condition requires, using documents engineered to pass stage-level fact verification. Work on conflict-aware retriever poisoning shows attackers can inject claims that read as well-sourced precisely so they survive the per-step grounding check. Cascading hallucination is the propagation half of that attack: poison the retrieval, and the agent’s own reasoning carries the payload the rest of the way. Provenance tracking helps here for a reason CHARM does not stress, which is that it forces every downstream citation back to a named source and raises the cost of a fabrication that has no real provenance to point at.

Treat CHARM’s numbers as a preprint

CHARM is a single-author preprint with no peer review and, as of late June 2026, no independent reproduction. Its 89.4% detection rate comes from the author’s own pipeline against self-selected benchmarks, and the false-positive accounting omits the output-level baseline entirely. The DAG formalism and the four-condition definition are the durable contribution; the headline percentages are a first data point, not a settled benchmark. Detection research moves fast and disagrees with itself often. Recent work questioning whether fixed-layer probes detect hallucination reliably is a reminder that even the per-stage verifier CHARM depends on is an open problem, not a solved primitive. A cascade detector is only as good as the stage-level check feeding it.

Frequently Asked Questions

Does cascading hallucination occur in single-agent tool-calling loops, or only in multi-agent setups?

The CHARM formal definition applies to any pipeline where output from one reasoning step becomes context for the next, regardless of whether a single agent iterates over tool calls or multiple agents pass messages. The DAG model treats both cases identically: each tool-call cycle is a node, and the error propagation probability P(ε_{i+1}|ε_i) applies per edge. Under that model, a lone ReAct agent running five retrieval-action cycles faces the same cascade risk as a five-agent chain, although the Hallucination Cascade results discussed above suggest that chains of distinct agents may dampen errors in practice.

How does the OVON multi-agent stacking approach differ from CHARM’s provenance tracking?

OVON (arXiv:2501.13946) routes each agent’s output through a separate reviewer running a different LLM, using structured JSON messages between stages. This adds a full inference call per verification point rather than per-hop metadata tracking. For a 5-hop pipeline, three stacked reviewers could cost three additional model invocations, while CHARM’s provenance overhead is roughly 1 second of latency across the same pipeline. OVON’s weakness is that the reviewers themselves can hallucinate, and there is no cross-hop provenance chain linking original sources to final output.

What types of cascades does CHARM’s 10.6% miss rate likely represent?

The paper does not isolate failure modes of its undetected cascades, but the DAG model suggests two structural candidates. First, slow-accumulation paths where every per-hop propagation probability is moderate: a product of probabilities can only shrink with each hop, so the cumulative value stays below θ and never triggers resolution, even though a small error keeps compounding into the final output. Second, fabricated claims that are plausible enough to pass stage-level fact verification because the retrieval itself returned a misleading document. In both cases, no mid-chain detector can flag what looks correct at every single checkpoint.

Which existing detection tools map to which of the five survey categories?

The comprehensive survey (arXiv:2510.06265) groups tools into five categories: retrieval-based (claim-vs-source comparison), uncertainty-based (model logit confidence), embedding-based (vector similarity between output and retrieved chunks), learning-based (trained classifiers labeling output as factual or fabricated), and self-consistency-based (multi-sample agreement checks). SelfCheckGPT spans the self-consistency and learning categories; RAGAS faithfulness scoring is primarily retrieval-based. The survey finds that hybrid combinations of categories consistently outperform any single approach, which is consistent with why CHARM’s multi-component architecture outperforms output-only checks.

How does cascade risk scale as agentic pipelines grow beyond 10 hops?

Latency scales linearly (a 20-hop pipeline adds over 4 seconds of CHARM overhead before resolution costs), but audit complexity can grow much faster, because the number of distinct error-propagation paths through the DAG can grow combinatorially with node count. Teams building long-running agents for legal research or multi-domain analysis may need to partition a monolithic chain into audited sub-chains of 5 to 7 hops, each with its own threshold θ, rather than running a single unbroken DAG. This mirrors checkpointing in distributed computation: it trades some coverage for bounded worst-case latency and simpler provenance state.
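A minimal sketch of that partitioning for a linear chain, assuming stages can be cut at any boundary (real pipelines may have dependencies that constrain where a cut is safe). The function name and the stage labels are hypothetical.

```python
from math import ceil


def partition_chain(stages, max_len=7):
    """Split a linear chain into the fewest balanced sub-chains of at most `max_len` stages."""
    n = len(stages)
    if n == 0:
        return []
    parts = ceil(n / max_len)
    base, extra = divmod(n, parts)
    sub_chains, start = [], 0
    for i in range(parts):
        # The first `extra` sub-chains take one additional stage each.
        size = base + (1 if i < extra else 0)
        sub_chains.append(stages[start:start + size])
        start += size
    return sub_chains


stages = [f"hop_{i}" for i in range(1, 21)]  # a 20-hop chain
sub_chains = partition_chain(stages)
sizes = [len(chain) for chain in sub_chains]  # [7, 7, 6]
```

Balancing the split avoids a short trailing sub-chain: a 15-hop chain becomes three sub-chains of five rather than 7, 7, and 1.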