
Do Multimodal RAG Models Ignore Late Evidence? A Primacy Bias Test

Multimodal RAG readers lose 16 to 26 percentage points when the correct evidence sits at the end of context, and standard rerankers do not close the gap.


Yes, by a wide margin. On three open 7B/8B vision-language readers run over a deployed knowledge-base VQA pipeline, placing the gold (correct) passage at the start of the retrieved context rather than the end raises accuracy by 16 to 26 percentage points, and the standard retrieval-side fixes leave that gap untouched. “Lost at the End” (arXiv:2606.16494), a preprint whose v2 landed 26 June 2026, reports the result and reframes the established U-shaped position-bias curve as a primacy curve for deployed multimodal RAG.

How does the probe isolate position from retrieval?

This is a controlled probe of reader-side position dependence, not a benchmark horse race. The pipeline mirrors a deployed KB-VQA stack: a frozen retriever returns a pool of candidate passages per query from a Wikipedia-scale knowledge base, and a frozen, greedily-decoded vision-language reader answers from the image, the question, and a k-passage subset of those candidates, with k up to 20. Three open-source 7B/8B vision-language readers are tested across two KB-VQA benchmarks (v1 abstract).

The instrument is a gold-position protocol. For each question the authors build three prompts that are identical except for the index of the gold passage: slot 0 in the “first” cell, the middle slot in the “middle” cell, and the last slot in the “last” cell, with distractors identical in identity and ordering across cells. Because nothing varies but gold’s index, the position effect is a within-prompt permutation, which lets the authors isolate the reader’s response to position from retrieval, scoring, and prompt-composition confounds.
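The construction of the three cells can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function name `build_gold_position_cells` is hypothetical, and defining the "middle" slot as `k // 2` is an assumption, since the article does not say how the middle index is chosen.

```python
from typing import Dict, List


def build_gold_position_cells(gold: str, distractors: List[str]) -> Dict[str, List[str]]:
    """Return three passage orderings that differ only in the gold passage's slot.

    The distractors keep the same identity and relative order in every cell.
    """
    k = len(distractors) + 1  # total passages shown to the reader
    # Assumption: "middle" means index k // 2; the source does not define it.
    slots = {"first": 0, "middle": k // 2, "last": k - 1}
    cells: Dict[str, List[str]] = {}
    for name, slot in slots.items():
        passages = list(distractors)  # copy so every cell starts from the same order
        passages.insert(slot, gold)   # only the gold index changes between cells
        cells[name] = passages
    return cells


cells = build_gold_position_cells("GOLD", ["d1", "d2", "d3", "d4"])
for name, passages in cells.items():
    print(name, passages)
```

Because each cell is built from the same distractor list, any accuracy difference between cells can only come from where the gold passage sits.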

How big is the primacy gap?

Accuracy falls with gold position. Across every reader-by-benchmark cell, gold-at-first beats gold-at-last by 16 to 26 percentage points (v1 abstract).
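For readers who want to compute the same quantity on their own pipeline, the gap is simply first-cell accuracy minus last-cell accuracy. The sketch below assumes per-question correctness flags have already been collected for each cell; the helper names and the toy numbers are illustrative and are not the paper's data.

```python
from typing import Dict, List


def accuracy(correct: List[bool]) -> float:
    """Accuracy in percent over a list of per-question correctness flags."""
    return 100.0 * sum(correct) / len(correct)


def primacy_gap(results: Dict[str, List[bool]]) -> float:
    """Gold-at-first accuracy minus gold-at-last accuracy, in percentage points."""
    return accuracy(results["first"]) - accuracy(results["last"])


# Toy correctness flags for five questions (made-up values, not from the paper).
toy = {
    "first": [True, True, True, False, True],
    "middle": [True, True, False, False, True],
    "last": [True, False, False, False, True],
}
gap = primacy_gap(toy)
print(f"primacy gap: {gap:.1f} pp")
```

A positive gap means the reader does better when the gold passage leads the context; the paper's reported range corresponds to this quantity measured per reader-by-benchmark cell.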

Where long-context text LLMs lose the middle and recover the end, these multimodal readers lose the end: the shape is primacy, not a U, and it holds across the readers and benchmarks tested.

Why do rerankers and diversification fail to fix it?

On a frozen reader, three cheap, training-free retrieval-side fixes fail to close the gap (v1 abstract): MMR diversification, oracle reranking (forcing the gold passage into a chosen slot), and rank-based distractor reordering each leave it intact, with no separable improvement. The usable lesson is narrower than “rerankers are useless”: ordering fixes that operate upstream of the reader do not shrink the gap.
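To make the first of those fixes concrete, here is a minimal sketch of textbook greedy Maximal Marginal Relevance over precomputed similarities. It is the standard algorithm, not the paper's configuration: the function name `mmr_order`, the trade-off weight `lam`, and the toy similarity values are all assumptions for illustration.

```python
from typing import List, Sequence


def mmr_order(query_sim: Sequence[float],
              doc_sim: Sequence[Sequence[float]],
              k: int,
              lam: float = 0.7) -> List[int]:
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    query_sim[i]  : similarity of passage i to the query
    doc_sim[i][j] : similarity between passages i and j
    Returns the indices of up to k selected passages, in selection order.
    """
    selected: List[int] = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy values: passages 0 and 1 are near-duplicates, passage 2 is distinct.
query_sim = [0.9, 0.85, 0.5]
doc_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
order = mmr_order(query_sim, doc_sim, k=3, lam=0.5)
print(order)
```

MMR changes which passages are kept and in what order, but the reader still receives a slot-ordered prompt, which is why a fix of this kind does not by itself address a slot-level bias.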

Where does the bias actually live?

When the reader is wrong, the answer it produces tends to come from whatever passage sits at the front of the prompt. Two ablations pin the locus to prompt slot 0 of the instruction-tuned reader (v1 abstract).

An image-position ablation tests whether primacy is driven by image-token proximity to early passages: if it were, moving the image from the start of the prompt to the end should shrink the gap. It does not, so image-token proximity is rejected as the primary driver. A distractor-shuffle ablation then dissociates slot from similarity, showing the reader follows the slot, not the retrieval rank.

The multimodal setting is not inventing this bias from nothing. A text-only control shows primacy already present in pure text, and the multimodal setting amplifies it by 2.2 to 4.5 times (v1 abstract). The authors flag a caveat: the text-only and multimodal settings draw on different corpora, so modality and corpus are confounded, and the amplification should be read as a property of the multimodal KB-VQA setting rather than of modality alone.

Does this refute the U-shape from Yao et al.?

Not a refutation, a scoping correction. Yao et al.’s 2025 preprint, “Who is in the Spotlight”, reported a U-shaped accuracy curve with respect to evidence position and introduced a Position Sensitivity Index to quantify the bias. It also found that multimodal interactions intensify the bias relative to unimodal settings and that it grows logarithmically with retrieval range (abstract).

The two results can coexist because the settings differ. Yao et al. measured general multimodal comprehension; “Lost at the End” measures deployed KB-VQA with a controlled gold-position protocol. The new paper attributes the differing shape to setting and distractor scale, not to a measurement error in the earlier work. Where they agree is the part practitioners should note: multimodal conditioning intensifies an already-present text-mode positional bias.

What actually mitigates primacy bias?

The clearest training-free mitigation in the literature routes retrieval by cross-modal uncertainty before fusing evidence. Self-Aware-MRAG (OpenReview, under review) uses per-evidence uncertainty to decide whether to skip retrieval or pull text, image, or both, then applies relevance-guided reordering and adaptive decay reweighting at fusion. Across OK-VQA and four MRAG benchmarks it reports +17.1 pp attribution precision and a 49.6% reduction in position bias over its strongest competitor (OpenReview abstract). Those are author-reported numbers from an under-review submission, unreproduced here; treat them as a design direction, not a benchmark.

The paper’s own mechanism finding implies a cheaper heuristic: if you must place high-value evidence somewhere privileged, slot 0 is where the reader actually looks. The paper also flags reader-side interventions it deliberately did not test as natural next steps, among them fine-tuning against position bias, attention calibration, attention-based reranking, and permutation-aware listwise ranking. The common thread is that the lever is on the reader side, not the retriever side.
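The slot-0 heuristic is easy to prototype. The sketch below assumes each passage arrives with some confidence score; the score source, the function name `place_best_at_slot0`, and the example values are hypothetical. Treat it as something to evaluate with the gold-position protocol, not as a verified fix.

```python
from typing import List, Tuple


def place_best_at_slot0(passages: List[Tuple[str, float]]) -> List[str]:
    """Move the highest-confidence passage to slot 0 and keep the rest in order.

    Each item is (passage_text, confidence); the confidence score is an
    assumed input, for example from a reranker or an uncertainty estimate.
    """
    if not passages:
        return []
    best_idx = max(range(len(passages)), key=lambda i: passages[i][1])
    best_text = passages[best_idx][0]
    rest = [text for i, (text, _) in enumerate(passages) if i != best_idx]
    return [best_text] + rest


ordered = place_best_at_slot0([("a", 0.2), ("b", 0.9), ("c", 0.5)])
print(ordered)
```

The heuristic only helps to the extent that the confidence score actually identifies the gold passage, so its benefit depends on the quality of that signal.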

Engineering takeaway: retire recall@k for deployed KB-VQA

The paper’s design argument is that recall@k is the wrong metric for deployed KB-VQA because it hides slot-sensitivity (v1 abstract). A retriever can surface the gold passage among its top candidates and still lose, because the reader will not use it unless it lands near the front of the prompt. Recall measures whether retrieval finds the evidence; it says nothing about whether the reader can act on where the evidence landed.
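A toy example shows why recall@k cannot see the problem. The helper below is a minimal single-gold version of the metric written for illustration, and the passage lists are made up.

```python
from typing import List


def recall_at_k(retrieved: List[str], gold: str, k: int) -> float:
    """Return 1.0 if the gold passage is in the top-k, else 0.0 (single-gold case)."""
    return 1.0 if gold in retrieved[:k] else 0.0


gold_first = ["GOLD", "d1", "d2", "d3"]
gold_last = ["d1", "d2", "d3", "GOLD"]

r_first = recall_at_k(gold_first, "GOLD", k=4)
r_last = recall_at_k(gold_last, "GOLD", k=4)
print(r_first, r_last)
```

Both orderings score identically on recall@k, yet they correspond to the best and worst cells of the gold-position protocol, so the metric assigns the same value to contexts the reader handles very differently.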

The practical reframe is to treat evidence order as a first-class, reader-side variable rather than an afterthought of top-k retrieval. That means budgeting for reader-side intervention (confidence-ranked slot-0 placement, position-aware fusion, uncertainty-routed retrieval) instead of assuming a wider context window or a stronger reranker absorbs the problem. The authors release the gold-position protocol as a controlled instrument for evaluating exactly these interventions, so the cost of testing your own pipeline is low.

Frequently Asked Questions

Which specific readers, benchmarks, and retriever did the primacy probe use?

Three instruction-tuned open readers: Qwen2.5-VL-7B, InternVL3-8B, and Qwen3-VL-8B, run over the InfoSeek (98K-passage) and E-VQA (51K-passage) knowledge bases, with PreFLMR ViT-G returning the top-50 candidates per query. That gives six reader-by-benchmark cells, and the primacy gap appears on every one.

How does the 16-26 pp gap compare to the position bias Yao et al. reported?

Yao et al.’s Position Sensitivity Index ranged from 2 to 11 pp across MS-MARCO, ChartQA, and VEGA at 2 to 19 distractors, versus the new probe’s 16 to 26 pp gap at top-50 retrieval with k up to 20. The larger spread is consistent with the logarithmic distractor scaling Yao et al. themselves reported, so the two figures are compatible rather than contradictory.

Does the primacy gap hold on GPT-4o or other frontier readers?

The gold-position protocol ran on three open 7B/8B readers only, so frontier behavior is untested. Yao et al.’s earlier multimodal study did evaluate GPT-4o alongside Qwen2-VL-7B-Instruct, Llama-3.2-11B-Vision-Instruct, and MINICPM-v2.6 and found position bias present, but none of those models has been probed with the deployed-KB-VQA gold-position instrument.

If recall@k is the wrong metric for KB-VQA, what should teams measure instead?

The released gold-position protocol is the proposed instrument: hold the candidate pool fixed and permute only the gold passage’s slot, then compare reader accuracy at slot 0 against the last slot. A slot-position accuracy curve surfaces the slot sensitivity that recall@k hides, at the cost of roughly one extra inference pass per permutation cell.

Would scaling the reader beyond 8B erase the primacy gap?

The probe did not test beyond 8B, so frontier behavior is open. The authors cite Liu et al.’s text-LLM scaling results as evidence that scale partially flattens position bias in the text setting, and if multimodal readers inherit that trend, the recall@k critique weakens for large frontier readers. That is itself a reason to run the gold-position protocol on bigger models before extrapolating.
