Multimodal Knowledge Graph RAG vs Vector RAG: What MKG-RAG-Bench Shows

MKG-RAG-Bench, posted to arXiv on 24 June 2026 and accepted to KDD’26, is the first benchmark to isolate retrieval as the evaluation target when you bolt a multimodal knowledge graph onto a vector baseline. Its central finding reframes the question teams should be asking: retrieval is the bottleneck in this stack, and off-the-shelf retrievers built for unstructured text struggle once images and structured edges enter the picture. In other words, the knowledge-graph-plus-image pipeline needs a specific, measured justification rather than assumed superiority.

What does MKG-RAG-Bench actually measure?

Most multimodal RAG benchmarks score the end-to-end answer and stop there. MKG-RAG-Bench isolates the retrieval step. It is built from two multimodal knowledge graphs, one general-domain and one medical, paired with question sets that are structurally grounded in the graph so that each query has exact supervision over what the correct retrieved evidence should be. That construction is what lets the benchmark score retrievers independently of the generator, and it is why the paper treats retrieval as a first-class target rather than a black box buried inside a chain.
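To make "scoring the retriever independently of the generator" concrete, here is a minimal sketch of recall@k computed against gold evidence labels. It is not the benchmark's own scoring code: the metric choice, the function names, and the node IDs are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """Fraction of gold evidence IDs found in the top-k retrieved IDs."""
    if not gold:
        raise ValueError("each query needs at least one gold evidence ID")
    top_k = set(retrieved[:k])
    return len(top_k & gold) / len(gold)


def mean_recall_at_k(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every query that has gold labels."""
    scores = [recall_at_k(runs.get(qid, []), g, k) for qid, g in gold.items()]
    return sum(scores) / len(scores)


# Hypothetical data: query IDs, node IDs, and rankings are made up.
gold = {"q1": {"img:7", "txt:3"}, "q2": {"tbl:2"}}
runs = {"q1": ["txt:3", "txt:9", "img:7"], "q2": ["txt:1", "txt:4", "tbl:2"]}

print(mean_recall_at_k(runs, gold, k=2))
print(mean_recall_at_k(runs, gold, k=3))
```

No generator is involved anywhere in this computation, which is the point: a retriever can be compared against another retriever on the same gold labels before any answer is produced.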

The benchmark is not assembled by hand. The authors run an LLM-based curation pipeline that filters low-utility knowledge before it enters the test set, generates queries grounded in graph structure, and covers diverse modality configurations. The medical domain matters specifically because clinical and biomedical knowledge is where the “add more modalities for more factuality” assumption gets stress-tested hardest. A wrong image association in a clinical note is not the same class of error as a wrong image association in a general web document, and the benchmark lets you see which retrievers survive that gap.

Why is retrieval the bottleneck?

The paper’s headline claim, stated plainly, is that effective multimodal retrieval remains “challenging yet crucial” for end-to-end performance, and that retrieval quality strongly determines generation outcomes. The mechanism is that multimodal knowledge is heterogeneous: text, images, tables, and equations do not align cleanly, and the retrievers most teams already operate were tuned for unstructured text corpora. The abstract describes multimodal knowledge as often “poorly served by retrievers designed for unstructured corpora,” a direct way of saying that the embedding stack you already run was not designed for this job.

This is the point that gets lost in vendor marketing. A retriever that performs well on a text-only benchmark is being asked, in the multimodal KG setting, to rank evidence across modalities it was never trained to compare, to traverse structured edges it cannot see, and to resolve cross-modal correspondence that the underlying representations do not guarantee. The benchmark’s contribution is to make that failure visible and measurable, rather than letting it hide inside a single end-to-end accuracy number.

What does multimodal indexing actually cost?

The indexing cost of a multimodal knowledge graph is concrete, and it is the line item most teams under-budget. Consider the construction described in MAHA: nodes represent text, images, tables, equations, and graphs, and the knowledge graph encodes the cross-modal semantics and relationships between them. Each modality brings its own representation path. A vector-only pipeline over text chunks has one embedding model, one index, and one query path. The moment you add typed cross-modal edges and a second modality, you are maintaining a graph traversal layer on top of two or more embedding spaces, reconciling their rankings, and paying the latency of every hop. None of that is free, and none of it appears in the text-only GraphRAG cost numbers that get quoted in comparisons.
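One piece of that extra work, reconciling rankings from two embedding spaces, can be sketched with reciprocal rank fusion (RRF). RRF is a swapped-in, commonly used fusion technique, not something the source attributes to MAHA or MKG-RAG-Bench; the node IDs and the constant k = 60 are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda n: (-scores[n], n))


# Hypothetical results from a text index and an image index for one query.
text_hits = ["txt:3", "txt:9", "img:7"]
image_hits = ["img:7", "txt:3", "img:2"]

fused = reciprocal_rank_fusion([text_hits, image_hits])
print(fused)
```

Even this small step is a component the vector-only pipeline never needs, and it runs on every query, before any graph hop is paid for.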

Where does the plain vector baseline still win?

For text-heavy, single-modality, passage-shaped workloads, the vector baseline is the right tool and the graph layer is overhead with no payoff. Neo4j’s own comparison frames the tradeoff honestly: graph search adds depth and breadth traversal, which vector similarity cannot provide on its own, but at the cost of building and maintaining the graph layer. If the relevant answer is a contiguous passage and there is no structural relationship worth traversing, vector retrieval is the correct default.

The honest reading of MKG-RAG-Bench is not “vector is dead.” It is that vector retrieval fails in specific, identifiable ways once the question requires joining across modalities or following structured relationships, and that those failures stay invisible until you measure retrieval directly. The two pipelines are not substitutes; they fail differently:

Dimension	Vector RAG baseline	Multimodal KG-RAG
Embedding models	One, text	Multiple, per modality
Index structure	Single ANN index	ANN index(es) plus a graph traversal layer
Evidence join	Semantic similarity	Typed cross-modal edges
Known failure mode	Similar but not relevant	Noisy cross-modal correspondence
Dominant cost	Query latency	Indexing, alignment, multi-embedding storage

Can you reuse text-only GraphRAG wins here?

The most common conflation in this debate is worth answering directly: no, not without qualification. The widely cited GraphRAG cost and accuracy wins are text-only. Microsoft’s GraphRAG uses an LLM to build a knowledge graph from a text corpus and augments queries with community summaries and graph machine-learning output; the reported gains over Baseline RAG concentrate on connecting disparate pieces of information and summarizing broad semantic concepts across text. Separately, data.world’s benchmark reports a roughly 3x uplift in LLM answer accuracy when responses over SQL databases are backed by a knowledge graph. Neither system encoded images, neither maintained a modality-aware graph, and neither paid the alignment cost that MKG-RAG-Bench is testing.

When a vendor deck or a thought-leadership post quotes those numbers next to a multimodal KG-RAG pitch, it is doing the work of selling, not measuring. The gains in text GraphRAG come from hierarchical summarization and explicit relationship traversal over text entities. Both mechanisms transfer only partially to a setting where half the evidence is an image and the other half is an equation, and where the join between them is itself the thing that can be noisy.

Does adding images actually improve factuality?

The assumption that more modalities means more factuality is exactly the assumption MKG-RAG-Bench is built to test, and the surrounding literature gives reason to doubt it. The related mKG-RAG paper, accepted to SIGIR’26, motivates the whole line of work by observing that vanilla RAG-based visual question answering methods “frequently introduce irrelevant or misleading content, degrading answer accuracy and reliability.” That is the baseline MKG-RAG-Bench measures against.

The strongest evidence against the assumption comes from the RULE paper (ICLR 2026), which reports that more than 50% of entities in some multimodal knowledge graph benchmarks are affected by noisy correspondence. The examples are specific: “Mr. & Mrs. Smith” the movie gets conflated with “Will Smith and Mrs. Smith,” and a Cristiano Ronaldo query surfaces the Portuguese flag. If roughly half the entities in your knowledge graph carry a noisy association, the image you retrieve can mislead the generator as readily as ground it. That is not a retrieval-tuning problem. It is a knowledge-graph-construction problem, and it is the reason a benchmark that isolates retrieval is necessary in the first place.

How do you evaluate the tradeoff in production?

The standard IR metrics were designed for search, not for retrieval-augmented generation, and using them to justify a multimodal KG pipeline hides the cost dimension entirely. Practical RAG Evaluation argues that classical rank metrics such as nDCG, MAP, and MRR are a poor fit for RAG, and introduces a Cost-Latency-Quality (CLQ) lens instead. The CLQ frame is the right one for applying MKG-RAG-Bench findings: it forces retrieval quality, end-to-end latency, and per-query embedding and storage cost into the same comparison, rather than letting a pipeline win on accuracy while it silently loses on cost.
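A minimal way to keep all three dimensions in view is to report them together and check whether one pipeline dominates the other. The sketch below is not the CLQ formulation from Practical RAG Evaluation: the field names, the dominance check, and every number are placeholders chosen for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineReport:
    name: str
    recall_at_5: float          # quality: higher is better
    p95_latency_ms: float       # latency: lower is better
    cost_per_1k_queries: float  # cost: lower is better


def dominates(a: PipelineReport, b: PipelineReport) -> bool:
    """True if a is no worse than b on all three axes and better on at least one."""
    no_worse = (
        a.recall_at_5 >= b.recall_at_5
        and a.p95_latency_ms <= b.p95_latency_ms
        and a.cost_per_1k_queries <= b.cost_per_1k_queries
    )
    strictly_better = (
        a.recall_at_5 > b.recall_at_5
        or a.p95_latency_ms < b.p95_latency_ms
        or a.cost_per_1k_queries < b.cost_per_1k_queries
    )
    return no_worse and strictly_better


# Placeholder numbers only; they do not come from any paper or benchmark.
vector = PipelineReport("vector", 0.62, 120.0, 0.40)
mkg = PipelineReport("multimodal-kg", 0.71, 480.0, 1.90)

print(dominates(mkg, vector), dominates(vector, mkg))
```

When neither pipeline dominates, as with these placeholder values, the decision is a tradeoff that has to be argued per workload. That is precisely what an accuracy-only report conceals.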

This matters because the multimodal path is strictly more expensive to build, strictly more expensive to operate, and its quality win is conditional rather than guaranteed. A benchmark that reports accuracy alone lets the more expensive pipeline win on its strongest dimension while hiding the dimensions where it loses.

When should you build the multimodal KG pipeline?

The practitioner question is not “is multimodal KG-RAG better than vector RAG.” It is “for which query classes does the added cost buy a retrieval win that survives the noisy-correspondence problem.” Based on what MKG-RAG-Bench exposes, the answer narrows to specific cases: workloads where the answer requires joining a structured relationship that text chunks cannot express, where the relevant image is correctly associated with the entity, and where the latency and indexing cost of the graph layer is justified by a query class the vector baseline fails outright.

Everything else is the vector baseline with a measurement discipline attached. Ship the simpler pipeline first, measure retrieval directly rather than only end-to-end accuracy, and reach for the multimodal knowledge graph only when you can point at a specific query class where the vector store fails and the graph succeeds. The contribution of MKG-RAG-Bench is that it finally makes that comparison possible on numbers, not on vendor assertions.
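Finding "a specific query class where the vector store fails and the graph succeeds" can be sketched as a per-class tally over paired retrieval results. The class labels, the `Result` record, and the sample rows are hypothetical; in practice the hit flags would come from whatever retrieval metric you already compute.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Result(NamedTuple):
    query_class: str
    vector_hit: bool  # vector baseline retrieved the gold evidence
    graph_hit: bool   # multimodal KG pipeline retrieved the gold evidence


def graph_only_win_rate(results: List[Result]) -> Dict[str, float]:
    """Per class: share of queries the graph retrieves and the vector store misses."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.query_class] += 1
        if r.graph_hit and not r.vector_hit:
            wins[r.query_class] += 1
    return {c: wins[c] / totals[c] for c in totals}


# Hypothetical paired results, for illustration only.
results = [
    Result("passage_lookup", True, True),
    Result("passage_lookup", True, False),
    Result("cross_modal_join", False, True),
    Result("cross_modal_join", False, True),
]

print(graph_only_win_rate(results))
```

A class with a win rate near zero is a case for staying on the vector baseline; a class with a high rate is the kind of evidence that could justify the graph layer for that slice of traffic.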

Frequently Asked Questions

What concrete encoding paths does a multimodal KG add beyond a single text embedding model?

The modality-aware schema requires per-modality encoders that a vector store never touches. Per MAHA, images and embedded graphs are CLIP-encoded and stored as base64, tables are flattened to HTML, and equations are rendered as LaTeX, each as a separate node type with its own embedding path before any retrieval runs.
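The per-modality storage described above can be sketched as a small typed node schema. This is an illustration under stated assumptions, not MAHA's implementation: the class and function names are invented here, and the embedding step (CLIP for images) is deliberately left out because it depends on a model the sketch does not load.

```python
import base64
import html
from dataclasses import dataclass
from enum import Enum
from typing import List


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"
    EQUATION = "equation"


@dataclass
class Node:
    node_id: str
    modality: Modality
    payload: str  # base64 for images, HTML for tables, LaTeX for equations


def image_node(node_id: str, raw_bytes: bytes) -> Node:
    """Store image bytes as a base64 string; embedding is computed elsewhere."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return Node(node_id, Modality.IMAGE, encoded)


def table_node(node_id: str, rows: List[List[str]]) -> Node:
    """Flatten a table to a minimal HTML string, escaping cell contents."""
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return Node(node_id, Modality.TABLE, f"<table>{body}</table>")


def equation_node(node_id: str, latex: str) -> Node:
    """Keep the equation as its LaTeX source."""
    return Node(node_id, Modality.EQUATION, latex)


print(image_node("img:1", b"abc").payload)
print(table_node("tbl:1", [["a", "b"]]).payload)
```

Each constructor is a separate code path to test and maintain, which is the cost the single-encoder text pipeline avoids.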

What figures does the MKG-RAG-Bench abstract leave out?

The abstract publishes no accuracy deltas, latency numbers, or indexing-cost figures; those live in the full paper. Practitioners should pull per-retriever scores from the PDF rather than quoting a headline delta, because the abstract contains no percentage to cite.

Is the ‘MKG’ in MKG-RAG-Bench the same as Medical Knowledge Group?

No. MKG here stands for Multimodal Knowledge Graph, not Medical Knowledge Group, which is an unrelated biopharma communications firm. The collision matters because searches for ‘MKG RAG’ surface both, and the medical arm of the benchmark is unrelated to the company of that name.

How does MKG-RAG-Bench differ from mKG-RAG, also accepted in 2026?

mKG-RAG (SIGIR’26) is a system: a multimodal KG-RAG architecture for knowledge-intensive visual question answering. MKG-RAG-Bench (KDD’26) is a benchmark: a test set and protocol that scores retrieval across multiple retriever families and modality settings. The benchmark supplies the scoreboard that proposals like mKG-RAG previously had to assert rather than measure.

Which text-only GraphRAG results get misquoted next to multimodal pitches?

Three figures recur: Microsoft GraphRAG’s 26 to 97 percent token savings over Baseline RAG, data.world’s roughly 3x accuracy lift on 43 business questions, and a reported 28.6 percent time reduction at LinkedIn. All three are text-only; none encoded images or maintained a modality-aware graph, so none bounds what a multimodal pipeline will cost.