GraphRAG vs VectorRAG: Does the Graph Index Earn Its Cost?

A preprint from Galiński et al., updated June 8, finds that plain vector retrieval matches or beats standard GraphRAG on end-to-end QA while costing a fraction of the index build. If the finding holds beyond the authors’ benchmark set, the burden of proof flips: GraphRAG becomes something you justify with data from your own corpus, not the default upgrade path from naive RAG.

What the UnWeaver Paper Actually Claims

The paper, titled “UnWeaving the knots of GraphRAG — turns out VectorRAG is almost enough” (arXiv:2603.29875), is a direct challenge to the architecture’s value proposition. On end-to-end QA evaluation, the authors report that VectorRAG performs better than standard GraphRAG and nearly matches current state-of-the-art graph-based solutions, for a fraction of the cost. The paper’s mechanism, UnWeaver, doesn’t build an explicit graph at all. It disentangles document contents into entities via LLM and uses those entities as an intermediate way of recovering original text chunks, preserving fidelity to the source material. The idea is to keep vector-RAG speed while filtering noise that vanilla chunk-and-embed misses.
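
To make "entities as an intermediate way of recovering original text chunks" concrete, here is a toy sketch. It is not the authors' implementation: the data is hypothetical, the entity lists are hard-coded rather than extracted by an LLM, and a substring match stands in for embedding similarity. The point it illustrates is that the lookup goes query to entity to original chunk, so the text handed to the generator is the source text itself.

```python
from collections import defaultdict

# Hypothetical toy corpus: chunk ids mapped to their original text.
CHUNKS = {
    "c1": "Acme acquired Bolt Robotics in 2021 to expand its warehouse line.",
    "c2": "Bolt Robotics builds autonomous forklifts for cold storage.",
    "c3": "The quarterly report covers unrelated catering expenses.",
}

# Assumption: entities an LLM might extract per chunk (hard-coded here).
CHUNK_ENTITIES = {
    "c1": ["acme", "bolt robotics"],
    "c2": ["bolt robotics", "autonomous forklifts"],
    "c3": ["catering expenses"],
}


def build_entity_index(chunk_entities):
    """Invert chunk -> entities into entity -> set of chunk ids."""
    index = defaultdict(set)
    for chunk_id, entities in chunk_entities.items():
        for entity in entities:
            index[entity].add(chunk_id)
    return index


def retrieve(query, entity_index, chunks):
    """Match entities named in the query, then return the original chunk text."""
    q = query.lower()
    hit_ids = set()
    for entity, chunk_ids in entity_index.items():
        if entity in q:  # stand-in for similarity search over entity embeddings
            hit_ids |= chunk_ids
    return [chunks[cid] for cid in sorted(hit_ids)]


entity_index = build_entity_index(CHUNK_ENTITIES)
results = retrieve("What does Bolt Robotics make?", entity_index, CHUNKS)
for text in results:
    print(text)
```

Because one entity can point at several chunks, a single match pulls in related passages from across the corpus without any graph traversal.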

The GraphRAG-Bench team, whose benchmark paper was accepted at ICLR 2026, acknowledges the problem in their own framing: “recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks.” Their benchmark exists precisely to identify when graph structures provide measurable benefits, which is an implicit admission that the answer is “not always.”

Two caveats before running with the headline. First, the UnWeaver paper is a single preprint contesting a popular architecture; its claims have not yet been independently replicated. Second, the authors’ benchmark set, while broader than a single-domain test, is still one evaluation suite. Every UnWeaver comparison cited in this article comes from the authors’ own measurements, not from an independent reproduction.

The Complementarity Finding: RAG for Facts, GraphRAG for Reasoning

The most rigorous comparative work to date comes from Han et al. at arXiv:2502.11371 (authors from Michigan State, Oregon, UT Arlington, Meta, and IBM). Their systematic evaluation found no consistent winner. Instead, the two approaches show distinct strengths across different tasks and evaluation perspectives.

This is a more nuanced picture than the “GraphRAG is better” narrative that dominates vendor marketing, but it also complicates the “GraphRAG is unnecessary” reading of the UnWeaver paper. The two approaches capture different query distributions; which one wins depends on the workload.

Han et al. also flag evaluation biases: comparative accuracy numbers shift with the evaluation protocol, and LLM-as-judge scoring in particular is sensitive to the order in which candidate answers are presented. Any accuracy comparison that relies on LLM judging should be read with that caveat.

Where GraphRAG Still Wins: Aggregation and Cross-Document Computation

The cost debate has a blind spot: GraphRAG’s undisputed advantage is on queries that require computation across documents, not just retrieval from them.

AimMultiple’s benchmark on 3,904 Amazon electronics reviews makes this concrete. On aggregation queries (counting, grouping, summing across reviews), Graph RAG retrieves relevant results roughly 3× as often as Vector RAG (23% vs 8% retrieval accuracy). The advantage comes from pre-computed aggregation: the graph traverses entity relationships in a single query, returning results grouped by category. The vector index has no equivalent mechanism.

On specific document search, the roles reverse: Vector RAG outperformed Graph RAG 54% to 35% in the same benchmark. Graph RAG is best understood as a computation layer on top of vector search for aggregation-heavy workloads, not a replacement for it.

The AimMultiple benchmark is limited to one domain (electronics reviews) and one schema, so generalizing from it requires caution. Different corpora with different query distributions will produce different tradeoffs. But the structural point holds: if your queries involve “how many,” “which is the most,” or “summarize across all,” a graph with a query language attached has a real advantage that vector similarity search cannot replicate.
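
A small sketch shows why "how many" questions are structurally hard for top-k retrieval. The records and the battery-issue flag below are hypothetical, and slicing a list stands in for similarity ranking. A structured aggregation scans every record, while a top-k retriever passes only k records downstream, so any count taken from them is capped at k.

```python
from collections import Counter

# Hypothetical extracted records:
# (review_id, product_category, mentions_battery_issue)
REVIEWS = [
    (1, "headphones", True), (2, "headphones", True), (3, "headphones", False),
    (4, "laptops", True), (5, "laptops", True), (6, "laptops", True),
    (7, "laptops", True), (8, "cameras", False), (9, "cameras", True),
]


def count_by_category(records):
    """Structured aggregation: scans every record, like a GROUP BY."""
    return Counter(cat for _, cat, flagged in records if flagged)


def top_k_then_count(records, k):
    """Counts only what survives a top-k cut, as a vector retriever would."""
    matching = [r for r in records if r[2]][:k]  # stand-in for similarity ranking
    return Counter(cat for _, cat, _ in matching)


full = count_by_category(REVIEWS)
truncated = top_k_then_count(REVIEWS, k=3)

print("full scan:", dict(full))
print("top-3 only:", dict(truncated))
```

On this toy data the full scan identifies laptops as the category with the most battery complaints, while the top-3 view points at headphones. Raising k narrows the gap but never guarantees a complete count.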

The Cost Arithmetic: Build Overhead vs Marginal Recall Lift

GraphRAG’s construction pipeline carries what the UnWeaver authors characterize as “orders of magnitude increased componential complexity” compared to vector indexing, including named-entity recognition, triple extraction, entity resolution, and community summarization. This is not a controversial claim; it is a description of what the pipeline does.

Microsoft’s own GraphRAG repository warns that “GraphRAG indexing can be an expensive operation” and notes the project is “a demonstration and is not an officially supported Microsoft offering.” When the vendor tells you it’s expensive and unsupported, the default assumption should not be that it’s cheap and production-ready.

The cost has dropped. According to analysis on Medium’s Graph Praxis blog, indexing a 5 GB dataset fell from roughly $33,000 in early 2024 to about $33 by mid-2025, a reduction to 0.1% of the original cost. Impressive, but that $33 is still roughly an order of magnitude more than the “few dollars” the same analysis quotes for vector search on the same data. The cost cliff made GraphRAG affordable; it did not make it cheaper than vector indexing.

The economic question is not whether GraphRAG works. It is whether the marginal recall lift on your query distribution justifies roughly 10× the indexing cost plus the ongoing maintenance of an entity-resolution and community-summarization pipeline. For workloads dominated by factual retrieval, the UnWeaver evidence suggests the answer is no. For workloads heavy on aggregation and cross-document reasoning, the AimMultiple data suggests it may be yes.
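
The arithmetic behind these figures can be checked in a few lines. The $33,000 and $33 numbers come from the analysis cited above; the $3 vector cost is an assumption standing in for "few dollars", and the query volume, graph-query share, and lift passed to the helper are illustrative placeholders to replace with your own measurements. The helper counts index build cost only and ignores maintenance and query-time cost.

```python
GRAPH_COST_2024 = 33_000.0  # cited figure, early 2024
GRAPH_COST_2025 = 33.0      # cited figure, mid-2025
VECTOR_COST = 3.0           # assumption: one reading of "few dollars"

# Fraction of the original graph indexing cost that remains.
remaining_fraction = GRAPH_COST_2025 / GRAPH_COST_2024

# How many times more the graph index costs than the vector index.
graph_vs_vector = GRAPH_COST_2025 / VECTOR_COST


def extra_cost_per_helped_query(graph_cost, vector_cost, n_queries,
                                graph_share, lift):
    """Extra index dollars per query the graph gets right and vectors miss.

    graph_share: fraction of queries where the graph could help at all.
    lift: accuracy gain on those queries, as a fraction (0.15 = 15 points).
    """
    helped = n_queries * graph_share * lift
    if helped == 0:
        return float("inf")
    return (graph_cost - vector_cost) / helped


example = extra_cost_per_helped_query(
    GRAPH_COST_2025, VECTOR_COST,
    n_queries=10_000,   # placeholder query volume
    graph_share=0.2,    # placeholder: 20% aggregation queries
    lift=0.15,          # 23% vs 8% on aggregation queries in the cited benchmark
)

print(f"remaining cost: {remaining_fraction:.1%}")
print(f"graph vs vector: {graph_vs_vector:.0f}x")
print(f"extra dollars per helped query: {example:.2f}")
```

With these placeholder inputs the build cost per helped query is small, which suggests the decision is more likely to hinge on pipeline maintenance than on the one-time index bill.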

Hybrid Strategies: Routing Queries Instead of Picking a Side

Han et al. tested two hybrid approaches that avoid the binary choice entirely. Their Selection strategy routes each query to RAG or GraphRAG based on query type. Their Integration strategy combines evidence from both systems before generation. Both yielded consistent improvements across benchmarks, though the size of the lift is not quantified here.
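
One simple way to picture the Integration strategy is to interleave the ranked results of both retrievers and drop duplicates before building the prompt. This is a minimal sketch under that assumption, not the method from the paper; the function name `integrate` and the `limit` parameter are hypothetical.

```python
from itertools import zip_longest


def integrate(vector_hits, graph_hits, limit=5):
    """Interleave two ranked passage lists, keeping the first copy of each."""
    merged, seen = [], set()
    for pair in zip_longest(vector_hits, graph_hits):
        for passage in pair:
            if passage is not None and passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged[:limit]


context = integrate(["a", "b", "c"], ["b", "d"])
print(context)
```

Interleaving keeps the top result from each system near the front, so neither retriever can crowd the other out of a short context window.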

The practical implication: if you have already built both indexes, a router is cheaper to implement than re-architecting around one approach. The router does not need to be sophisticated. Query classification (factual vs. reasoning vs. aggregation) is a tractable problem, and misclassification degrades to the weaker system rather than failing outright.
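
A router of the unsophisticated kind described above could start as a keyword classifier. The patterns, bucket names, and the `classify` and `route` functions below are all illustrative assumptions; a production router would more likely use a small trained classifier or an LLM call, evaluated against labeled queries from your own traffic.

```python
import re

# Hypothetical cue phrases; tune these against your own query logs.
AGGREGATION = re.compile(
    r"\b(how many|count|total|average|most|least|across all|per category)\b"
)
REASONING = re.compile(
    r"\b(why|compare|relationship between|led to|because of|connected to)\b"
)


def classify(query):
    """Bucket a query as 'aggregation', 'reasoning', or 'factual'."""
    q = query.lower()
    if AGGREGATION.search(q):
        return "aggregation"
    if REASONING.search(q):
        return "reasoning"
    return "factual"


def route(query):
    """Send factual lookups to the vector index and everything else to the graph."""
    return "vector" if classify(query) == "factual" else "graph"


for q in [
    "How many reviews mention battery drain?",
    "Why did the 2021 recall happen?",
    "What is the warranty period for model X?",
]:
    print(f"{route(q):6s} <- {q}")
```

The default bucket is factual, so a query the patterns fail to recognize falls back to the cheaper vector path.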

The hybrid approach also sidesteps evaluation methodology concerns. A system that uses both retrievers and lets the query type determine the source is less vulnerable to any single evaluation protocol’s blind spots.

What This Means for Your RAG Roadmap

If your team is currently on vector RAG and considering GraphRAG as an upgrade, the UnWeaver paper gives you cover to demand proof rather than assuming the graph is better. Run a side-by-side evaluation on your actual query mix before committing to the build pipeline. The benchmark does not need to be elaborate. A few hundred representative queries, scored by both systems, will tell you more about your specific tradeoff than any vendor blog post.
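
A side-by-side evaluation of this kind can be very small. The sketch below assumes each retriever is a function that takes a query and returns a ranked list of chunk ids, and that you have hand-labeled gold chunk ids per query. The stub retrievers, query ids, and bucket names are hypothetical placeholders.

```python
def hit_rate(retrieve, labeled_queries, k=5):
    """Fraction of queries with at least one gold chunk id in the top-k results."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    hits = 0
    for query, gold_ids in labeled_queries:
        if set(retrieve(query)[:k]) & set(gold_ids):
            hits += 1
    return hits / len(labeled_queries)


def compare(systems, labeled_by_bucket, k=5):
    """Return {bucket: {system_name: hit_rate}} for every system and bucket."""
    return {
        bucket: {name: hit_rate(fn, queries, k) for name, fn in systems.items()}
        for bucket, queries in labeled_by_bucket.items()
    }


# Stub retrievers standing in for real vector and graph pipelines.
VECTOR_RESULTS = {"q1": ["c1", "c9"], "q2": ["c4"], "q3": ["c7"]}
GRAPH_RESULTS = {"q1": ["c8"], "q2": ["c4"], "q3": ["c3", "c5"]}
systems = {
    "vector": lambda q: VECTOR_RESULTS.get(q, []),
    "graph": lambda q: GRAPH_RESULTS.get(q, []),
}

# Hand-labeled gold chunk ids, grouped by query type.
labeled = {
    "factual": [("q1", ["c1"]), ("q2", ["c4"])],
    "aggregation": [("q3", ["c3"])],
}

report = compare(systems, labeled)
for bucket, scores in report.items():
    print(bucket, scores)
```

Reporting per bucket rather than one overall number matters here, because an aggregate score can hide a system that wins on one query type and loses on another.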

If you are already running GraphRAG in production, the cost question runs in the other direction: measure what fraction of your query volume actually exercises the graph’s advantages. If the majority of your traffic is factual retrieval where vectors match or outperform, you are paying graph-construction costs to serve queries that don’t need the graph. A router that sends only aggregation and multi-hop queries through the graph pipeline cuts graph query costs roughly in proportion to the traffic it diverts. Indexing costs fall only if you also shrink the graph to the part of the corpus those queries actually touch.

If you are building from scratch, start with vectors. The UnWeaver evidence, the Han et al. complementarity finding, and the AimMultiple retrieval numbers all point in the same direction: vector retrieval is the baseline, and the graph is an add-on for specific query types, not a replacement. Build the graph when you have evidence that your workload needs it, not because the vendor documentation puts it in the “advanced” column.

The honest answer to the title question is: it depends on your queries. But the prior has shifted. The default assumption is no longer that GraphRAG is better. The default assumption is that vector retrieval is good enough until you prove otherwise, and the graph has to earn its cost on your data, not on the vendor’s benchmark.

Frequently Asked Questions

Does the UnWeaver finding apply to multi-hop reasoning queries?

Probably not across the board. Han et al. (arXiv:2502.11371) found that RAG and GraphRAG target different query distributions: RAG wins on single-hop factual lookups, while GraphRAG outperforms on multi-hop reasoning and corpus-level summarization. UnWeaver evaluated end-to-end QA, so workloads dominated by multi-hop or cross-document reasoning may still justify the graph even if UnWeaver’s general QA numbers favor vectors.

Why should I distrust vendor-reported GraphRAG accuracy numbers?

Han et al. demonstrated that LLM-as-judge evaluations are highly sensitive to the presentation order of candidate answers, producing strong position effects that inflate whichever system appears first. Vendor benchmarks like the Analog AI comparison reporting LightRAG at ~95% correctness on multi-hop reasoning use self-selected test sets and LLM judges with no disclosed randomization of candidate order. Treat these as upper bounds, not settled comparisons.
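
A common guard against position effects is to ask the judge twice with the candidate order swapped and count only verdicts that survive the swap. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns 1 or 2 for the position of the preferred answer. The two toy judges exist only to exercise the logic; in practice the judge would wrap an LLM call.

```python
def debiased_preference(judge, question, answer_a, answer_b):
    """Return 'A', 'B', or 'tie' after judging both presentation orders."""
    first = judge(question, answer_a, answer_b)   # A shown in position 1
    second = judge(question, answer_b, answer_a)  # A shown in position 2
    a_wins_first = first == 1
    a_wins_second = second == 2
    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    return "tie"  # verdict flipped with the order, so it is not trusted


def always_first(question, first, second):
    """Toy judge with pure position bias: always prefers position 1."""
    return 1


def prefers_longer(question, first, second):
    """Toy judge with a content-based rule: prefers the longer answer."""
    return 1 if len(first) >= len(second) else 2


print(debiased_preference(always_first, "q", "short", "a much longer answer"))
print(debiased_preference(prefers_longer, "q", "short", "a much longer answer"))
```

A purely position-biased judge collapses to ties under this scheme, which makes the bias visible as a high tie rate instead of a spurious win for whichever system was listed first.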

How does UnWeaver’s approach differ from bolting entity extraction onto a vector pipeline?

UnWeaver does not construct a separate graph that requires traversal at query time. Per the paper’s description, an LLM disentangles documents into entities, and those entities serve as an intermediate step for recovering the original text chunks. Retrieval stays vector-based with no Cypher or graph-walk overhead, while the entity layer carries signal that plain chunk-and-embed misses. The tradeoff: you get richer retrieval without graph maintenance, but you lose the structured query capabilities (counting, grouping, traversal) that a true graph provides.

What query-mix split should I measure before committing to GraphRAG?

Classify your production queries into three buckets: factual lookup, aggregation or computation across documents, and multi-hop reasoning. The factual bucket is where vectors match or beat the graph per UnWeaver and the AimMultiple data. The aggregation bucket is where pre-computed graph traversals hold a structural advantage (23% vs 8% retrieval on aggregation queries in the AimMultiple benchmark). If more than roughly two-thirds of your traffic falls in the factual bucket, the graph’s indexing and entity-resolution overhead is unlikely to pay back.
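
The measurement itself is a few lines once queries are labeled. In this sketch the labels are assumed to come from hand annotation or a classifier you trust, the bucket names are hypothetical, and the two-thirds threshold is the rough rule of thumb stated above, not a validated cutoff.

```python
from collections import Counter

BUCKETS = ("factual", "aggregation", "multi_hop")


def query_mix(labels):
    """Return the share of traffic in each bucket as a fraction of the total."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {bucket: counts[bucket] / total for bucket in BUCKETS}


def graph_worth_evaluating(mix, factual_threshold=2 / 3):
    """True when factual traffic is at or below the rule-of-thumb threshold."""
    return mix["factual"] <= factual_threshold


# Hypothetical sample of 100 labeled production queries.
sample = ["factual"] * 70 + ["aggregation"] * 20 + ["multi_hop"] * 10
mix = query_mix(sample)

print(mix)
print("graph worth evaluating:", graph_worth_evaluating(mix))
```

A sample of a few hundred labeled queries is usually enough to see which side of the threshold a workload falls on.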