Cost-Aware RAG Routing: When Deeper Retrieval Stops Paying Off

Most RAG deployments treat retrieval depth as a static config value. Set top_k=10, ship it, move on. The assumption baked into that default is that more retrieved context cannot hurt answer quality, so the only question is how much latency you’re willing to tolerate. CA-RAG (Cost-Aware Query Routing in RAG), a preprint posted to arXiv cs.IR on June 3, 2026, measures what actually happens when you crank that knob: past a query-dependent threshold, pulling more chunks raises token cost and latency without improving answers, and on some query classes it degrades them. The paper puts numbers on a tradeoff that production teams have been guessing at, and the numbers argue for a routing layer that decides retrieval depth per query rather than treating it as a global constant.

Why Cranking top-k Hurts

The intuition behind high top-k is straightforward: more context gives the generator more material to work with. In practice, two things erode that assumption. First, every additional retrieved chunk adds tokens to the prompt, and every token costs money. Second, language models do not reliably distinguish signal from noise across long retrieved contexts; additional low-relevance chunks can dilute attention on the actually useful passages.

CA-RAG’s per-query delta analysis makes this concrete. Definitional queries like “What is RAG?” gain no quality from heavy retrieval, yet they pay the full token cost of embedding the query, retrieving ten chunks, and injecting them into the prompt. Analytical prompts, conversely, are underserved by shallow retrieval. A single static top-k cannot resolve this across a heterogeneous workload because the queries that waste budget under heavy retrieval are not the same queries that suffer under light retrieval.

The Bundle Catalog and Utility Function

CA-RAG frames the routing problem as a choice among four fixed strategy bundles, each a complete retrieval-generation pipeline:

direct_llm: no retrieval. The query goes straight to the generator with no context injection.
light_rag: minimal retrieval depth.
medium_rag: moderate retrieval depth.
heavy_rag: top-k=10 dense retrieval, the default production setting most teams ship with.

All four bundles share a fixed generation profile. The only variable is how much retrieved context the prompt contains.

The router picks a bundle per query by maximizing a scalar utility function:

U_b = w_Q · Q̂_b(q) − w_L · L̂_b^norm − w_C · Ĉ_b^norm

The default weights privilege quality over latency and cost. The sensitivity analysis confirms that shifting the weight profile toward latency or cost moves the operating point without redesigning the bundles, and the same four bundles serve multiple cost-latency-quality regimes through weight adjustment alone.
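A minimal Python sketch of this selection rule follows, assuming each bundle carries a quality estimate plus latency and cost estimates already normalized to [0, 1]. Only w_L = 0.2 comes from the defaults cited in this article; the w_Q and w_C values, the per-bundle latency and cost figures, and the quality numbers are illustrative placeholders, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BundleEstimate:
    name: str
    quality: float  # estimated answer quality for this query, assumed in [0, 1]
    latency: float  # normalized latency estimate, assumed in [0, 1]
    cost: float     # normalized cost estimate, assumed in [0, 1]


def utility(b: BundleEstimate, w_q: float, w_l: float, w_c: float) -> float:
    """U_b = w_Q * quality - w_L * latency - w_C * cost."""
    return w_q * b.quality - w_l * b.latency - w_c * b.cost


def pick_bundle(estimates, w_q=0.6, w_l=0.2, w_c=0.2):
    """Return the bundle with the highest utility.

    w_l=0.2 matches the default cited in the article; w_q and w_c are assumed.
    """
    return max(estimates, key=lambda b: utility(b, w_q, w_l, w_c))


# Hypothetical (latency, cost) per bundle; only quality varies with the query.
FIXED = {"direct_llm": (0.10, 0.05), "light_rag": (0.30, 0.30),
         "medium_rag": (0.55, 0.60), "heavy_rag": (0.90, 1.00)}


def estimates_for(quality_by_bundle):
    """Build one BundleEstimate per bundle from per-query quality estimates."""
    return [BundleEstimate(n, quality_by_bundle[n], *FIXED[n]) for n in FIXED]


definitional = estimates_for({"direct_llm": 0.70, "light_rag": 0.75,
                              "medium_rag": 0.78, "heavy_rag": 0.80})
analytical = estimates_for({"direct_llm": 0.30, "light_rag": 0.50,
                            "medium_rag": 0.70, "heavy_rag": 0.98})

print(pick_bundle(definitional).name)  # direct_llm
print(pick_bundle(analytical).name)    # heavy_rag
```

With these made-up numbers, the flat quality curve of the definitional query lets the latency and cost penalties dominate, while the steep quality curve of the analytical query justifies the heavy bundle.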

What the Numbers Say

On a 28-query benchmark using FAISS-backed dense retrieval with OpenAI embedding and chat APIs, the CA-RAG router achieved:

26% fewer billed tokens than always-heavy retrieval.
34% lower mean latency than always-direct inference.
Equivalent answer quality to the heavy-retrieval baseline.

The 26% token savings and 34% latency reduction are not uniformly distributed. They concentrate on simpler queries where heavy retrieval was overspending. Complex analytical queries still receive heavy retrieval under the router, so quality does not degrade on the queries that need the context.

The cost model accounts for total billed tokens per query, tying retrieval depth directly to API cost. This includes embedding compute at query time, a cost most analyses omit but managed vector-search providers charge for.
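As a rough illustration of that accounting, the sketch below sums prompt, completion, and embedding tokens into one billed total and converts it to a dollar figure. The per-1k-token prices, the 300-token chunk size, and the token counts are hypothetical inputs chosen for the example, not figures from the paper or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    prompt: int      # query plus injected context sent to the chat model
    completion: int  # tokens generated by the chat model
    embedding: int   # tokens embedded at query time (0 when retrieval is skipped)

    @property
    def billed(self) -> int:
        """Total billed tokens for one query."""
        return self.prompt + self.completion + self.embedding


def query_cost_usd(usage: TokenUsage, chat_in_per_1k: float,
                   chat_out_per_1k: float, embed_per_1k: float) -> float:
    """Dollar cost of one query given per-1k-token prices (all hypothetical)."""
    return (usage.prompt * chat_in_per_1k
            + usage.completion * chat_out_per_1k
            + usage.embedding * embed_per_1k) / 1000


# Assumed shapes: a 40-token question, 200-token answer, ten 300-token chunks.
direct = TokenUsage(prompt=40, completion=200, embedding=0)
heavy = TokenUsage(prompt=40 + 10 * 300, completion=200, embedding=12)

print(direct.billed, heavy.billed)  # 240 3252
```

Under these assumptions the prompt term dominates the heavy bundle's bill, which is why retrieval depth maps so directly onto API cost.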

The engineering difficulty is not in the utility function; it is in the signal that feeds it. The router needs to estimate quality, latency, and cost for each bundle before retrieval happens, using only the query text.

CA-RAG derives a heuristic complexity score from features of the query itself:

Query length, as a proxy for retrieval scope.
Interrogative cue-word count, as a proxy for structural complexity.

This score modulates the quality priors for each bundle without requiring an additional LLM call. The design choice is deliberate: any pre-retrieval signal that itself requires an LLM invocation would eat into the cost savings it is trying to enable.
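One way such a score could be computed is sketched below in Python. The 0.6 and 0.4 weights and the caps of 20 tokens and 3 cue words are the ones this article reports for the heuristic; the cue-word list, the whitespace tokenization, and the division by each cap to normalize into [0, 1] are assumptions made for illustration.

```python
# Assumed cue-word list; the article does not enumerate the paper's list.
CUE_WORDS = {"what", "why", "how", "when", "where", "which", "who"}


def complexity_score(query: str) -> float:
    """Heuristic complexity in [0, 1] computed from the query text alone."""
    # Whitespace splitting is a stand-in for whatever tokenizer the paper uses.
    tokens = [t.strip("?.,!:;\"'“”") for t in query.lower().split()]
    length_term = min(len(tokens), 20) / 20
    cue_term = min(sum(t in CUE_WORDS for t in tokens), 3) / 3
    return 0.6 * length_term + 0.4 * cue_term


print(round(complexity_score("What is RAG?"), 3))       # 0.223
print(round(complexity_score("Explain the proof"), 3))  # 0.09
```

The function needs no model call, so its cost is negligible next to the retrieval and generation it gates. The second example also shows the weakness: a short imperative with no interrogative word scores near the floor regardless of how much context the answer needs.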

The tradeoff is brittleness. A word-count heuristic calibrated on English technical queries will not transfer cleanly to domain-specific jargon, multi-language workloads, or queries where complexity is structural rather than lexical. A four-word query like “Explain the proof” reads as low-complexity by this metric but demands heavy retrieval in most corpora. The paper does not evaluate these edge cases.

The Billing Implication for Managed Vector Search

For teams running RAG on managed vector-search endpoints, the per-query cost structure that CA-RAG optimizes maps directly to vendor pricing. Pinecone, Weaviate, and similar services typically bill on some combination of stored vectors, query volume, and retrieved data volume, with the details varying by vendor and plan. Where pricing scales with the data returned, a top-k=10 retrieval costs more than a top-k=3 retrieval, and every retrieved chunk adds prompt tokens regardless of vendor. A routing layer that skips retrieval entirely on easy queries, and caps depth on medium-complexity queries, can change the unit economics of a RAG endpoint more than an embedding-model swap.

The magnitude depends on query-mix composition. If a substantial fraction of production queries are definitional or simple lookup, a corresponding share of retrieval spend is recoverable with no quality loss. The paper’s 26% aggregate savings imply a workload skewed toward simpler queries, which aligns with typical enterprise RAG deployments where the query distribution is heavy-tailed toward straightforward lookups.
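A back-of-envelope Python sketch makes the dependence on query mix explicit. The query classes, their shares, and the mean billed tokens per class are invented for illustration, and the sketch assumes the always-heavy baseline costs the same number of tokens for every class.

```python
def token_savings(mix, routed_tokens, heavy_tokens):
    """Fraction of billed tokens saved by routing versus always-heavy retrieval.

    mix:           {query_class: share of traffic}, shares must sum to 1
    routed_tokens: {query_class: mean billed tokens under the router}
    heavy_tokens:  mean billed tokens per query under always-heavy retrieval
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("query-mix shares must sum to 1")
    routed = sum(share * routed_tokens[cls] for cls, share in mix.items())
    return 1.0 - routed / heavy_tokens


# Hypothetical workload: analytical queries still get the heavy bundle.
mix = {"definitional": 0.3, "lookup": 0.3, "analytical": 0.4}
routed = {"definitional": 300, "lookup": 1200, "analytical": 3200}

print(round(token_savings(mix, routed, heavy_tokens=3200), 3))
```

If every query were analytical, the same function would return zero: the router saves nothing on traffic that needs the full retrieval depth.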

Converging Work on Routing and Caching

CA-RAG is not the only routing paper landing in this window. Two independent efforts address adjacent parts of the same cost-control problem.

The vLLM Semantic Router (arXiv:2603.04444, updated to v4 as of June 2026) composes 13 heterogeneous signal types through Boolean decision rules into deployment-specific routing policies. It includes a three-stage HaluGate hallucination detection pipeline and targets model selection across Mixture-of-Modality deployments. Where CA-RAG routes on retrieval depth, vLLM Semantic Router routes on model choice. The two axes are complementary: a production system could route both retrieval depth and model selection simultaneously.

GroundedCache (arXiv:2605.27494) addresses a different cost lever: answer reuse. Its evidence-validated cache router runs four gates (query similarity, evidence overlap, source-version validity, and lexical support) before serving a cached answer. On Qwen2.5-7B-Instruct with vLLM, it drives the unsafe-served rate to 0.0% on HotpotQA versus 15-35% for naive caching, and to 1.5% on mtRAG document drift versus 51.5% for naive caching, with p50 latency within 1.04-1.07x of the no-cache baseline. The implication: caching and retrieval-depth routing can stack, and neither subsumes the other.

Caveats and What Is Missing

The CA-RAG evaluation has a deliberate limitation. The benchmark uses a small fixed corpus. The query set is 28 questions. The paper acknowledges this as a design choice to isolate routing behavior from corpus-scale retrieval effects, but it limits generalizability in two ways: retrieval quality at k=10 over a million-chunk corpus behaves differently than over a small one, and the router’s complexity heuristic was validated on a narrow query distribution.

The evaluation uses OpenAI chat and embedding APIs. Models with different context windows or attention patterns over retrieved passages may show different quality curves as top-k increases. The results should be read as measuring the routing mechanism, not as a universal statement about retrieval depth across all model generations.

The quality priors are hand-specified. A production deployment would need to log per-query outcomes and recalibrate the priors from telemetry, which means the routing system needs a feedback loop the paper does not build. The framework is there; the operational plumbing is not.

What the paper does establish is a framing that production teams should adopt: retrieval depth is a per-query decision, not a global config value, and the cost function that governs it includes token billing terms that scale directly with every additional retrieved chunk. The router does not need to be perfect to pay for itself. It needs to beat the fixed-top-k baseline, and on this benchmark it does.

Frequently Asked Questions

How would a team build the feedback loop the paper leaves unbuilt?

CA-RAG frames routing as a contextual-bandits problem but runs with exploration disabled and hand-specified quality priors. A production deployment would need to log which bundle was selected per query, record a quality outcome (human rating or automated metric), and feed those reward signals back into the prior estimates. The generation profile in the paper is fixed at 256 max output tokens and temperature 0 with tiktoken for counting, so the telemetry schema is straightforward: query text, chosen bundle, billed tokens split into prompt, completion, and embedding components, and a quality score.
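The telemetry record described above could look like the following Python sketch. The field names and the JSON-lines serialization are assumptions, and the exponential-moving-average update is a simple stand-in for prior recalibration, not a method taken from the paper.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RoutingRecord:
    query: str
    bundle: str             # e.g. "direct_llm", "light_rag", "medium_rag", "heavy_rag"
    prompt_tokens: int
    completion_tokens: int
    embedding_tokens: int
    quality: float          # human rating or automated metric, assumed in [0, 1]


def to_jsonl(record: RoutingRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), ensure_ascii=False)


def update_prior(prior: float, observed_quality: float, step: float = 0.1) -> float:
    """Nudge a bundle's quality prior toward an observed outcome (assumed EMA rule)."""
    return prior + step * (observed_quality - prior)


record = RoutingRecord("What is RAG?", "direct_llm", 12, 180, 0, 0.9)
print(to_jsonl(record))
```

Splitting billed tokens into three fields keeps the log usable for both purposes: the quality score feeds the priors, and the token split feeds the cost estimates.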

Can retrieval-depth routing and semantic caching stack on the same pipeline?

Yes, and they target separate cost drivers. CA-RAG reduces spend on queries that do reach retrieval. GroundedCache eliminates retrieval entirely on cache hits by running four validation gates (query similarity, evidence overlap, source-version validity, lexical support) before serving a stored answer. A combined pipeline would route through the cache first; on a miss, the retrieval-depth router would then select a bundle. GroundedCache’s measured p50 latency overhead of 1.04 to 1.07x over a no-cache baseline is small enough that it does not negate CA-RAG’s latency gains.
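The control flow of such a combined pipeline is short enough to sketch in Python. The three callables are hypothetical hooks: cache_lookup stands in for a validated cache that returns a stored answer only when all of its gates pass, choose_bundle for the retrieval-depth router, and run_bundle for the retrieval-generation pipeline. None of them is an API from either paper.

```python
from typing import Callable, Optional, Tuple


def answer(query: str,
           cache_lookup: Callable[[str], Optional[str]],
           choose_bundle: Callable[[str], str],
           run_bundle: Callable[[str, str], str]) -> Tuple[str, str]:
    """Return (answer, source), where source is "cache" or the bundle name."""
    cached = cache_lookup(query)  # None means a miss or a failed validation gate
    if cached is not None:
        return cached, "cache"
    bundle = choose_bundle(query)  # only cache misses reach the depth router
    return run_bundle(bundle, query), bundle


# Toy stand-ins so the flow can be exercised end to end.
store = {"what is rag?": "Retrieval-augmented generation."}
result = answer(
    "Compare Adam and LAMB",
    cache_lookup=lambda q: store.get(q.lower()),
    choose_bundle=lambda q: "heavy_rag" if len(q.split()) > 3 else "light_rag",
    run_bundle=lambda b, q: f"[{b}] answer to: {q}",
)
print(result[1])  # heavy_rag
```

Putting the cache first means the router's estimates are only computed for queries that will actually incur retrieval and generation cost.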

What specific queries would misroute under the complexity heuristic?

The heuristic weights query length at 0.6 (capped at 20 tokens) and interrogative cue-word count at 0.4 (capped at 3). A query like “Compare the convergence guarantees of Adam and LAMB” scores low on cue words (zero interrogatives) despite requiring multi-document retrieval and cross-referencing. Code debugging prompts (“Why does this segfault?”) score one cue word but short length, routing them toward light retrieval when they often need dense context. Multi-hop reasoning questions suffer similarly: their complexity is structural, not lexical.

Do the savings hold with frontier models that have larger context windows?

The benchmark ran on gpt-3.5-turbo, a model with a 4k-token context window. Models with 128k+ windows (Claude, GPT-4o) may show different attenuation patterns: if they handle long retrieved context with less of the dilution CA-RAG measures, the quality gap between heavy and light retrieval bundles could compress. Conversely, if noise chunks still capture attention proportionally in longer contexts, the penalty from over-retrieval could be larger because there are more tokens to distract from signal. The paper does not test across model generations.

What weight configuration would a latency-sensitive deployment use?

The sensitivity analysis in the paper shows that setting w_L to 0.5 (up from the default 0.2) and correspondingly reducing w_Q and w_C shifts the router toward faster bundles without changing the bundle catalog. A cost-sensitive configuration would instead raise w_C to 0.5. Both profiles still use the same four bundles; the operating point moves through weight adjustment alone. This matters operationally because a team can tune the router’s behavior by changing three numbers in a config file rather than redefining retrieval strategies.
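As a sketch of what that config change might look like, the Python helper below raises one weight to a target value and rescales the other two proportionally. The w_L default of 0.2 is from the text above; the w_Q and w_C defaults, the constraint that weights sum to 1, and the proportional rescaling rule are assumptions, since the article does not state how the paper reduces the remaining weights.

```python
def shift_weights(weights: dict, key: str, target: float) -> dict:
    """Set weights[key] to target and rescale the rest so the total stays 1."""
    rest = {k: v for k, v in weights.items() if k != key}
    scale = (1.0 - target) / sum(rest.values())
    shifted = {k: v * scale for k, v in rest.items()}
    shifted[key] = target
    return shifted


# w_L = 0.2 is the cited default; w_Q and w_C are assumed for illustration.
DEFAULT = {"w_Q": 0.6, "w_L": 0.2, "w_C": 0.2}

latency_profile = shift_weights(DEFAULT, "w_L", 0.5)
cost_profile = shift_weights(DEFAULT, "w_C", 0.5)

print({k: round(v, 3) for k, v in latency_profile.items()})
print({k: round(v, 3) for k, v in cost_profile.items()})
```

Either profile is a three-number change; the bundle definitions and the router code stay untouched.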