Two events in the past month signal a structural shift in how self-hosted LLM inference clusters should be architected. The KV Packet paper (arXiv 2604.13226, April 14, 2026) proposes context-independent KV caching that eliminates recomputation entirely across requests. And on March 24, 2026, llm-d was donated to the CNCF Sandbox at KubeCon Europe, formalizing KV-cache-aware routing as a Kubernetes infrastructure primitive backed by IBM Research, Red Hat, Google Cloud, CoreWeave, and NVIDIA. The KV cache is no longer just a per-node optimization — it is becoming a distributed resource that must be managed at the cluster level.
Why KV Cache Is Outgrowing Per-Node Memory
Standard vLLM deployments treat the KV cache as a local concern: each pod manages its own GPU memory, requests are load-balanced across replicas without regard to what each pod has cached, and any repeated document is recomputed from scratch on whichever pod receives it. This works when requests are independent and cache reuse is incidental.
The model breaks under two conditions that are increasingly common: multi-tenant deployments where many users share a system prompt or RAG corpus, and long-context workloads where the prefill cost of repeated documents dominates latency. When those conditions hold, the per-node model wastes compute on recomputation and leaves GPU memory undersized relative to actual demand.
The infrastructure response has two layers: the algorithm level (can KV tensors be reused across requests without recomputing them?) and the routing level (can the scheduler route requests to pods with warm caches?). KV Packet addresses the first; llm-d addresses the second.
How KV Packet Eliminates Recomputation: The Soft-Token Adapter Mechanism
Standard prefix caching (the mechanism already in vLLM) can reuse KV states for the exact same prompt prefix. The limitation is that KV tensors are context-dependent: they encode not just the document itself but its position in the sequence. If a document appears at position 0 in one request and position 1,500 in another, the KV states are different and cannot be shared.
KV Packet sidesteps this by wrapping each cached document in a pair of trainable soft-token adapters — 8 “Header” vectors prepended and 8 “Trailer” vectors appended — trained offline via self-supervised knowledge distillation[^1]. These adapters condition the model to treat the cached document’s KV states as position-independent. The adapter storage cost is O(1): a few kilobytes per document, regardless of document length[^1].
The result is that a KV tensor computed once for a given document can be reused verbatim in any context, any position, by any pod that has loaded it. Cross-request, cross-session, and potentially cross-pod reuse all become feasible without touching the transformer forward pass for the cached content.
KV Packet Performance in Practice: What the Numbers Mean
According to the paper[^2], KV Packet reduces FLOPs by 5–6 orders of magnitude versus full recomputation, with a computed ratio of 6.50×10⁻⁶ to 1.04×10⁻⁵. In terms of time-to-first-token (TTFT), the paper reports:
- 1.36×–3.3× TTFT reduction on Biography and HotpotQA benchmarks against Llama-3.1 and Qwen2.5
- 19.45× TTFT reduction on Needle-in-a-Haystack
- 5.81× TTFT reduction on MusiQue
F1 scores are comparable to full recomputation on most tasks. The paper acknowledges a gap on the Qwen/MusiQue combination and describes it as a “favorable Pareto trade-off” between recomputation cost and answer quality[^2]. That characterization is reasonable for latency-sensitive deployments but warrants direct review of the paper if accuracy on multi-hop QA is a hard requirement.
KV Packet also outperforms selective-recompute methods (CacheBlend, EPIC, SAM-KV) on latency, while matching the No-Recompute baseline that represents the theoretical ceiling[^2].
What llm-d Does: KV-Cache-Aware Routing as a Kubernetes Primitive
llm-d, accepted into the CNCF Sandbox on March 24, 2026[^3], solves the routing half of the problem. Its core component is a KV-Cache Indexer that maintains a global, near-real-time view of KV block locality across a fleet of vLLM pods[^4]. The indexer consumes KVEvents — metadata about block creation and eviction — from each vLLM instance and scores candidate pods by cache-hit ratio when a new request arrives.
Integration into existing clusters runs through the Kubernetes Gateway API Inference Extension, implemented as an Endpoint Picker — the seam where llm-d plugs into vLLM deployments behind a Kubernetes ingress[^5].
Beyond cache-aware routing, llm-d also supports prefill/decode disaggregation (separating the prefill and generation phases across different pod pools) and cache-aware LoRA routing, which prevents redundant adapter kernel execution when multiple tenants use different fine-tuned variants of the same base model[^5].
llm-d v0.5: Three-Tier Offloading and What the Benchmarks Show
llm-d v0.5, released February 4, 2026, introduced three-tier hierarchical KV offloading: GPU memory → CPU RAM → filesystem[^6]. The motivation is straightforward: GPU memory is the most constrained resource in an inference cluster, and a KV cache that spills to CPU or NVMe retains reusability without consuming GPU headroom.
According to llm-d’s own benchmarks on Llama-3.1-70B with 250 concurrent users, KV offloading achieved approximately 185,000 tokens/sec, a 13.9× improvement over the GPU-only baseline[^6]. On Qwen3-32B across 8 vLLM pods on H100 GPUs, v0.5 routing delivered 4,500–11,000 output tokens/sec with P50 TTFT of 136–157 ms and 109% higher throughput versus baseline Kubernetes round-robin[^6]. On a Wide-EP topology spanning 32 B200 GPUs, the reported figure reaches approximately 50,000 output tokens/sec[^6].
These numbers are worth reading carefully. The 13.9× offloading gain and 109% throughput improvement are measured on specific high-end hardware topologies — 8 or more pods, H100 or B200 GPUs — that do not represent a typical self-hosted setup. A team running 2–3 A100 nodes will see different results. The llm-d benchmarks do not include smaller-cluster configurations, and extrapolating these figures to more modest deployments is unverified.
TurboQuant: Where KV Compression Fits in This Stack
TurboQuant (presented at ICLR 2026 by Google Research) occupies a different position in this architecture: it compresses KV tensors in memory rather than routing or reusing them. According to the Google Research blog[^7], TurboQuant achieves 6× KV cache memory reduction via 3-bit key / 2-bit value online vector quantization, with no training required and a claimed zero accuracy loss.
The “zero accuracy loss” claim should be verified against the ICLR paper directly. More immediately relevant for operators: no official Python implementation has been released as of April 2026; Google has indicated a Q2 2026 release[^7]. Until that ships, TurboQuant is not actionable.
If it does ship as described, TurboQuant would compose well with both KV Packet and llm-d: smaller KV tensors mean more cache fits in the same GPU memory tier, increasing cache hit rates for the llm-d router and potentially widening the practical window for KV Packet’s adapter-based reuse.
Capacity Planning Implications for vLLM Teams
The combined picture suggests a different set of questions for infrastructure planning than the current GPU-sizing calculus:
Routing intelligence, not just memory size, determines throughput. llm-d’s benchmarks show identical hardware producing 109% more throughput with cache-aware routing. Adding GPUs without addressing routing locality returns diminishing value.
Tiered storage changes the cost model. Three-tier offloading (GPU → CPU → filesystem) means effective KV cache size can exceed GPU VRAM. The relevant capacity question shifts from GPU memory size to cache hit rate at each tier and the latency cost of a miss.
KV Packet converts document caches from per-request costs to shared assets. For deployments with a stable shared corpus, offline adapter training is a one-time investment that removes prefill cost for those documents across all subsequent requests.
Workload locality determines whether any of this matters. Cache-aware routing only outperforms round-robin when requests have overlapping cache content. Deployments with low document repetition and diverse user prompts will see minimal benefit. The 109% throughput gain assumes meaningful locality; workloads without it will not replicate that figure.
What to Watch: CNCF Sandbox Caveats and the Adapter Training Gap
Two framing risks appear repeatedly in coverage of these projects and are worth naming directly.
The CNCF Sandbox designation explicitly means llm-d is “not yet widely tested in production” and “bleeding edge”[^3]. Sandbox is the lowest of three CNCF maturity tiers. The vendor framing — IBM, Red Hat, and Google positioning llm-d as a replacement for round-robin load balancing — reflects roadmap ambitions, not current production readiness. For teams evaluating a near-term deployment, this distinction matters.
For KV Packet, the gap between benchmarks and operational reality is the adapter training pipeline. The 19.45× TTFT reduction on Needle-in-a-Haystack assumes a pre-cached corpus with pre-trained adapters — getting there requires knowledge distillation over the target document set. Teams with frequently updated document stores need to factor in adapter retraining cadence as part of the operational model.
FAQ
Does llm-d require modifying vLLM itself?
llm-d integrates with vLLM through the Kubernetes Gateway API Inference Extension acting as an Endpoint Picker — a standard Kubernetes ingress extension point[^5]. vLLM must be configured to emit KVEvents (block creation and eviction metadata) that the KV-Cache Indexer consumes. This is a configuration change, not a fork or patch of vLLM, but it does require a vLLM version that supports KVEvent streaming.
Can KV Packet and llm-d be used together?
They target different layers and are not mutually exclusive. KV Packet eliminates recomputation when cached KV tensors are reused; llm-d routes requests to pods that already hold those tensors. Together, KV Packet expands reusable cache entries by making them context-independent while llm-d maximizes hit probability. The operational constraint is that KV Packet requires per-corpus adapter training before reuse is possible.
What happens to TurboQuant’s “zero accuracy loss” claim under independent evaluation?
As of April 2026, no independent replication has been published. The claim originates from Google Research’s ICLR 2026 paper and blog post[^7]; the “zero accuracy loss” characterization should be read against the specific benchmarks and precision levels defined in the paper. No official implementation is available yet for external testing.
Footnotes
[^1]: KV Packet arXiv abstract, arXiv 2604.13226, April 14, 2026. https://arxiv.org/abs/2604.13226
[^2]: KV Packet full paper, arXiv HTML, April 14, 2026. https://arxiv.org/html/2604.13226
[^3]: CNCF Blog, “Welcome llm-d to the CNCF,” March 24, 2026. https://www.cncf.io/blog/2026/03/24/welcome-llm-d-to-the-cncf-evolving-kubernetes-into-sota-ai-infrastructure/
[^4]: llm-d Documentation, KV Cache Architecture. https://llm-d.ai/docs/architecture/Components/kv-cache
[^5]: IBM Research Blog, “Donating llm-d to the Cloud Native Computing Foundation.” https://research.ibm.com/blog/donating-llm-d-to-the-cloud-native-computing-foundation
[^6]: llm-d Blog, “llm-d 0.5: Sustaining Performance at Scale,” February 4, 2026. https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale
[^7]: Google Research Blog, “TurboQuant: Redefining AI Efficiency with Extreme Compression.” https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/