Two events in the past month signal a structural shift in how self-hosted LLM inference clusters should be architected. The KV Packet paper (arXiv 2604.13226, April 14, 2026) proposes context-independent KV caching that eliminates recomputation entirely across requests. And on March 24, 2026, llm-d was donated to the CNCF Sandbox at KubeCon Europe, formalizing KV-cache-aware routing as a Kubernetes infrastructure primitive backed by IBM Research, Red Hat, Google Cloud, CoreWeave, and NVIDIA. The KV cache is no longer just a per-node optimization — it is becoming a distributed resource that must be managed at the cluster level.
Why KV Cache Is Outgrowing Per-Node Memory
Standard vLLM deployments treat the KV cache as a local concern: each pod manages its own GPU memory, requests are load-balanced across replicas without regard to what each pod has cached, and any repeated document is recomputed from scratch on whichever pod receives it. This works when requests are independent and cache reuse is incidental.
The model breaks under two conditions that are increasingly common: multi-tenant deployments where many users share a system prompt or RAG corpus, and long-context workloads where the prefill cost of repeated documents dominates latency. When those conditions hold, the per-node model wastes compute on recomputation and leaves GPU memory undersized relative to actual demand.
The infrastructure response has two layers: the algorithm level (can KV tensors be reused across requests without recomputing them?) and the routing level (can the scheduler route requests to pods with warm caches?). KV Packet addresses the first; llm-d addresses the second.
How KV Packet Eliminates Recomputation: The Soft-Token Adapter Mechanism
Standard prefix caching (the mechanism already in vLLM) can reuse KV states for the exact same prompt prefix. The limitation is that KV tensors are context-dependent: they encode not just the document itself but its position in the sequence. If a document appears at position 0 in one request and position 1,500 in another, the KV states are different and cannot be shared.
KV Packet sidesteps this by wrapping each cached document in a pair of trainable soft-token adapters — 8 “Header” vectors prepended and 8 “Trailer” vectors appended — trained offline via self-supervised knowledge distillation[^1]. These adapters condition the model to treat the cached document’s KV states as position-independent. The adapter storage cost is O(1): a few kilobytes per document, regardless of document length[^1].
The result is that a KV tensor computed once for a given document can be reused verbatim in any context, any position, by any pod that has loaded it. Cross-request, cross-session, and potentially cross-pod reuse all become feasible without touching the transformer forward pass for the cached content.
KV Packet Performance in Practice: What the Numbers Mean
According to the paper[^2], KV Packet reduces FLOPs by 5–6 orders of magnitude versus full recomputation, with a computed ratio of 6.50×10⁻⁶ to 1.04×10⁻⁵. In terms of time-to-first-token (TTFT), the paper reports:
- 1.36×–3.3× TTFT reduction on Biography and HotpotQA benchmarks against Llama-3.1 and Qwen2.5
- 19.45× TTFT reduction on Needle-in-a-Haystack
- 5.81× TTFT reduction on MusiQue
F1 scores are comparable to full recomputation on most tasks. The paper acknowledges a gap on the Qwen/MusiQue combination and describes it as a “favorable Pareto trade-off” between recomputation cost and answer quality[^2]. That characterization is reasonable for latency-sensitive deployments but warrants direct review of the paper if accuracy on multi-hop QA is a hard requirement.
KV Packet also outperforms selective-recompute methods (CacheBlend, EPIC, SAM-KV) on latency, while matching the No-Recompute baseline that represents the theoretical ceiling[^2].
What llm-d Does: KV-Cache-Aware Routing as a Kubernetes Primitive
llm-d, accepted into the CNCF Sandbox on March 24, 2026[^3], solves the routing half of the problem. Its core component is a KV-Cache Indexer that maintains a global, near-real-time view of KV block locality across a fleet of vLLM pods[^4]. The indexer consumes KVEvents — metadata about block creation and eviction — from each vLLM instance and scores candidate pods by cache-hit ratio when a new request arrives.
Integration into existing clusters runs through the Kubernetes Gateway API Inference Extension, implemented as an Endpoint Picker — the seam where llm-d plugs into vLLM deployments behind a Kubernetes ingress[^5].
Beyond cache-aware routing, llm-d also supports prefill/decode disaggregation (separating the prefill and generation phases across different pod pools) and cache-aware LoRA routing, which prevents redundant adapter kernel execution when multiple tenants use different fine-tuned variants of the same base model[^5].
llm-d v0.5: Three-Tier Offloading and What the Benchmarks Show
llm-d v0.5, released February 4, 2026, introduced three-tier hierarchical KV offloading: GPU memory → CPU RAM → filesystem[^6]. The motivation is straightforward: GPU memory is the most constrained resource in an inference cluster, and a KV cache that spills to CPU or NVMe retains reusability without consuming GPU headroom.
According to llm-d’s own benchmarks on Llama-3.1-70B with 250 concurrent users, KV offloading achieved approximately 185,000 tokens/sec, a 13.9× improvement over the GPU-only baseline[^6]. On Qwen3-32B across 8 vLLM pods on H100 GPUs, v0.5 routing delivered 4,500–11,000 output tokens/sec with P50 TTFT of 136–157 ms and 109% higher throughput versus baseline Kubernetes round-robin[^6]. On a Wide-EP topology spanning 32 B200 GPUs, the reported figure reaches approximately 50,000 output tokens/sec[^6].
These numbers are worth reading carefully. The 13.9× offloading gain and 109% throughput improvement are measured on specific high-end hardware topologies — 8 or more pods, H100 or B200 GPUs — that do not represent a typical self-hosted setup. A team running 2–3 A100 nodes will see different results. The llm-d benchmarks do not include smaller-cluster configurations, and extrapolating these figures to more modest deployments is unverified.
TurboQuant: Where KV Compression Fits in This Stack
TurboQuant (presented at ICLR 2026 by Google Research) occupies a different position in this architecture: it compresses KV tensors in memory rather than routing or reusing them. According to the Google Research blog[^7], TurboQuant achieves 6× KV cache memory reduction via 3-bit key / 2-bit value online vector quantization, with no training required and a claimed zero accuracy loss.
The “zero accuracy loss” claim should be verified against the ICLR paper directly. More immediately relevant for operators: no official Python implementation has been released as of April 2026; Google has indicated a Q2 2026 release[^7]. Until that ships, TurboQuant is not actionable.
If it does ship as described, TurboQuant would compose well with both KV Packet and llm-d: smaller KV tensors mean more cache fits in the same GPU memory tier, increasing cache hit rates for the llm-d router and potentially widening the practical window for KV Packet’s adapter-based reuse.
Capacity Planning Implications for vLLM Teams
The combined picture suggests a different set of questions for infrastructure planning than the current GPU-sizing calculus:
Routing intelligence, not just memory size, determines throughput. llm-d’s benchmarks show identical hardware producing 109% more throughput with cache-aware routing. Adding GPUs without addressing routing locality returns diminishing value.
Tiered storage changes the cost model. Three-tier offloading (GPU → CPU → filesystem) means effective KV cache size can exceed GPU VRAM. The relevant capacity question shifts from GPU memory size to cache hit rate at each tier and the latency cost of a miss.
KV Packet converts document caches from per-request costs to shared assets. For deployments with a stable shared corpus, offline adapter training is a one-time investment that removes prefill cost for those documents across all subsequent requests.
Workload locality determines whether any of this matters. Cache-aware routing only outperforms round-robin when requests have overlapping cache content. Deployments with low document repetition and diverse user prompts will see minimal benefit. The 109% throughput gain assumes meaningful locality; workloads without it will not replicate that figure.
What to Watch: CNCF Sandbox Caveats and the Adapter Training Gap
Two framing risks appear repeatedly in coverage of these projects and are worth naming directly.
The CNCF Sandbox designation explicitly means llm-d is “not yet widely tested in production” and “bleeding edge”[^3]. Sandbox is the lowest of three CNCF maturity tiers. The vendor framing — IBM, Red Hat, and Google positioning llm-d as a replacement for round-robin load balancing — reflects roadmap ambitions, not current production readiness. For teams evaluating a near-term deployment, this distinction matters.
For KV Packet, the gap between benchmarks and operational reality is the adapter training pipeline. The 19.45× TTFT reduction on Needle-in-a-Haystack assumes a pre-cached corpus with pre-trained adapters — getting there requires knowledge distillation over the target document set. Teams with frequently updated document stores need to factor in adapter retraining cadence as part of the operational model.
FAQ
Does llm-d require modifying vLLM itself?
llm-d integrates with vLLM through the Kubernetes Gateway API Inference Extension acting as an Endpoint Picker — a standard Kubernetes ingress extension point[^5]. vLLM must be configured to emit KVEvents (block creation and eviction metadata) that the KV-Cache Indexer consumes. This is a configuration change, not a fork or patch of vLLM, but it does require a vLLM version that supports KVEvent streaming.
Can KV Packet and llm-d be used together?
They target different layers and are not mutually exclusive. KV Packet eliminates recomputation when cached KV tensors are reused; llm-d routes requests to pods that already hold those tensors. Together, KV Packet expands reusable cache entries by making them context-independent while llm-d maximizes hit probability. The operational constraint is that KV Packet requires per-corpus adapter training before reuse is possible.
What happens to TurboQuant’s “zero accuracy loss” claim under independent evaluation?
As of April 2026, no independent replication has been published. The claim originates from Google Research’s ICLR 2026 paper and blog post[^7]; the “zero accuracy loss” characterization should be read against the specific benchmarks and precision levels defined in the paper. No official implementation is available yet for external testing.
Footnotes
[^1]: KV Packet arXiv abstract, arXiv 2604.13226, April 14, 2026. https://arxiv.org/abs/2604.13226
[^2]: KV Packet full paper, arXiv HTML, April 14, 2026. https://arxiv.org/html/2604.13226
[^3]: CNCF Blog, “Welcome llm-d to the CNCF,” March 24, 2026. https://www.cncf.io/blog/2026/03/24/welcome-llm-d-to-the-cncf-evolving-kubernetes-into-sota-ai-infrastructure/
[^4]: llm-d Documentation, KV Cache Architecture. https://llm-d.ai/docs/architecture/Components/kv-cache
[^5]: IBM Research Blog, “Donating llm-d to the Cloud Native Computing Foundation.” https://research.ibm.com/blog/donating-llm-d-to-the-cloud-native-computing-foundation
[^6]: llm-d Blog, “llm-d 0.5: Sustaining Performance at Scale,” February 4, 2026. https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale
[^7]: Google Research Blog, “TurboQuant: Redefining AI Efficiency with Extreme Compression.” https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/