Topic

#kv-cache

4 articles exploring kv-cache. Expert insights and analysis from our editorial team.

Showing 1–4 of 4 articles

Articles

Infrastructure & Runtime

K-Token Merging Compresses Sequences in Latent Space, Lowering KV Cache Floors for 24GB and 48GB Cards

K-Token Merging compresses prompts in latent space before attention, cutting prefill KV cache 75% on 0.5B models and extending feasible context on 24GB and 48GB consumer GPUs.

Industry & Business

KV Packet's Recomputation-Free Cache Exposes a Gap in How Cloud AI Vendors Price Multi-Document RAG Inference

KV Packet proves near-zero-FLOPs context-independent KV reuse is achievable, exposing how prefix-only vendor caching tiers structurally exclude multi-document RAG.

Infrastructure & Runtime

vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe

vLLM v0.19 block preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but require experimental flags and carry unresolved tradeoffs.

Infrastructure & Runtime

KV Cache Is Becoming a Distributed Infrastructure Layer: What KV Packet and llm-d Mean for Self-Hosted LLM Teams

KV Packet eliminates cross-request recomputation; llm-d brings cache-aware routing to Kubernetes. Here's what both mean for vLLM capacity planning.

· 6 min read