Topic
#LLM serving
3 articles exploring LLM serving. Expert insights and analysis from our editorial team.
Articles
Infrastructure & Runtime
K-Token Merging Compresses Sequences in Latent Space, Lowering KV Cache Floors for 24GB and 48GB Cards
K-Token Merging compresses prompts in latent space before attention, cutting prefill KV cache by 75% on 0.5B models and extending feasible context on 24GB and 48GB consumer GPUs.
Industry & Business
KV Packet's Recomputation-Free Cache Exposes a Gap in How Cloud AI Vendors Price Multi-Document RAG Inference
KV Packet demonstrates that near-zero-FLOP, context-independent KV reuse is achievable, exposing how prefix-only vendor caching tiers structurally exclude multi-document RAG workloads.
Infrastructure & Runtime
Prefill-Decode Disaggregation: The Architecture Shift Redefining LLM Serving at Scale
Prefill-decode disaggregation runs compute-bound prefill and memory-bound decode on dedicated hardware pools, eliminating interference between the two phases.