#vllm
6 articles exploring vLLM. Expert insights and analysis from our editorial team.
Articles
KServe + llm-d Claims 57× P90 TTFT. RC1 Ships with a Routing Deadlock and No Migration Guide
Red Hat's KServe + llm-d integration claims 57× P90 TTFT gains against an unoptimized vLLM baseline, but RC1 ships with a known routing deadlock, a prematurely merged WIP patch, and no migration guide.
UCCL-Zip: Lossless Compression for NCCL, 47.5% Faster RL Sync, 10% Lower vLLM Latency
UCCL-Zip fuses lossless compression into NCCL and GPU P2P transfers, cutting RL weight sync by 47.5% and vLLM latency by 10% with no API changes and bit-identical outputs.
LACE Forces vLLM and SGLang to Rethink How Parallel Reasoning Threads Run
LACE lets parallel reasoning threads share state mid-inference, yielding 3-7 point accuracy gains but forcing vLLM and SGLang to abandon independent-sequence batching.
vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe
vLLM v0.19 block preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but both require experimental flags and carry unresolved issues.
KV Cache Is Becoming a Distributed Infrastructure Layer: What KV Packet and llm-d Mean for Self-Hosted LLM Teams
KV Packet eliminates cross-request recomputation; llm-d brings cache-aware routing to Kubernetes. Here's what both mean for vLLM capacity planning.
The Complete Guide to Local LLMs in 2026
Why [running AI on your own hardware](/articles/vllm-block-level-preemption-and-flexkv-shift-the-long-context-bottleneck-from/) is becoming the default choice for privacy-conscious developers and enterprises alike.