Topic

#vllm

6 articles exploring vLLM. Expert insights and analysis from our editorial team.


Articles

Infrastructure & Runtime

KServe + llm-d Claims 57× P90 TTFT. RC1 Ships with a Routing Deadlock and No Migration Guide

Red Hat's KServe + llm-d integration claims 57× P90 TTFT gains against an unoptimized vLLM baseline, but RC1 ships with a known routing deadlock, a prematurely merged WIP change, and no migration guide.

Infrastructure & Runtime

UCCL-Zip: Lossless Compression for NCCL, 47.5% Faster RL Sync, 10% Lower vLLM Latency

UCCL-Zip fuses lossless compression into NCCL and GPU P2P transfers, cutting RL weight sync by 47.5% and vLLM latency by 10% with no API changes and bit-identical outputs.

Developer Tools

LACE Forces vLLM and SGLang to Rethink How Parallel Reasoning Threads Run

LACE lets parallel reasoning threads share state mid-inference, yielding 3-7 point accuracy gains but forcing vLLM and SGLang to abandon independent-sequence batching.

Infrastructure & Runtime

vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe

vLLM v0.19 block preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but both require experimental flags and carry unresolved caveats.

Infrastructure & Runtime

KV Cache Is Becoming a Distributed Infrastructure Layer: What KV Packet and llm-d Mean for Self-Hosted LLM Teams

KV Packet eliminates cross-request recomputation; llm-d brings cache-aware routing to Kubernetes. Here's what both mean for vLLM capacity planning.

Infrastructure & Runtime

The Complete Guide to Local LLMs in 2026

Why [running AI on your own hardware](/articles/vllm-block-level-preemption-and-flexkv-shift-the-long-context-bottleneck-from/) is becoming the default choice for privacy-conscious developers and enterprises alike.