The complete feature set spans two releases. vLLM v0.18.0 (March 20, 2026) delivered FlexKV as a pluggable offloading backend and a reuse-gated CPU cache policy; v0.19.0 (April 3, 2026) added block-level preemption and the general CPU offloading connector that ties both together. The combination shifts the binding constraint on long-context serving from GPU memory capacity to PCIe bandwidth — but only behind experimental flags and with unresolved reviewer concerns around memory leaks and preemption state management (PR #37160: General CPU KV cache offloading with pluggable cache policy and block-level preemption handling).
What Actually Shipped in v0.18 vs v0.19
The release notes distribute the features across two versions in a way the article title compresses into one. v0.18.0 contains three pieces: FlexKV as a new offloading backend (vLLM v0.18.0 Release Notes), smart CPU offloading via FilterReusedOffloadingManager (PR #35342: Smart CPU offloading that stores only frequently-reused blocks), and multi-KV-group support in the offloading spec (PR #36610: Support for multiple KV groups in offloading spec). v0.19.0 completed the stack by landing general CPU KV cache offloading with a pluggable cache policy and block-level preemption handling (vLLM v0.19.0 Release Notes).
The distinction matters because block-level preemption is the piece that makes the system stable under concurrent load. Without it, preemption operated at sequence granularity — a coarse mechanism that evicted entire request sequences from GPU memory rather than individual blocks. The v0.19 work replaced that API boundary, which changes what the system can do when multiple long-context requests compete for GPU KV cache simultaneously.
FlexKV: Tencent’s Distributed KV Backend
FlexKV is a distributed KV store and multi-level cache management system built by Tencent Cloud’s TACO team, integrated into vLLM as an optional offloading backend via PR #34328 (PR #34328: FlexKV as a new offloading backend). The design treats host CPU memory as one tier in a hierarchy the connector manages, rather than a direct GPU-to-CPU spill that the inference engine controls.
The performance numbers the TACO team reported are specific to a narrow workload: ISL=21K, OSL=1K, batch_size=8, on an unstated hardware configuration running Qwen3-32B (PR #34328: FlexKV as a new offloading backend). At that operating point, the team claims TTFT drops by 60%, TPOT rises by 13%, and QPM improves by 16% (PR #34328: FlexKV as a new offloading backend). The TPOT increase is the expected cost of a deeper memory hierarchy — tokens that hit CPU cache incur higher per-token decode latency — and the QPM gain reflects better GPU utilization from avoiding OOM-driven request drops.
Smart Offloading: Filtering Blocks by Reuse Frequency
FilterReusedOffloadingManager (PR #35342) addresses a specific inefficiency in naive CPU offloading: when every evicted block is pushed to host memory regardless of whether it will ever be reused, you pay PCIe bandwidth for blocks that will never come back (PR #35342: Smart CPU offloading that stores only frequently-reused blocks). The fix is an O(1) LRU tracker that gates CPU stores by hash lookup frequency. Only blocks that have been looked up at least store_threshold times — default: 2 — are written to host memory (PR #35342: Smart CPU offloading that stores only frequently-reused blocks).
This is less sophisticated than it may appear. The policy tracks reuse count, not workload semantics. A one-shot RAG prefill that happens to generate the same block hash twice will still trigger an offload. A genuinely high-reuse system prompt seen only once before the threshold will not be offloaded at all. Teams that want tighter control will need to tune store_threshold against their actual cache hit distribution.
The PR author explicitly states no formal benchmarks exist for this path (PR #35342: Smart CPU offloading that stores only frequently-reused blocks). The reuse-gating argument is a PCIe bandwidth conservation argument: the claim is that capacity is not wasted on one-shot blocks, not that overall throughput improves.
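The gating mechanism described above can be sketched in a few lines. This is an illustrative stand-in, not vLLM's actual FilterReusedOffloadingManager: the class name, method names, and the LRU-bounding detail are assumptions; only the threshold semantics (store after `store_threshold` lookups, default 2) come from PR #35342.

```python
from collections import OrderedDict

class ReuseGatedOffloadPolicy:
    """Sketch of a reuse-gated CPU store policy (illustrative, not
    vLLM's FilterReusedOffloadingManager): a block is written to host
    memory only after its hash has been looked up `store_threshold`
    times. An LRU-bounded dict keeps the tracker O(1) per operation."""

    def __init__(self, store_threshold: int = 2, max_tracked: int = 100_000):
        self.store_threshold = store_threshold
        self.max_tracked = max_tracked
        self._lookups: OrderedDict[int, int] = OrderedDict()

    def record_lookup(self, block_hash: int) -> None:
        # Bump the lookup count and refresh the block's LRU position.
        count = self._lookups.pop(block_hash, 0)
        self._lookups[block_hash] = count + 1
        if len(self._lookups) > self.max_tracked:
            self._lookups.popitem(last=False)  # drop least recently seen

    def should_store(self, block_hash: int) -> bool:
        # Gate the PCIe write: only repeatedly-requested blocks qualify.
        return self._lookups.get(block_hash, 0) >= self.store_threshold
```

With the default threshold of 2, a block's first lookup never triggers a store; the second does. This is also where the one-shot-prefill caveat above becomes visible: any block hashed twice qualifies, regardless of whether the workload will ever reuse it again.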
Block-Level Preemption Changes the Memory Model
Before v0.19, vLLM’s preemption API passed a set of preempted request IDs to the connector (PR #34805: Block-level preemption). The connector knew which sequences were being ejected but had no information about which blocks within those sequences were worth saving. For sliding-window attention — where only a subset of a sequence’s KV blocks are in scope for the current decode step — this meant the old API could not distinguish between blocks that were actively needed and blocks that had already slid out of the window.
PR #34805 replaced this with KVConnectorMetadata, which carries block-level granularity (PR #34805: Block-level preemption). The connector now receives enough information to selectively save sliding-window blocks that have been evicted from GPU KV cache while still being reachable from host memory. For requests with very long contexts and sliding-window attention, this means the working set on GPU shrinks to just the active window, with evicted-but-needed blocks living on CPU until requested.
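The difference between the two API shapes can be sketched as follows. The field and function names here are assumptions for illustration; the real interface is vLLM's KVConnectorMetadata, whose exact schema is defined in PR #34805.

```python
from dataclasses import dataclass, field

@dataclass
class PreemptionMetadata:
    """Illustrative stand-in for block-granular connector metadata.
    Instead of receiving only a preempted request ID (the pre-v0.19
    shape), the connector sees which blocks the request holds, so it
    can decide per block what is worth saving to host memory."""
    request_id: str
    block_ids: list[int] = field(default_factory=list)  # oldest -> newest

def blocks_worth_saving(meta: PreemptionMetadata,
                        window_tokens: int,
                        block_size: int) -> list[int]:
    # For sliding-window attention, only the last ceil(W / block_size)
    # blocks are still reachable by future decode steps; earlier
    # blocks have slid out of the window and need not be offloaded.
    live = -(-window_tokens // block_size)  # ceiling division
    return meta.block_ids[-live:]
```

For a request holding 8 blocks with a 4096-token window and 1024-token blocks, only the last 4 blocks survive preemption. Under the old request-ID-only API, the connector had no basis for making that cut.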
The second-order effect is a shift in where the bottleneck lives. GPU memory is no longer the binding constraint for long-context concurrency — assuming sufficient host DRAM — but PCIe bandwidth and CPU cache policy now are. A system with adequate GPU memory for one 32K-token request might run several concurrent requests if the CPU offload path is efficient, but if they all cache-miss on CPU simultaneously, they contend on the same PCIe bus. The failure mode changes character; it does not disappear.
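A back-of-envelope calculation shows why PCIe becomes the ceiling. All figures below are assumptions for illustration, not measurements from the PRs: Llama-3.1-8B-like dimensions (32 layers, 8 KV heads, head dim 128, fp16) and a nominal PCIe 4.0 x16 link at roughly 32 GB/s.

```python
# KV cache footprint per token: K and V, per layer, per KV head.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes

context_tokens = 32_000
kv_bytes = bytes_per_token * context_tokens

pcie_bw = 32e9  # bytes/s, nominal PCIe 4.0 x16; real links deliver less
transfer_s = kv_bytes / pcie_bw
print(f"{bytes_per_token} B/token, "
      f"{kv_bytes / 2**30:.1f} GiB for {context_tokens} tokens, "
      f"~{transfer_s * 1e3:.0f} ms over one PCIe 4.0 x16 link")
```

Under these assumptions a full 32K-token context is roughly 4 GiB of KV cache and takes on the order of 130 ms to move across the bus. Two or three requests cache-missing on CPU at once turn that into a serialized queue on the same link, which is exactly the contention scenario described above.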
Benchmarks and Overhead Numbers
The most concrete numbers come from PR #37160, which benchmarks SimpleCPUOffloadConnector running Llama-3.1-8B at an 8k-token context (PR #37160: General CPU KV cache offloading with pluggable cache policy and block-level preemption handling).
The multi-turn case with 400GB of host memory is more informative. With offloading enabled, the system achieves 45,151 tok/s versus 35,881 tok/s without offload, and mean TPOT improves from 32.13ms to 21.93ms (PR #37160: General CPU KV cache offloading with pluggable cache policy and block-level preemption handling). That 32% TPOT improvement comes from a different mechanism than TTFT reduction: the offloaded KV cache allows more requests to be scheduled concurrently without GPU OOM, which improves GPU utilization and amortizes decode overhead across a larger batch.
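The percentages follow directly from the reported figures; recomputing them makes the two distinct effects explicit:

```python
# Deltas from the PR #37160 multi-turn benchmark (400GB host memory).
tok_s_off, tok_s_on = 35_881, 45_151  # tok/s without vs with offload
tpot_off, tpot_on = 32.13, 21.93      # mean TPOT in ms

throughput_gain = (tok_s_on - tok_s_off) / tok_s_off
tpot_improvement = (tpot_off - tpot_on) / tpot_off
print(f"throughput +{throughput_gain:.0%}, TPOT -{tpot_improvement:.0%}")
```

That is roughly a 26% throughput gain alongside the 32% TPOT improvement, consistent with the larger-batch explanation: more concurrent requests fit, and per-token decode cost drops as overhead is amortized.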
Experimental Flags and Known Issues
Both the general CPU offloading path and SimpleCPUOffloadConnector require two non-default settings: the environment variable VLLM_USE_SIMPLE_KV_OFFLOAD=1 and the server flag --no-disable-hybrid-kv-cache-manager (PR #37160: General CPU KV cache offloading with pluggable cache policy and block-level preemption handling). Opt-in via environment variable rather than standard configuration flags signals that the vLLM team does not consider this path stable for general use.
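In practice the opt-in looks like the following launch sketch. The flag names are as documented in PR #37160; the `vllm serve` entry point, model name, and port are placeholder assumptions:

```shell
# Opt-in settings for the experimental CPU offload path.
# Env var gates the connector; the server flag re-enables the
# hybrid KV cache manager the connector depends on.
export VLLM_USE_SIMPLE_KV_OFFLOAD=1
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --no-disable-hybrid-kv-cache-manager \
    --port 8000
```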
During code review for PR #37160, reviewers flagged two categories of concern: potential memory leaks and incorrect behavior in state management during request preemption and CPU cache eviction (PR #37160: General CPU KV cache offloading with pluggable cache policy and block-level preemption handling). Neither issue was marked resolved in the PR merge record. Teams activating this path in production should monitor RSS growth over multi-hour serving sessions, since memory leaks in preemption paths tend to surface only under sustained concurrent load.
What Teams Should Tune Before Production
store_threshold in FilterReusedOffloadingManager: Default is 2 hash lookups (PR #35342: Smart CPU offloading that stores only frequently-reused blocks). Lower values push more blocks to CPU — more PCIe traffic, higher hit rate. Higher values are more conservative. The right value depends on your cache reuse distribution, which requires profiling rather than estimation.
Preemption thresholds with mixed attention variants: Multi-KV-group support enforces a single GPU block size via assertions (PR #36610: Support for multiple KV groups in offloading spec). If your model uses per-group block sizes, test explicitly for assertion failures before assuming the offloading path handles them.
Host memory sizing: The 400GB figure in PR #37160 is the configuration the author tested, not a minimum recommendation (PR #37160: General CPU KV cache offloading with pluggable cache policy and block-level preemption handling). PCIe bandwidth saturation is the relevant ceiling to measure — host DRAM capacity sets the offload size, but bandwidth governs latency under concurrent eviction.
FlexKV vs SimpleCPUOffloadConnector: FlexKV suits deployments that already operate a distributed KV tier and need multi-level cache management (PR #34328: FlexKV as a new offloading backend). SimpleCPUOffloadConnector is the lighter path for single-node CPU offload with no external infrastructure dependencies (PR #37160: General CPU KV cache offloading with pluggable cache policy and block-level preemption handling). Both are pluggable under the same connector interface introduced in v0.19.
Before committing either path to production, run sustained concurrent load for several hours and track RSS. The reviewer-flagged memory leak is the failure mode most likely to appear outside a short benchmark window (PR #37160: General CPU KV cache offloading with pluggable cache policy and block-level preemption handling).
Frequently Asked Questions
When does FlexKV’s distributed KV tier justify its infrastructure overhead compared to SimpleCPUOffloadConnector?
FlexKV pays off when your serving topology spans multiple inference nodes that share a common KV cache — prefix caching across nodes, speculative decoding with a remote draft model, or multi-tenant setups where KV blocks must outlive individual request lifecycles. SimpleCPUOffloadConnector is sufficient for single-node deployments where the only goal is extending effective KV cache capacity into host DRAM.
What happens if the CPU offloading flags are omitted?
The offloading connector is never instantiated, and vLLM falls back to its default GPU-only KV cache management with sequence-level preemption. Long-context requests compete for GPU memory as before, with no host DRAM spill path available.
How should I detect the reviewer-flagged memory leak in production?
Track RSS per vLLM process on a rolling window. A linear increase over hours of sustained concurrent serving — particularly under workloads with frequent KV cache churn from preemption and eviction — is the signal. The leak is in preemption state management, so steady-state decode with minimal preemption may not surface it.
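A minimal monitoring sketch, assuming a Linux host where RSS is readable from /proc (the sampling interval and any alert threshold are deployment choices, not values from the PRs):

```python
def rss_kib() -> int:
    """Read the current process RSS from /proc (Linux-only sketch).
    For monitoring a vLLM server, read /proc/<server_pid>/status instead."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # reported in kB
    raise RuntimeError("VmRSS not found")

def rss_slope_kib_per_s(samples: list[tuple[float, int]]) -> float:
    """Least-squares slope over (timestamp_s, rss_kib) samples.
    A persistently positive slope across hours of churny traffic is
    the leak signal; flat or sawtooth RSS is normal cache behavior."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0
```

Sampling every minute and alerting when the slope stays positive over a multi-hour rolling window distinguishes a genuine leak from the expected sawtooth of cache fill and eviction.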
What workload profiles benefit most from lowering store_threshold below the default of 2?
Workloads with high prefix reuse across requests — shared system prompts, templated RAG contexts, or multi-turn conversations where earlier turns are repeatedly re-referenced. Lowering the threshold offloads these blocks earlier, increasing the probability that they survive GPU eviction. The tradeoff is higher PCIe traffic per offload cycle, which matters when the bus is already contended.
Can I use multi-KV-group offloading safely if my model has uniform block sizes across groups?
The current assertion only rejects per-group block size variation, so uniform-block-size models should pass the assertion. However, the feature was merged as preparatory groundwork, not as a production-hardened path. Run explicit multi-group tests under concurrent load before relying on it — the single-block-size constraint is the documented limit, but untested edge cases may exist.