vLLM 0.21 Makes Prefill-Decode Disaggregation Actually Practical

vLLM’s disaggregated serving model splits prefill and decode across separate GPU pools and ships KV cache blocks from prefill nodes to decode nodes over a one-way path. Reported changes in the v0.21 development cycle add bi-directional KV cache transfers via NVIDIA’s NIXL transfer library, which would let schedulers move KV blocks back from decode to prefill under pressure. If the upstream PRs land as described, the P/D node ratio becomes a runtime scheduling decision rather than a provisioning-time constant.

How P/D disaggregation works today

vLLM uses PagedAttention to manage KV cache in fixed-size blocks, according to [vLLM’s documentation]¹. In a disaggregated deployment, prefill nodes compute the full prompt KV cache, serialize it into blocks, and transfer them to decode nodes over the network. Decode nodes receive those blocks and generate tokens. The flow is strictly unidirectional: prefill produces, decode consumes.
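To give a sense of what the prefill side actually ships, here is a back-of-the-envelope sketch of the payload for one prompt. The layer count, KV head count, and head dimension are taken from the public Llama 3.1 8B model config; the 16-token block size and fp16 cache are assumptions, and the function names are illustrative rather than part of vLLM.

```python
import math

# Assumed model shape (Llama 3.1 8B public config) and an fp16/bf16 KV cache.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2
BLOCK_SIZE = 16  # tokens per KV block; an assumed, commonly used value


def kv_bytes_per_token() -> int:
    # The factor of 2 covers the separate key and value tensors.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def transfer_plan(prompt_tokens: int) -> tuple[int, int]:
    """Return (blocks, bytes) a prefill node would ship for one prompt."""
    blocks = math.ceil(prompt_tokens / BLOCK_SIZE)
    return blocks, blocks * BLOCK_SIZE * kv_bytes_per_token()


blocks, nbytes = transfer_plan(4096)
print(blocks, nbytes / 2**20)  # number of blocks, payload in MiB
```

Under these assumptions each token carries 128 KiB of KV state, so a 4,096-token prompt moves 256 blocks, or 512 MiB, per request. That is the unit of cost every later transfer decision has to weigh.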

On H100 hardware, vLLM achieves roughly 12,500 tokens per second throughput on Llama 3.1 8B, with sub-80ms time-to-first-token at low concurrency, per [third-party benchmarks]². Inter-token latency stays in the 11 to 21ms range across workloads on the same hardware². vLLM supports NVIDIA GPUs, AMD GPUs, Google TPUs, Intel Gaudi, and additional hardware targets including IBM Spyre and Apple Silicon¹.

The V1 scheduler delivered a 1.7x throughput improvement over the V0 architecture, per [independent benchmarks]².

NIXL: the transfer layer

NVIDIA’s Inference Transfer Library (NIXL) is an open-source, vendor-agnostic data-movement library³ aimed at disaggregated inference frameworks such as NVIDIA Dynamo⁴. Core use cases include KV cache transfers between prefill and decode workers, long-context KV cache storage, and elastic expert parallelism³.

NIXL exposes a backend plugin model. Documented backends include UCX, Libfabric, POSIX, CUDA GDS, and object storage, according to [DeepWiki’s overview]⁵ and the [NIXL repository]⁴. A Mooncake plugin supports both DRAM and VRAM transfers. Mooncake in this context is a transfer engine, not the Chinese bakery item that pollutes search results.

The library supports synchronous and asynchronous metadata exchange, including etcd-based coordination between nodes⁴⁵, allowing agents to be added and removed at runtime. That property is what makes dynamic P/D node-count ratios technically feasible: a scheduler can register new decode nodes or drain prefill nodes without restarting the cluster.
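The idea of runtime membership can be illustrated with a toy registry. This is not NIXL’s API: `AgentRegistry` and its methods are hypothetical names, and the in-memory dict stands in for whatever metadata store (etcd, for instance) a real deployment would use.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRegistry:
    """Toy in-memory stand-in for a shared metadata store (hypothetical)."""
    roles: dict[str, str] = field(default_factory=dict)  # agent name -> role

    def register(self, name: str, role: str) -> None:
        if role not in ("prefill", "decode"):
            raise ValueError(f"unknown role: {role}")
        self.roles[name] = role

    def deregister(self, name: str) -> None:
        # Removing an unknown agent is a no-op, so drains are idempotent.
        self.roles.pop(name, None)

    def ratio(self) -> tuple[int, int]:
        """Return (prefill_count, decode_count) for the current membership."""
        prefill = sum(1 for r in self.roles.values() if r == "prefill")
        return prefill, len(self.roles) - prefill


reg = AgentRegistry()
for name in ("p0", "p1"):
    reg.register(name, "prefill")
for name in ("d0", "d1"):
    reg.register(name, "decode")

reg.deregister("p1")          # drain one prefill node
reg.register("d2", "decode")  # add decode capacity without a restart
print(reg.ratio())
```

The cluster starts at 2:2 and ends at 1:3 with no restart step, which is the behavior the metadata exchange is meant to enable.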

What bi-directional transfers change

In the current one-way model, prefill nodes push KV blocks to decode nodes and forget about them. If a decode node hits capacity pressure, or a request needs re-prefilling because a long chain-of-thought exceeds the decode node’s KV budget, the only option is to recompute the full KV cache from scratch on a prefill node and ship it again. That wastes both compute and transfer bandwidth.

Bi-directional transfers would let a scheduler move KV blocks from decode back to prefill. The reported changes in v0.21 add this reverse path, according to upstream PRs that have not been independently verified against published releases. If accurate, the consequences are direct:

Decode nodes under memory pressure can offload KV blocks to prefill capacity rather than dropping requests.
Continuations that need re-prefilling can migrate back toward prefill capacity without full recomputation.
P/D node-count ratios shift from a provisioning-time constant to a runtime variable that the scheduler adjusts.
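Whether moving blocks back beats recomputing them comes down to a simple comparison, sketched below. This is a minimal cost model under stated assumptions, not vLLM’s scheduler logic: the function names are hypothetical, and the link bandwidth and prefill throughput figures are illustrative placeholders rather than measurements.

```python
def reverse_transfer_seconds(tokens: int, bytes_per_token: int,
                             link_gbytes_per_s: float) -> float:
    """Time to move the existing KV state over the network."""
    return tokens * bytes_per_token / (link_gbytes_per_s * 1e9)


def recompute_seconds(tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the same KV state from scratch on a prefill node."""
    return tokens / prefill_tokens_per_s


def should_migrate(tokens: int, bytes_per_token: int,
                   link_gbytes_per_s: float,
                   prefill_tokens_per_s: float) -> bool:
    """Migrate only when the transfer is cheaper than the recompute."""
    transfer = reverse_transfer_seconds(tokens, bytes_per_token, link_gbytes_per_s)
    return transfer < recompute_seconds(tokens, prefill_tokens_per_s)


# Illustrative inputs: 128 KiB of KV per token, 8,192-token context,
# 20,000 tokens/s of prefill throughput (all assumed values).
print(should_migrate(8192, 131072, 25.0, 20000.0))  # fast link
print(should_migrate(8192, 131072, 1.0, 20000.0))   # slow link
```

With these inputs the fast link wins (about 0.04 s to transfer versus about 0.41 s to recompute) and the slow link loses (about 1.07 s to transfer), so the reverse path is only worth taking when the fabric can carry it.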

The telemetry gap

The operational risk sits in the autoscaler. If P/D ratios become dynamic, autoscalers built on the old assumption (static prefill pool, static decode pool, one-way KV flow) will make incorrect scaling decisions. An autoscaler that only monitors decode-node queue depth will not see prefill starvation caused by reverse KV transfers flooding the prefill pool. An autoscaler that only monitors prefill throughput will miss decode-side memory pressure that bi-directional rebalancing was supposed to relieve.
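One possible shape for a policy that looks at both pools is sketched here. The signal names and thresholds are hypothetical, and the rule that a prefill backlog under heavy reverse traffic should be attributed to decode memory is an assumption for illustration, not a recommendation from vLLM or NIXL.

```python
from dataclasses import dataclass


@dataclass
class PoolSignals:
    prefill_queue_depth: int      # requests waiting for a prefill slot
    decode_kv_utilization: float  # fraction of decode KV memory in use, 0.0 to 1.0
    reverse_blocks_per_s: float   # decode -> prefill KV traffic


def scaling_action(s: PoolSignals, max_queue: int = 32,
                   max_kv_util: float = 0.9,
                   max_reverse: float = 1000.0) -> str:
    """Pick a scaling action from both pools' signals (thresholds assumed)."""
    prefill_starved = s.prefill_queue_depth > max_queue
    decode_pressed = s.decode_kv_utilization > max_kv_util
    if prefill_starved and s.reverse_blocks_per_s > max_reverse:
        # The backlog is driven by offloaded KV blocks, so adding prefill
        # nodes would treat the symptom; decode memory is the root cause.
        return "scale_decode"
    if prefill_starved:
        return "scale_prefill"
    if decode_pressed:
        return "scale_decode"
    return "hold"


print(scaling_action(PoolSignals(50, 0.95, 5000.0)))
print(scaling_action(PoolSignals(50, 0.50, 10.0)))
```

The first case would look like prefill starvation to a queue-depth-only autoscaler; including the reverse-traffic rate is what lets the policy tell the two situations apart.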

NIXL’s metadata exchange gives the transport layer a mechanism for node discovery. But observability at the KV-connector layer, specifically transfer latency, block throughput, and error rates between individual P/D pairs, is what autoscalers need to consume. The reported Mooncake KVConnectorStats class would address this, if it ships.
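As a rough picture of the metrics involved, the sketch below aggregates transfer latency, block throughput, and error rate per directed node pair. It is a hypothetical stand-in and does not reflect the interface of the reported KVConnectorStats class; `TransferTelemetry`, `PairStats`, and their fields are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class PairStats:
    latencies_s: list[float] = field(default_factory=list)
    blocks: int = 0
    errors: int = 0
    attempts: int = 0


class TransferTelemetry:
    """Per-pair counters keyed by (source, destination), so the forward
    and reverse directions between the same two nodes stay separate."""

    def __init__(self) -> None:
        self._pairs: dict[tuple[str, str], PairStats] = defaultdict(PairStats)

    def record(self, src: str, dst: str, blocks: int,
               latency_s: float, ok: bool = True) -> None:
        s = self._pairs[(src, dst)]
        s.attempts += 1
        if ok:
            s.blocks += blocks
            s.latencies_s.append(latency_s)
        else:
            s.errors += 1

    def summary(self, src: str, dst: str) -> dict:
        s = self._pairs[(src, dst)]
        busy = sum(s.latencies_s)  # total time spent in successful transfers
        return {
            "mean_latency_s": mean(s.latencies_s) if s.latencies_s else None,
            "blocks_per_s": s.blocks / busy if busy else 0.0,
            "error_rate": s.errors / s.attempts if s.attempts else 0.0,
        }


t = TransferTelemetry()
t.record("d0", "p0", blocks=64, latency_s=0.02)
t.record("d0", "p0", blocks=64, latency_s=0.04)
t.record("d0", "p0", blocks=64, latency_s=0.0, ok=False)
print(t.summary("d0", "p0"))
```

Keying on the directed pair matters here: a healthy prefill-to-decode path can coexist with a failing decode-to-prefill path between the same two nodes, and a pool-level average would hide that.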

Where the competition stands

Since NIXL is designed as a vendor-agnostic transfer library³, the transport-level capability for bi-directional movement could extend to other inference stacks. The differentiator is what the scheduler above NIXL does with it. In the sources available at the time of writing, no other inference engine has published documentation of bi-directional KV cache scheduling, but the NIXL plugin model means the transport layer does not block it.

NVIDIA’s [developer blog on NIXL]³ positions the library as a universal data-movement solution for inference workloads. Independent benchmarks of bi-directional KV transfer overhead are not available in the current source set. Operators should treat vendor throughput claims for this specific path as unvalidated until independently measured.

The broader trend is clear regardless of the specific version number: disaggregated serving is moving from a static partitioning problem to a dynamic scheduling one. The teams that instrument the transfer layer first will be the ones who can safely tune their autoscalers when bi-directional paths ship.

Frequently Asked Questions

Does bi-directional KV transfer help for short-prompt workloads?

For uniformly short prompts (under ~1,000 tokens) with stable decode lengths, the one-way pipeline is already near-optimal. The reverse path adds scheduling overhead that only pays off when decode nodes hit memory pressure or long chain-of-thought continuations force re-prefilling. Chat summarization and classification endpoints are unlikely to see throughput gains from enabling it.

How do SGLang and TensorRT-LLM handle disaggregated KV transfers?

Both SGLang and TensorRT-LLM can use NIXL as a KV transfer backend, but neither has published bi-directional KV cache scheduling, and their documented deployments assume a one-way prefill-to-decode flow. The transport capability exists in NIXL for all three frameworks; the scheduler integration to exploit reverse movement is specific to vLLM’s reported v0.21 cycle.

Can bi-directional transfers work without NVLink or InfiniBand?

NIXL’s Mooncake plugin supports DRAM-to-DRAM transfers alongside VRAM paths, so operators without high-speed GPU interconnects can run reverse KV transfers over standard networking. The cost is a GPU-to-host-memory copy on each end, which may negate the compute savings from avoiding full recomputation on long contexts. RDMA-capable backends such as UCX and Libfabric remain the high-throughput options for node-to-node transfers; CUDA GDS addresses storage paths rather than the network.

What does PagedAttention’s block size mean for reverse-transfer efficiency?

PagedAttention uses fixed-size blocks, typically 16 tokens each, and keeps memory waste under 4%. Block boundaries are preserved during reverse transfers, so a sequence whose length is not a multiple of the block size ships a partially filled final block, padding included. For short KV sequences migrated back for re-prefilling, that padding is a larger share of the payload, so effective bandwidth utilization on the reverse path can be lower than on the forward path, where blocks are typically full after prefill computation.
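The cost of a partially filled final block is easy to put in numbers. The sketch below assumes whole blocks are always transferred and uses a 16-token block size; `block_utilization` is an illustrative helper, not a vLLM function.

```python
import math


def block_utilization(tokens: int, block_size: int = 16) -> float:
    """Fraction of transferred block slots that hold real KV entries."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    blocks = math.ceil(tokens / block_size)
    return tokens / (blocks * block_size)


for n in (17, 100, 4096):
    print(n, round(block_utilization(n), 3))
# 17 0.531
# 100 0.893
# 4096 1.0
```

A 17-token fragment wastes almost half of what it ships, while a 4,096-token sequence wastes nothing, which is why the penalty concentrates on short sequences.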

Could elastic expert parallelism contend with KV transfers on the same NIXL fabric?

NIXL serves both KV cache transfers and elastic expert parallelism as first-class use cases. On MoE deployments with dynamic expert routing and bi-directional P/D rebalancing, both workloads share the same transport layer and physical links. Expert-parallel traffic spikes during high concurrency can collide with decode-to-prefill migration events, making per-backend utilization monitoring essential.