Prefill-decode disaggregation is an LLM inference architecture that runs the prompt-processing (prefill) and token-generation (decode) phases on separate, dedicated GPU pools. By eliminating phase interference, it allows time to first token (TTFT) and inter-token latency (ITL)—metrics that cannot be tuned simultaneously in co-located deployments—to be optimized independently. As of 2025, it has become the default playbook at major hyperscalers.

What Is Prefill-Decode Disaggregation?

Every LLM inference request moves through two fundamentally different computational phases:

Prefill processes all input tokens in a single parallel forward pass, building a key-value (KV) cache for the sequence. This phase is compute-bound: GPU cores are saturated, and the work scales with prompt length. A 4,096-token prompt demands proportionally more FLOPs than a 256-token prompt.

Decode generates output tokens autoregressively—one token per forward pass. This phase is memory-bandwidth-bound: each step reads the full KV cache from GPU HBM but executes minimal compute. A GPU performing decode is largely waiting on memory reads, not floating-point operations.
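
The contrast can be made concrete with a back-of-envelope roofline calculation. This is a sketch under simplified assumptions (dense 70B-class model, FP16 weights, only weight traffic counted); the H100 figures in the comments are approximate:

```python
# Back-of-envelope arithmetic intensity for prefill vs. decode.
# Simplifying assumptions: dense 70B model, FP16 weights, and only
# weight traffic counted (KV-cache and activation reads ignored).

PARAMS = 70e9        # parameter count
BYTES_PER_PARAM = 2  # FP16

def arithmetic_intensity(num_tokens: int) -> float:
    """FLOPs per byte of weight traffic for one forward pass over num_tokens."""
    flops = 2 * PARAMS * num_tokens          # ~2 FLOPs per parameter per token
    weight_bytes = PARAMS * BYTES_PER_PARAM  # weights streamed from HBM once per pass
    return flops / weight_bytes

prefill_ai = arithmetic_intensity(4096)  # whole prompt in one parallel pass
decode_ai = arithmetic_intensity(1)      # one token per autoregressive step

# An H100 SXM offers roughly 990 TFLOPS dense FP16 against ~3.35 TB/s of
# HBM bandwidth, a ridge point near 295 FLOPs/byte. Prefill sits far above
# it (compute-bound); decode sits far below it (memory-bandwidth-bound).
print(f"prefill: ~{prefill_ai:.0f} FLOPs/byte, decode: ~{decode_ai:.0f} FLOPs/byte")
```

The gap of three orders of magnitude in arithmetic intensity is why no single hardware configuration serves both phases well.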

In a co-located system, both phases run on the same GPU within interleaved batches. The interference is structural: a long prefill blocking the GPU delays in-flight decode operations, spiking ITL. Conversely, routing many decode requests together degrades TTFT for new arrivals.

Disaggregation resolves this by routing each phase to a distinct fleet of workers sized and configured for its specific bottleneck.1

The Interference Problem in Co-Located Serving

To understand why disaggregation matters, consider what happens under a continuous batching scheduler when a large prefill arrives while decodes are in-flight.

The prefill occupies GPU compute for the entire duration of its forward pass—potentially hundreds of milliseconds for a long document. Decode requests queued behind it must wait. When they finally resume, the cumulative delay appears as latency spikes in ITL, directly degrading the perceived responsiveness of streaming outputs.

Tuning chunk size in co-located systems only partially addresses this. Smaller prefill chunks reduce ITL spikes but increase TTFT by spreading the prefill across more iterations. There is no single configuration that satisfies strict SLOs for both metrics simultaneously under variable load.2

How Disaggregation Works

The architecture separates the serving cluster into two fleets:

Prefill workers receive incoming requests, execute the full prefill pass in parallel, and generate the KV cache. These nodes are configured for compute efficiency: larger tensor parallelism, higher batch sizes for similar-length prompts, and aggressive memory optimization for throughput.

Decode workers receive the completed KV cache from prefill workers via high-speed transfer, then run the autoregressive generation loop. These nodes favor memory bandwidth: smaller parallelism groups, lower batch sizes for consistent ITL, and fast HBM access patterns.

A router (or scheduler) sits in front of both fleets and handles request dispatch, load balancing, and KV transfer coordination. The router is a critical component—poor routing decisions can negate the benefits of disaggregation.

                      ┌───────────────────────────────────────┐
  Incoming Request    │          Router / Scheduler           │
 ────────────────────►│  (load balancing, KV coordination)    │
                      └───────────────────┬───────────────────┘
                                          │
                      ┌───────────────────┴───────────────────────┐
                      │                                           │
                      ▼                                           ▼
           ┌─────────────────────┐                 ┌────────────────────────┐
           │   Prefill Workers   │                 │     Decode Workers     │
           │   (compute-bound)   │ ──KV Transfer─► │  (memory-bandwidth-    │
           │  TP=8, large batch  │  (RDMA/NVLink)  │  bound)  TP=4, small   │
           │   TTFT-optimized    │                 │  batch, ITL-optimized  │
           └─────────────────────┘                 └────────────────────────┘
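
The dispatch path above can be sketched in a few lines. All class and worker names here are hypothetical; a production router additionally weighs prefix-cache locality and SLO state:

```python
# Minimal dispatch sketch for a disaggregated router (hypothetical names).
# Pick the least-loaded prefill worker, reserve a decode worker up front so
# the KV cache has a known destination, then record the pairing.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    active: int = 0  # in-flight requests on this worker

class Router:
    def __init__(self, prefill_pool, decode_pool):
        self.prefill_pool = prefill_pool
        self.decode_pool = decode_pool

    def dispatch(self, request_id: str) -> dict:
        prefill = min(self.prefill_pool, key=lambda w: w.active)
        decode = min(self.decode_pool, key=lambda w: w.active)
        prefill.active += 1
        decode.active += 1
        # A real router would also kick off the RDMA/NVLink KV transfer
        # from the chosen prefill worker to the chosen decode worker here.
        return {"request": request_id, "prefill": prefill.name, "decode": decode.name}

router = Router(
    prefill_pool=[Worker("prefill-0"), Worker("prefill-1")],
    decode_pool=[Worker("decode-0"), Worker("decode-1"), Worker("decode-2")],
)
print(router.dispatch("req-1"))  # pairs the least-loaded worker from each pool
```

Reserving the decode worker at dispatch time matters: the KV cache must be pushed to a specific destination, so the pairing cannot be deferred until prefill completes.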

The KV transfer step is the central engineering challenge. The KV cache for a long prompt on a 70B model can reach tens of gigabytes. Transferring it over standard Ethernet is a non-starter; production deployments use RDMA (InfiniBand or RoCE), NVLink for intra-node transfers, or PCIe 5.0, targeting transfer times below a single decode iteration (~50ms).1

From Research to Production: The 2025 Inflection Point

The idea of separating prefill and decode was articulated clearly in DistServe, published by Yinmin Zhong et al. and presented at OSDI 2024. The paper demonstrated that disaggregation could serve 7.4x more requests or achieve 12.6x tighter SLOs compared to state-of-the-art co-located systems under the same hardware budget.1

Despite these results, widespread adoption lagged through 2024. The engineering investment was substantial: existing serving stacks were built around the assumption that both phases run on the same workers. Refactoring them required new scheduling logic, KV transfer infrastructure, and routing components.

The shift came in 2025. As the Hao AI Lab at UCSD noted in their retrospective: “When businesses run competitively at full scale, system throughput is not the only most important metric anymore. Taming latency has become increasingly critical to the growth—or even survival—of their businesses.”3

By mid-2025, disaggregation had moved from research into the core architecture of virtually every major inference framework. Meta, LinkedIn, Mistral, and Hugging Face are among the organizations running vLLM with disaggregated serving in production.4

Framework Support: The Current Landscape

As of early 2026, disaggregation is supported—with varying degrees of maturity—across every major open-source inference framework.

| Framework | Disaggregation Status | KV Transfer | Kubernetes-Native | Notable Deployments |
| --- | --- | --- | --- | --- |
| vLLM | Experimental | NIXL, LMCache, custom connectors | Via llm-d | Meta, Mistral, Hugging Face |
| SGLang | Production-ready | RDMA, NVLink | Partial | LMSYS, Ant Group |
| NVIDIA Dynamo | GA (1.0) | NIXL | Yes (AKS, GKE) | GTC 2025 showcase |
| Mooncake | Production | Transfer Engine (RDMA) | Partial | Kimi (Moonshot AI) |
| llm-d | Production | UCCL transport | Yes (native) | AWS, Red Hat |
| Ray Serve LLM | Production | Pluggable | Yes | General cloud |

vLLM

vLLM’s disaggregated prefilling runs two separate vLLM instances—one prefill, one decode—connected by a configurable KV transfer connector. The --kv-transfer-config flag controls the connector type; supported backends include NIXL (NVIDIA’s transfer library), LMCache, and shared memory for co-located deployments.5

Launch prefill instance:

    vllm serve meta-llama/Llama-3-70b-instruct \
        --port 8100 \
        --kv-transfer-config \
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

Launch decode instance:

    vllm serve meta-llama/Llama-3-70b-instruct \
        --port 8200 \
        --kv-transfer-config \
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'

SGLang

SGLang includes disaggregation as a first-class feature. The LMSYS team demonstrated it at scale in May 2025, running DeepSeek-R1 on 96 H100 GPUs with a 3-node prefill pool and 9-node decode pool. Results: 52,300 input tokens/second and 22,300 output tokens/second per node—approximately 5x the output throughput of vanilla tensor parallelism on the same hardware.6

NVIDIA Dynamo

Announced at GTC 2025 and reaching GA with Dynamo 1.0, NVIDIA’s framework sits as an orchestration layer above SGLang, vLLM, and TensorRT-LLM. Dynamo handles routing, auto-scaling, and KV transfer coordination while delegating actual inference to the underlying engine. NVIDIA reports up to 30x more requests served compared to non-disaggregated baselines, though this figure encompasses their full optimization stack, not disaggregation alone.7

llm-d

llm-d is a Kubernetes-native inference framework backed by Red Hat and AWS that treats disaggregation as an architectural primitive. Its v0.5 release validated ~3,100 tokens/second per B200 decode GPU and up to 50,000 output tokens/second on a 16×16 B200 prefill/decode topology. AWS now offers managed disaggregated inference powered by llm-d.8

Mooncake: The Production Blueprint

Moonshot AI’s Mooncake is the most mature publicly documented production deployment of disaggregation, serving as the infrastructure for Kimi—one of the highest-volume LLM chat services.2

Mooncake takes disaggregation further than most frameworks by making KV cache the architectural center of gravity rather than a side effect of phase separation. Its key components:

  • Transfer Engine: RDMA-based KV transfer using libfabric, supporting both EFA (for AWS) and ConnectX NICs (for on-premise InfiniBand). Transfer is pipelined to overlap with decode computation.
  • Mooncake Store: A distributed KV storage engine that pools CPU DRAM, NVMe SSDs, and GPU HBM across the cluster as a unified cache tier. Frequently reused KV blocks (for common system prompts, for example) persist across requests.
  • Conductor: A global scheduler that dispatches requests to prefill and decode workers based on current KV cache locality, load distribution, and latency SLO state.
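
The flavor of Conductor-style scheduling can be sketched as a toy scoring function. This is hypothetical illustrative code, not Mooncake's implementation, and the 0.1 load-penalty weight is an arbitrary assumption:

```python
# Toy cache-locality-aware scheduling in the spirit of Conductor
# (hypothetical scoring, not Mooncake's code): prefer the prefill worker
# holding the most reusable KV blocks, penalized by its current load.
# The 0.1 load-penalty weight is an arbitrary illustrative choice.

def score(worker: dict, prompt_blocks: list) -> float:
    cached = len(set(prompt_blocks) & worker["cached_blocks"])
    hit_ratio = cached / len(prompt_blocks)         # fraction of prefill we can skip
    return hit_ratio - 0.1 * worker["queue_depth"]  # trade locality against load

def pick_prefill_worker(workers: list, prompt_blocks: list) -> str:
    return max(workers, key=lambda w: score(w, prompt_blocks))["name"]

workers = [
    {"name": "p0", "cached_blocks": {"sys", "doc1"}, "queue_depth": 4},
    {"name": "p1", "cached_blocks": {"sys"}, "queue_depth": 0},
]
# p0 holds more of the prompt's blocks, but its queue is deep; idle p1 wins.
print(pick_prefill_worker(workers, ["sys", "doc1", "new-turn"]))
```

The point of the sketch: cache locality and load balance pull in opposite directions, and the scheduler's job is to arbitrate between them per request.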

Presented at FAST ’25, Mooncake reported a 59% to 498% improvement in effective request capacity compared to baseline methods—a wide range that reflects the strong dependency on workload characteristics, particularly prompt length distribution and cache hit rate.9

At time of writing, Mooncake is operational across thousands of nodes and processes over 100 billion tokens per day.2

The KV Transfer Bottleneck

The most common failure mode in disaggregated deployments is underestimating KV transfer cost. For a 70B model processing a 4,096-token prompt with a batch of 8:

  • KV cache size ≈ 2 (K and V) × num_layers × num_kv_heads × head_dim × num_tokens × bytes_per_element
  • At 80 layers, 8 KV heads, 128 head_dim, FP16: ~320 KB per token—about 1.3 GB for the 4,096-token request, or ~10.7 GB for the batch of 8

Transferring 10.7 GB over 100Gbps Ethernet takes ~860ms—far longer than most TTFT SLOs. Over InfiniBand HDR (200Gbps), it drops to ~430ms; over NVLink (600+ GB/s for NVLink 4.0 within a node), transfer is near-instantaneous.

This is why production disaggregated deployments require InfiniBand or equivalent high-speed interconnects between prefill and decode nodes. TCP-based transfer is an engineering experiment, not a production configuration.
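
These budgets are easy to check for other shapes with a quick calculator. This sketch assumes Llama-3-70B-class dimensions (80 layers, 8 KV heads, head_dim 128, FP16) and counts payload bytes only, ignoring protocol overhead:

```python
# KV-cache size and transfer-time calculator. Assumes Llama-3-70B-class
# shapes (80 layers, 8 KV heads, head_dim 128, FP16) and counts payload
# bytes only—no protocol overhead.

def kv_cache_bytes(num_tokens: int, num_layers: int = 80, num_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Leading factor of 2 covers both the K and the V tensors.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem

def transfer_ms(num_bytes: int, link_gbps: float) -> float:
    return num_bytes * 8 / (link_gbps * 1e9) * 1e3

per_request = kv_cache_bytes(4096)  # one 4,096-token prompt
batch_total = 8 * per_request       # batch of 8
print(f"per request : {per_request / 1e9:.2f} GB")
print(f"batch of 8  : {batch_total / 1e9:.1f} GB")
print(f"100GbE      : {transfer_ms(batch_total, 100):.0f} ms")
print(f"IB HDR 200G : {transfer_ms(batch_total, 200):.0f} ms")
```

Doubling the link speed halves the stall, which is why the interconnect class dominates this design decision.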

FlowKV, a recent research framework, demonstrated that intelligent pipelining and scheduling of KV transfers can cut average transfer latency from 944ms to 53ms—a roughly 94% reduction—primarily by overlapping transfer with ongoing decode computation rather than serializing the handoff.10
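
A toy model shows why this overlap is so effective. This is illustrative only—not FlowKV's actual mechanism; the per-layer chunking and perfect-overlap assumptions are mine, while the 944ms baseline is the figure reported above:

```python
# Toy model of serialized vs. pipelined KV handoff. Illustrative only—not
# FlowKV's mechanism; the per-layer chunking and perfect-overlap assumption
# are mine. The 944ms baseline figure is from the FlowKV paper.

TRANSFER_MS = 944   # total KV transfer time for the request
NUM_CHUNKS = 80     # e.g. one chunk per transformer layer (assumed)

# Serialized handoff: decode cannot start until the whole cache arrives.
serialized_stall_ms = TRANSFER_MS

# Pipelined handoff: layer 0's KV arrives first, so layer-0 compute begins
# while deeper layers stream in. With perfect overlap, only the first
# chunk sits on the critical path.
pipelined_stall_ms = TRANSFER_MS / NUM_CHUNKS

print(f"serialized stall: {serialized_stall_ms} ms")
print(f"pipelined stall : {pipelined_stall_ms:.1f} ms")
```

Real systems fall between these extremes—chunk imbalance and contention erode the perfect-overlap bound—but the order-of-magnitude gain matches what FlowKV reports.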

When to Use Disaggregation—and When Not To

Disaggregation adds operational complexity. It is not the right choice for every deployment.

Use disaggregation when:

  • Your workload operates at scale (hundreds of concurrent requests or more)
  • You have strict, differentiated SLOs for TTFT and ITL that cannot be satisfied simultaneously with co-location
  • Prompt lengths are long (2,000+ tokens) or highly variable, causing prefill stalls under co-location
  • You have InfiniBand or equivalent interconnects between GPU nodes
  • You need to scale prefill and decode capacity independently based on traffic patterns

Stick with co-location when:

  • Your deployment is small (single node or a few GPUs)
  • Prompts are short with high prefix cache hit rates (local prefill is faster than KV transfer)
  • You lack high-speed inter-node interconnects
  • Your latency SLOs are loose enough that chunked prefill resolves interference adequately

Chunked Prefill vs. Full Disaggregation

Chunked prefill is a middle-ground technique that breaks large prefill requests into smaller chunks and batches them with decode iterations. It partially mitigates ITL spikes without requiring a separate prefill fleet.
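
The slicing logic can be sketched as follows (hypothetical scheduler code, not any particular framework's implementation):

```python
# Sketch of chunked-prefill slicing (hypothetical scheduler code): each
# engine iteration processes one prompt slice alongside every in-flight
# decode request, bounding how long any decode can be stalled.

def build_iterations(prompt_len: int, chunk_size: int, decode_reqs: list) -> list:
    iterations = []
    for start in range(0, prompt_len, chunk_size):
        chunk = (start, min(start + chunk_size, prompt_len))
        # Every in-flight decode advances one token in the same iteration.
        iterations.append({"prefill_slice": chunk, "decode": list(decode_reqs)})
    return iterations

iters = build_iterations(prompt_len=4096, chunk_size=1024,
                         decode_reqs=["req-a", "req-b"])
print(len(iters))                  # the prefill is spread over 4 iterations
print(iters[0]["prefill_slice"])   # (0, 1024)
```

Smaller chunks cap the per-iteration decode stall but spread the prefill over more iterations, which is exactly the TTFT regression traded away here.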

| Dimension | Chunked Prefill | Full Disaggregation |
| --- | --- | --- |
| ITL control | Partial (chunk-size dependent) | Strong (phases fully isolated) |
| TTFT impact | Regression (prefill spread across iterations) | Improved (dedicated prefill hardware) |
| Tail latency | Hard to control with chunk tuning | Reliably controlled |
| Hardware requirement | Single fleet | Two fleets + high-speed interconnect |
| Operational complexity | Low | High |
| Throughput (large scale) | Lower than disaggregation | Higher (phase-specific parallelism) |
| Minimum scale to benefit | Any | Medium to large (50+ GPUs) |

Chunked prefill is the correct choice when you want better ITL without the operational overhead of disaggregation, and when tail latency guarantees are not required. Full disaggregation is the correct choice when you need predictable tail ITL, have strict SLOs, and are operating at a scale where the infrastructure investment pays off.4

The Road Ahead

The research front has moved beyond basic prefill/decode separation. Several directions are active as of early 2026:

Intra-GPU disaggregation (Nexus, arXiv 2507.06608) explores running prefill and decode concurrently on different SM (streaming multiprocessor) partitions within a single GPU, avoiding the network transfer cost entirely for single-node deployments.

Partial prefill disaggregation (Not All Prefills Are Equal, arXiv 2603.13358) observes that in multi-turn conversations, only the new tokens in each turn need prefill on the prefill workers; cached turns can be served locally, reducing transfer volume.

CXL-based KV transfer (TraCT) explores using CXL shared memory as a zero-hop KV transfer substrate between GPUs in a rack, bypassing NIC overhead entirely for rack-scale deployments.

The fundamental insight—that prefill and decode are structurally different workloads that benefit from dedicated resources—has proven robust enough to drive an entire ecosystem of frameworks, hardware architectures, and operating models. For engineers building inference infrastructure today, it is less a question of whether to use disaggregation than when and how.


Frequently Asked Questions

Q: Does prefill-decode disaggregation work with prefix caching? A: Yes, and the combination is powerful. If a prefill worker has already computed and cached KV blocks for a common prefix (system prompt, document preamble), subsequent requests can skip those blocks. Frameworks like Mooncake and SGLang integrate prefix caching directly into their disaggregated scheduling, routing requests to prefill workers that already hold relevant cache entries.

Q: How many GPUs do I need before disaggregation makes sense? A: There is no universal threshold, but most practitioners report that disaggregation becomes net-positive around 16-32 GPUs for typical chat workloads, and sooner for long-context or document processing workloads where prefill interference is more severe. Below 8 GPUs, co-location with chunked prefill is almost always the better choice.

Q: What network infrastructure is required? A: Production disaggregated deployments require high-speed inter-node interconnects. InfiniBand HDR (200Gbps) or NDR (400Gbps) is the standard for on-premise deployments. AWS deployments use EFA (Elastic Fabric Adapter). Standard 100Gbps Ethernet can work for smaller models or shorter prompts, but will become a bottleneck for large models with long contexts.

Q: Can vLLM run disaggregated serving in production today? A: vLLM’s disaggregated prefilling is labeled experimental as of early 2026, but it is running in production at companies like Meta and Hugging Face. The “experimental” label reflects ongoing API changes, not instability. For production use, organizations often deploy vLLM through llm-d or NVIDIA Dynamo, which add the routing and orchestration layers that vLLM itself does not provide.

Q: How does disaggregation affect cost? A: Disaggregation typically reduces cost per token at scale by improving GPU utilization. The LMSYS team’s DeepSeek-R1 deployment on 96 H100s calculated a cost of $0.20 per million output tokens with PD disaggregation—approximately one-fifth the cost of the official DeepSeek Chat API at the time of publication. However, the fixed overhead of running a router and potentially idle fleet capacity can increase costs at low utilization.

Footnotes

  1. Zhong, Yinmin et al. “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.” USENIX OSDI, 2024. https://arxiv.org/abs/2401.09670

  2. Qin, Ruoyu et al. “Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving.” arXiv, July 2024; presented at USENIX FAST ’25. https://arxiv.org/abs/2407.00079

  3. Hao AI Lab, UC San Diego. “Disaggregated Inference: 18 Months Later.” 2025. https://haoailab.com/blogs/distserve-retro/

  4. BentoML. “Prefill-decode disaggregation.” LLM Inference Handbook. https://bentoml.com/llm/inference-optimization/prefill-decode-disaggregation

  5. vLLM. “Disaggregated Prefilling (experimental).” vLLM Documentation. https://docs.vllm.ai/en/latest/features/disagg_prefill/

  6. LMSYS Org. “Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs.” May 2025. https://lmsys.org/blog/2025-05-05-large-scale-ep/

  7. NVIDIA. “Introducing NVIDIA Dynamo: A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models.” NVIDIA Technical Blog, 2025. https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/

  8. AWS. “Introducing Disaggregated Inference on AWS powered by llm-d.” AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/introducing-disaggregated-inference-on-aws-powered-by-llm-d/

  9. USENIX. “Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot.” FAST ’25. https://www.usenix.org/conference/fast25/presentation/qin

  10. FlowKV Authors. “FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling.” arXiv, April 2025. https://arxiv.org/html/2504.03775v1
