Serving Cold MoE Models: CrossPool Disaggregates KV Cache and Weights

arXiv:2606.24506, a paper from Beihang University posted on 23 Jun 2026, attacks a memory-economics problem in multi-tenant MoE serving: when a cluster hosts many checkpoints and most of them are cold, static FFN weights squat on GPU memory and crowd out the KV-cache that active requests actually need. Its answer is to put those two memory classes in separate pools and ship hidden states across the boundary, not KV tensors.

Why does hosting many MoE checkpoints waste GPU memory?

The binding constraint is residency, not compute. Sparse mixture-of-experts models carry large FFN weight matrices that stay resident in VRAM whether or not anyone is querying them, and in a multi-model deployment the long tail of checkpoints dominates the memory budget while contributing almost nothing to throughput.

The paper quantifies this with a cumulative distribution of token consumption on OpenRouter: roughly 90% of models are cold and consume few tokens (arXiv:2606.24506, full text). Reserving worst-case KV capacity per model on that distribution is wasteful by construction. The current generation of multi-LLM engines, MuxServe and kvcached, colocates weights and KV-cache in a single monolithic GPU pool, which couples KV capacity to weight footprint (arXiv:2606.24506, full text). That coupling is the root of the symptoms operators already recognize: low per-GPU KV capacity, weak long-context support, and requests that get rejected or OOM-killed when the weights of a handful of cold models have already filled the cards.

There is a subtler failure too, and it is algorithmic. KV-head-limited attention variants (MLA, MQA, GQA with fewer heads than the data-parallel group size) expose only a fraction, a half or a quarter, of replicated KV capacity to any single request under cold, low-concurrency traffic (arXiv:2606.24506, full text). A data-parallel replica can look fat on a capacity spreadsheet and sit effectively unusable, because the request cannot reach the KV head it needs on the replica it landed on.
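A toy model shows how a half or a quarter can arise. It assumes, purely for illustration, that a request can reach KV memory only on the ranks that hold one of its KV heads, so the reachable share is the number of KV heads divided by the group size, capped at one. The function name and this rule are hypothetical, not the paper's formula.

```python
from fractions import Fraction


def reachable_kv_fraction(num_kv_heads: int, group_size: int) -> Fraction:
    """Share of the group's KV memory one request can use (toy assumption)."""
    if num_kv_heads <= 0 or group_size <= 0:
        raise ValueError("num_kv_heads and group_size must be positive")
    # With fewer KV heads than ranks, only num_kv_heads ranks are useful.
    return min(Fraction(1), Fraction(num_kv_heads, group_size))


if __name__ == "__main__":
    # Hypothetical shapes: one KV head, two KV heads, and eight KV heads
    # in a group of four ranks.
    for heads, group in [(1, 4), (2, 4), (8, 4)]:
        print(heads, group, reachable_kv_fraction(heads, group))
```

Under this assumption, one KV head in a group of four reaches a quarter of the capacity, two heads reach half, and the limit disappears once the head count matches or exceeds the group size.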

What does CrossPool actually split, and what does it leave alone?

CrossPool separates two address spaces, not the workload. A weights pool consolidates FFN weights across the cold models, and a KV-cache pool dynamically serves whichever requests are actually active (arXiv:2606.24506). Attention and its KV-cache live in the second pool; the FFN, which dominates MoE parameter count, lives in the first (arXiv:2606.24506, full text).

Because the FFN weights are the bulky resident, pooling them across many underused models is where the memory savings concentrate: the same weight memory that was reserved per model now backs the whole cold fleet, freeing VRAM for the KV-cache where the contention actually is. The KV-cache pool, no longer forced to coexist with pinned weights, can grow to the capacity that active long-context requests need.
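The gap between reserving per model and serving from a shared pool is easy to see with back-of-envelope arithmetic. The sketch below uses a made-up fleet of nine cold models and one hot one; the class, the function names, and every number are illustrative assumptions, not figures from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelLoad:
    name: str
    worst_case_kv_gib: float  # KV needed if this model's largest request arrives
    active_kv_gib: float      # KV its in-flight requests hold right now


def static_reservation_gib(models: List[ModelLoad]) -> float:
    """KV memory set aside when every model reserves its own worst case."""
    return sum(m.worst_case_kv_gib for m in models)


def pooled_demand_gib(models: List[ModelLoad]) -> float:
    """KV memory a shared pool must back for the requests actually in flight."""
    return sum(m.active_kv_gib for m in models)


# Made-up fleet: nine cold models and one hot one.
fleet = [ModelLoad(f"cold-{i}", 16.0, 0.5) for i in range(9)]
fleet.append(ModelLoad("hot-0", 16.0, 12.0))

print(static_reservation_gib(fleet))  # 160.0
print(pooled_demand_gib(fleet))       # 16.5
```

A real pool would need headroom above the in-flight demand to absorb a cold model waking up, so the second number is a floor rather than a sizing recommendation.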

The paper lists three mechanisms that make the split pay off rather than reshuffle the problem. A KV-cache planner and virtualizer performs offline budget and parallelism planning with virtualized paging. A layer-wise pipeline scheduler overlaps hidden-state transfers with compute across in-flight batches. Persistent kernels with control lowering cut the CPU-to-GPU control overhead that eats throughput in small-batch regimes (arXiv:2606.24506, full text). The first two make the pools cooperate; the third ensures you do not pay a synchronization tax for the privilege of having split them.

Hidden states, not KV-cache: why isn’t this prefill-decode disaggregation?

The boundary CrossPool draws is inside the transformer layer, between attention and the FFN, so what crosses between pools is a hidden-state tensor, not a KV-cache tensor (arXiv:2606.24506, full text). Attention stays local to the KV-cache pool because that is where the keys and values it reads live. The FFN executes in the weights pool because that is where the weight matrices it multiplies live. Each pool does the half of the layer it is positioned to do without dragging the other pool’s memory across the wire.
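The cut can be sketched in a few lines of plain Python. This is a toy, not CrossPool's implementation: the class and function names are hypothetical, attention uses identity projections and a single head, and it assumes the FFN output travels back to the KV-cache pool for the next layer's attention.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


class KVPool:
    """Toy stand-in for the KV-cache pool: owns the cache, runs attention."""

    def __init__(self) -> None:
        self.cache: Dict[str, List[Tuple[Vector, Vector]]] = {}

    def attention(self, request_id: str, hidden: Vector) -> Vector:
        # Identity projections keep the sketch short: key = value = hidden.
        entries = self.cache.setdefault(request_id, [])
        entries.append((hidden, hidden))
        scores = [math.exp(sum(q * k for q, k in zip(hidden, key)))
                  for key, _ in entries]
        total = sum(scores)
        mixed = [sum(s / total * value[i] for s, (_, value) in zip(scores, entries))
                 for i in range(len(hidden))]
        return [h + m for h, m in zip(hidden, mixed)]  # residual connection


class WeightsPool:
    """Toy stand-in for the weights pool: owns FFN matrices for many models."""

    def __init__(self, ffn: Dict[str, List[Vector]]) -> None:
        self.ffn = ffn

    def feed_forward(self, model: str, hidden: Vector) -> Vector:
        rows = self.ffn[model]
        out = [max(0.0, sum(w * h for w, h in zip(row, hidden))) for row in rows]
        return [h + o for h, o in zip(hidden, out)]  # residual connection


def run_layer(kv_pool: KVPool, weights_pool: WeightsPool, model: str,
              request_id: str, hidden: Vector) -> Tuple[Vector, int]:
    """Run one layer and count the floats that cross the pool boundary."""
    after_attention = kv_pool.attention(request_id, hidden)
    crossed = len(after_attention)  # hidden state sent to the weights pool
    after_ffn = weights_pool.feed_forward(model, after_attention)
    crossed += len(after_ffn)       # hidden state sent back (assumed round trip)
    return after_ffn, crossed


kv_pool = KVPool()
weights_pool = WeightsPool({"model-a": [[1.0, 0.0], [0.0, 1.0]]})
output, crossed = run_layer(kv_pool, weights_pool, "model-a", "req-1", [1.0, 0.0])
print(output, crossed)
```

The cache only ever grows inside `KVPool`, the matrices only ever live inside `WeightsPool`, and the only data counted at the boundary is a hidden-state vector of fixed width.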

This is a different cut from the disaggregation pattern production serving stacks have been converging on. Prefill-decode disaggregation, as implemented in NVIDIA Dynamo and examined in Groundy’s piece on prefill-decode disaggregation, splits the request lifecycle across phases and transfers KV-cache between them. CrossPool splits a layer across memory classes and transfers hidden states. Same family of idea, separating the things with different lifetimes and footprints, but a different axis. The two are compatible rather than competing; a system could disaggregate both ways.

How real is the 10.4× time-between-tokens claim?

The 10.4× is real but narrow: it is an “up to” figure measured against a kvcached-based multi-LLM serving system, and the paper itself concedes only competitive performance on balanced workloads (arXiv:2606.24506).

Two things limit how far that number travels. The large gains concentrate in the cold, long-context, low-concurrency regime that motivated the design in the first place; a deployment whose traffic is hot and evenly distributed across models should expect the “competitive” result, not the headline (arXiv:2606.24506, full text). The baseline matters too. The comparison is against kvcached specifically, not against a production disaggregated stack with RDMA and a tuned transfer path. kvcached inherits the monolithic-pool coupling CrossPool is built to escape (arXiv:2606.24506, full text), so some of the gap is plausibly the baseline’s known weakness rather than CrossPool’s intrinsic edge. Without head-to-head numbers against a Dynamo-class system in the same regime, the absolute magnitude is hard to place.

The first secondary write-up, from Machine Brief on 24 Jun 2026, repeated the 10.4× figure as a “breakthrough for MoE model efficiency” without stress-testing either qualifier. That is the fast-follow coverage to expect: abstract-restating, regime-blind, and quick to blur CrossPool with generic disaggregated serving. The arXiv primary is the source to anchor on.

What does the transfer path cost?

Disaggregating two memory pools means moving a tensor between them on every layer of every active request, and that movement is the tax prefill-decode serving already pays in production. The NVIDIA Dynamo documentation is explicit about where the bottleneck lands: without RDMA the system falls back to TCP, and KV-cache movement dominates time-to-first-token and throughput. CrossPool’s hidden-state boundary inherits the same economics on a different tensor and a different cadence.
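To make "a different tensor and a different cadence" concrete, the sketch below sizes one KV handoff and one hidden-state crossing separately and counts how often the latter happens. The model shape (32 layers, 8 KV heads of dimension 128, hidden size 4096, 2-byte elements) is a hypothetical example, the function names are made up, and the two-crossings-per-layer figure assumes the hidden state goes to the FFN and comes back.

```python
def kv_handoff_bytes(prompt_tokens: int, num_layers: int, num_kv_heads: int,
                     head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes in one prefill-to-decode KV handoff: keys and values, all layers."""
    return prompt_tokens * num_layers * 2 * num_kv_heads * head_dim * bytes_per_elem


def hidden_crossing_bytes(batch_tokens: int, hidden_size: int,
                          bytes_per_elem: int = 2) -> int:
    """Bytes in one hidden-state crossing between pools for a batch of tokens."""
    return batch_tokens * hidden_size * bytes_per_elem


def crossings_per_step(num_layers: int, crossings_per_layer: int = 2) -> int:
    """Crossings in one forward step, assuming a round trip on every layer."""
    return num_layers * crossings_per_layer


if __name__ == "__main__":
    print(kv_handoff_bytes(8192, 32, 8, 128))  # one large transfer per request
    print(hidden_crossing_bytes(1, 4096))      # one small transfer...
    print(crossings_per_step(32))              # ...repeated many times per step
```

With these assumed shapes the KV handoff is a single transfer of about 1 GiB for an 8192-token prompt, while a single-token decode step makes 64 crossings of 8 KiB each. The first pattern stresses bandwidth; the second is far more sensitive to per-transfer latency.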

CrossPool’s countermeasure is its layer-wise pipeline scheduler, which overlaps the hidden-state transfer for one batch with compute on the batches already in flight, so the network and the compute units stay busy instead of one waiting on the other (arXiv:2606.24506, full text). That works to the extent the transfers can be hidden behind compute, which depends on available batch depth and interconnect bandwidth. On a fabric too slow for the transfers to hide behind compute, the exposed transfer time can outweigh the memory contention the split removes, and disaggregation becomes a net cost rather than a win.
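The dependence on batch depth and bandwidth can be illustrated with an idealized steady-state model. It assumes one compute unit and one link, fixed per-step compute and transfer times, and each batch alternating between the two. This is a textbook two-resource pipeline bound, not the paper's scheduler, and the function name and timings are hypothetical.

```python
def step_time_per_batch(compute_ms: float, transfer_ms: float,
                        batches_in_flight: int) -> float:
    """Effective time per layer step per batch in an idealized two-stage pipeline."""
    if batches_in_flight < 1:
        raise ValueError("need at least one batch in flight")
    # In one cycle every batch advances one step. The cycle is bounded by the
    # busier resource and by one batch's own compute-then-transfer chain.
    cycle = max(batches_in_flight * compute_ms,
                batches_in_flight * transfer_ms,
                compute_ms + transfer_ms)
    return cycle / batches_in_flight


if __name__ == "__main__":
    print(step_time_per_batch(1.0, 0.5, 1))  # 1.5: nothing to overlap with
    print(step_time_per_batch(1.0, 0.5, 2))  # 1.0: transfer hidden behind compute
    print(step_time_per_batch(0.5, 1.0, 4))  # 1.0: link-bound, depth cannot help
```

With a single batch the transfer is fully exposed. A second batch hides it when the link is faster than the compute, and when the link is the slower resource no amount of batch depth brings the step time below the transfer time.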

What changes for multi-tenant MoE serving economics?

CrossPool relocates the binding constraint rather than removing it. It moves the bottleneck from per-model VRAM residency to the interconnect that ships hidden states between the pools on every layer, and that is a real change in where capacity planning has to land. A cluster that was GPU-memory-bound on cold checkpoints can become network-bound on the disaggregation path, and which of those is the cheaper problem to solve is an operator-specific question.

The contrast with other MoE-serving economics work sharpens what CrossPool is and is not. Placing MoE experts on AWS Lambda with Bayesian expert-popularity prediction cut the billed cost of all MoE layers by at least 75.67% versus CPU clusters (arXiv:2501.05313). That is a different lever: placement and billing model, not memory pooling. The two are complementary. One asks where an expert should run to minimize cost; the other asks how weight and KV memory should be organized so the active request can reach it.

What CrossPool does not do is make cold models free. It makes their co-residency cheaper by consolidating the memory that was reserved per model, and it pays for that with a transfer path that has to keep up. The durable insight, the part that outlives this paper’s specific numbers, is the lifetime mismatch: static weights and transient KV-cache live on different timescales, and provisioning for the worst case of each, per model, inside one pool is the structural mistake. CrossPool is one particular way to stop making it. Whether it is the right way for a given cluster comes down to the same wire the rest of disaggregated serving lives or dies on: how fast you can move the bytes.

Frequently Asked Questions

Does the pooling pattern help for dense, non-MoE checkpoints?

Less so. The memory savings concentrate in the FFN, and MoE inflates that resident footprint with many expert matrices per layer, most of them idle for any given token, which a dense model simply does not carry. Dense multi-tenant serving already has mature continuous-batching and paged-KV stacks, so importing CrossPool’s per-layer hidden-state transfer there adds coordination cost against a much smaller consolidation upside.

How should readers compare the 10.4× latency figure to the 75.67% cost cut in the serverless MoE study?

They sit on different axes and cannot be ranked. The 75.67% reduction from arXiv:2501.05313 measures billed cost of MoE layers placed on AWS Lambda against a CPU-cluster baseline, while CrossPool’s 10.4× measures P99 time-between-tokens against kvcached on GPU. One is a billing ratio against a weak baseline, the other a latency ratio within the same device class.

Which attention designs make CrossPool’s algorithmic win smaller?

The win the paper targets depends on KV-head-limited attention, where MLA, MQA, or sub-group GQA expose only a half or quarter of replicated KV capacity to one request. An architecture that reverts to full multi-head attention with head count at or above the data-parallel group size would make per-replica KV fully reachable, dissolving the ‘fat on paper, unusable’ failure and shrinking that component of the gain.

Why are persistent kernels with control lowering load-bearing for this design?

Cold traffic means small batches, and small batches make per-launch CPU-to-GPU control overhead a large fraction of each layer’s wall-clock time. The layer-split boundary multiplies the number of kernel boundaries per forward pass, so without persistent kernels keeping the stream resident and control lowering moving scheduling off the CPU critical path, that overhead would compound and eat the pooling savings.
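A rough ratio shows why this matters at small batch sizes. The sketch assumes launches are serialized with compute rather than overlapped, and every number in it is a hypothetical placeholder, not a measurement from the paper.

```python
def control_overhead_fraction(kernels_per_layer: int, launch_overhead_us: float,
                              compute_us_per_layer: float) -> float:
    """Share of a layer's wall-clock time spent on launch overhead (toy model)."""
    overhead = kernels_per_layer * launch_overhead_us
    return overhead / (overhead + compute_us_per_layer)


if __name__ == "__main__":
    # Hypothetical small-batch layer: 20 us of useful compute, 5 us per launch.
    print(control_overhead_fraction(4, 5.0, 20.0))  # 0.5
    # Splitting the layer adds kernel boundaries; here they double.
    print(control_overhead_fraction(8, 5.0, 20.0))
```

In this toy setting, doubling the kernel boundaries pushes the overhead share from one half to two thirds of the layer time, which is the compounding that persistent kernels are meant to avoid.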

Does CrossPool require an RDMA-class interconnect to be net positive?

The paper does not mandate it, but the prefill-decode precedent set by NVIDIA Dynamo is a warning: on that path, absence of RDMA forces a TCP fallback and KV movement dominates time-to-first-token. CrossPool moves a smaller tensor per transfer (hidden states, not full KV) on a different cadence (every layer, not per phase), so the RDMA threshold is plausibly lower, but a cluster limited to PCIe or TCP-only fabrics should expect the layer-wise scheduler to struggle to hide transfers in cold, low-batch traffic.