Ulysses sequence parallelism is the standard recipe for scaling diffusion transformer (DiT) inference across nodes, but it carries a hidden cost: the latency of the all-to-all collectives that shuffle attention states between devices grows non-linearly with node count on heterogeneous interconnects. CoCoDiff, submitted to arXiv on April 16, 2026 and revised April 21, 2026, identifies this bottleneck and proposes topology-aware scheduling and communication-reduction techniques that yield an average 3.6x speedup and a peak of 8.4x on the Aurora supercomputer’s Intel GPU tiles (CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism).
What Ulysses SP Promised — and Where the All-to-All Wall Appears
Ulysses sequence parallelism partitions the input along the sequence dimension and redistributes attention states via all-to-all collectives. The original DeepSpeed-Ulysses work showed that when sequence length and device count grow proportionally, per-device communication volume stays constant, enabling training of extreme long-sequence models with over 10x communication reduction and sustained throughput above 175 TFlops/GPU (DeepSpeed-Ulysses Blog and Documentation). That constant-volume property made Ulysses the default choice for scaling transformer attention across nodes.
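The constant-volume property can be sanity-checked with a toy calculation. The sequence lengths and hidden dimension below are illustrative assumptions, not figures from either paper:

```python
def per_device_all2all_bytes(seq_len, hidden_dim, num_devices, bytes_per_elem=2):
    """Approximate bytes each device sends in one Ulysses all-to-all.

    Each device holds seq_len / num_devices tokens of the hidden state and
    scatters that shard across all peers, so per-device traffic is roughly
    (seq_len / P) * hidden_dim elements -- constant as long as seq_len grows
    proportionally with P.
    """
    local_tokens = seq_len // num_devices
    return local_tokens * hidden_dim * bytes_per_elem

# Doubling devices while doubling sequence length keeps per-device volume flat.
v8 = per_device_all2all_bytes(seq_len=8192, hidden_dim=4096, num_devices=8)
v16 = per_device_all2all_bytes(seq_len=16384, hidden_dim=4096, num_devices=16)
print(v8 == v16)  # → True
```

Note what stays constant here: bytes sent per device, not total wall-clock time. That distinction is exactly where the next section's problem enters.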
The assumption, however, breaks down when the interconnect is not uniform. On heterogeneous or multi-hop networks, the all-to-all latency scales non-linearly with node count. Each attention layer in a DiT model requires a full all-to-all exchange of query, key, and value projections across all participating devices. As nodes are added, the time spent in collective communication eventually dominates the computation time, creating a wall where additional hardware yields diminishing or negative returns. CoCoDiff’s core observation is that this wall appears earlier for DiT inference than the Ulysses analysis suggests, because inference batch sizes and step counts amplify the per-step communication cost across hundreds of denoising iterations (CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism).
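A toy latency model makes the wall visible. The coefficients below are illustrative assumptions (not measurements from the paper): per-device compute shrinks as 1/P, while all-to-all latency carries a fixed cost plus a term that grows with node count on a multi-hop network.

```python
def step_times(P, compute_one_node=100.0, comm_base=5.0, comm_per_node=3.0):
    """Toy per-layer timing model, all values in illustrative milliseconds.

    Compute divides evenly across P nodes; all-to-all latency has a fixed
    cost plus a per-node term standing in for multi-hop / heterogeneous
    link effects.
    """
    compute = compute_one_node / P
    comm = comm_base + comm_per_node * P
    return compute, comm

# First node count where communication strictly dominates computation.
wall = next(P for P in range(1, 65) if step_times(P)[1] > step_times(P)[0])
print(wall)  # → 6 with these toy coefficients
```

Past that crossover, adding nodes shifts time from the shrinking compute term to the growing communication term, which is the "diminishing or negative returns" regime described above.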
Why DiT Inference Is Especially Vulnerable to Collective Communication Overhead
Diffusion transformers replace the U-Net backbone with a transformer operating on latent patches, with higher Gflops correlating directly to lower FID scores (Scalable Diffusion Models with Transformers (DiT)). This predictable scaling behavior means DiT serving clusters are designed to add nodes to reduce per-sample latency or increase throughput. But DiT inference is not batched training: each denoising step is a full forward pass, and a typical generation run involves tens to hundreds of steps. If each step incurs an all-to-all penalty, the aggregate communication time compounds.
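The compounding effect is simple arithmetic. With hypothetical (not paper-reported) numbers for layer count, step count, and per-layer all-to-all time:

```python
# Illustrative numbers, not from the paper: a 28-layer DiT, 50 denoising
# steps, 2 ms of all-to-all time per attention layer.
layers, steps, all2all_ms = 28, 50, 2.0

per_step_comm = layers * all2all_ms  # 56 ms of communication per forward pass
total_comm = steps * per_step_comm   # 2800 ms of communication per generated sample
print(total_comm)
```

Unlike training, where a large batch amortizes each exchange, every one of those milliseconds lands on the latency of a single generated sample.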
Unlike large language model (LLM) inference, where techniques like prefill-decode disaggregation and speculative decoding can hide communication behind computation, DiT attention requires the full sequence state at every layer. There is no equivalent of a “small draft model” for diffusion steps, and the spatial coherence requirements of latent patches make sequence partitioning non-trivial to skip. The result is that DiT inference clusters are communication-bound earlier in their scaling curve than LLM clusters of comparable size.
CoCoDiff’s Three Mechanisms: TAPA, V-First, and V-Major
CoCoDiff introduces three mechanisms to reduce and reshape the all-to-all traffic (CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism):
Tile-Aware Parallel All-to-All (TAPA) restructures the collective communication to match the physical topology of the interconnect. On Aurora’s Slingshot-11 network, TAPA maps communication patterns to the node-local GPU tiles before crossing node boundaries, reducing the volume of traffic that traverses the slower inter-node links. The topology-aware approach contrasts with the default Ulysses implementation, which treats all devices as a flat pool.
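The paper's exact TAPA mapping is not reproduced here, but the general idea can be sketched as a generic two-level all-to-all: tiles reshuffle within a node over fast local links first, then exchange one aggregated message per node pair, rather than every tile messaging every remote tile individually.

```python
def flat_internode_msgs(num_nodes, tiles_per_node):
    """Messages crossing node boundaries in a flat all-to-all: every tile
    sends a separate message to every tile on every other node."""
    total_tiles = num_nodes * tiles_per_node
    remote_peers = total_tiles - tiles_per_node
    return total_tiles * remote_peers

def hierarchical_internode_msgs(num_nodes):
    """Two-level scheme: after intra-node aggregation over fast local links,
    one combined message travels between each ordered pair of nodes."""
    return num_nodes * (num_nodes - 1)

# Hypothetical Aurora-like shape: 8 nodes, 12 GPU tiles each (96 tiles total).
print(flat_internode_msgs(8, 12), hierarchical_internode_msgs(8))  # → 8064 56
```

The win in this sketch is fewer, larger transfers over the slow inter-node links; the actual TAPA mechanism also maps the schedule onto the specific Slingshot-11 topology, which this sketch does not attempt.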
V-First scheduling exploits an asymmetry in the Q/K/V projection pipeline. The value (V) tensor is ready immediately after its linear projection, whereas query (Q) and key (K) must additionally pass through normalization and rotary position embedding (RoPE) before the attention computation. By scheduling V’s all-to-all exchange ahead of Q and K, CoCoDiff overlaps V’s communication with the Q/K preprocessing, hiding latency that would otherwise sit on the critical path.
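The savings can be read off a critical-path model. The millisecond values below are illustrative assumptions, not measurements:

```python
def critical_path_ms(qk_prep, comm_each, v_first):
    """Toy critical path for one attention layer (illustrative ms).

    Baseline: preprocess Q/K (norm + RoPE), then run the Q, K, and V
    all-to-alls back to back. V-First: launch V's all-to-all immediately --
    V needs no preprocessing -- so it overlaps with Q/K prep, leaving only
    the Q and K exchanges serialized behind it.
    """
    if not v_first:
        return qk_prep + 3 * comm_each
    return max(qk_prep, comm_each) + 2 * comm_each

baseline = critical_path_ms(qk_prep=4.0, comm_each=3.0, v_first=False)    # 13.0
overlapped = critical_path_ms(qk_prep=4.0, comm_each=3.0, v_first=True)   # 10.0
print(baseline, overlapped)
```

The hidden time is bounded by min(V's communication, Q/K preprocessing), so the technique helps most when the two are comparable in length.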
V-Major selective communication filters redundant V tensors across adjacent denoising steps. Temporal redundancy exists because consecutive diffusion steps operate on similar latent states; CoCoDiff identifies and suppresses all-to-all transfers for V values that have not changed meaningfully between steps. This is a lossy optimization, but the paper bounds the approximation error to preserve generation quality.
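A minimal sketch of the skip decision, assuming a relative-change threshold as the criterion (the paper's actual error bound may use a different metric; the tolerance value here is a hypothetical parameter):

```python
import math

def should_send(v_prev, v_curr, tol=0.05):
    """Decide whether a V shard's all-to-all can be skipped this step.

    Skips the transfer when the relative L2 change since the previous
    denoising step falls below `tol` -- an illustrative stand-in for the
    paper's bounded-approximation criterion.
    """
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(v_prev, v_curr)))
    norm = math.sqrt(sum(b * b for b in v_curr)) or 1.0
    return diff / norm >= tol

prev = [1.0, 2.0, 3.0]
print(should_send(prev, [1.001, 2.001, 3.001]))  # → False: nearly unchanged, skip
print(should_send(prev, [1.5, 2.5, 3.5]))        # → True: changed, send
```

When a shard is skipped, receivers reuse the previous step's copy, which is why the threshold directly trades throughput against generation quality.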
The Aurora Benchmarks: 3.6x Average, 8.4x Peak on Intel GPU Tiles
The CoCoDiff evaluation runs four DiT models across one to eight nodes on the Aurora supercomputer, using up to 96 Intel Data Center GPU Max (Ponte Vecchio) tiles (CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism; Aurora (supercomputer) - Wikipedia). Aurora’s 10,624 nodes are connected via the Slingshot-11 interconnect and deliver 1.012 exaFLOPS Rmax (Aurora (supercomputer) - Wikipedia), making it a relevant testbed for cross-node scaling at HPC scale.
Across the tested configurations, CoCoDiff achieves an average speedup of 3.6x over the baseline Ulysses implementation, with peak speedup reaching 8.4x (CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism). The 8.4x figure is a best-case scenario under favorable topology and model conditions, not a guarantee for all workloads. The average is arguably the more informative number for capacity planning: it reflects the expected improvement when communication patterns are irregular and temporal redundancy varies across denoising stages.
What This Means for Production DiT Serving Teams
Teams running DiT inference at scale currently have two options for cross-node parallelism: pipeline parallelism, which slices layers across nodes and suffers from bubble overhead, and Ulysses sequence parallelism, which keeps layers local but pays the all-to-all cost. CoCoDiff does not eliminate that choice, but it suggests that Ulysses without topology awareness is leaving significant throughput on the table.
The implication is that raw node count is the wrong optimization target. A cluster with fewer nodes but a better-mapped interconnect may outperform a larger, flat-topology cluster. For teams building or renting DiT serving infrastructure, this means interconnect topology should be treated as a first-class design constraint, not an afterthought. The TAPA mechanism in particular suggests that node-local GPU density and the routing efficiency between nodes matter as much as aggregate FLOPS.
Caveats: Intel GPUs, Slingshot-11, and the NVLink Question
The CoCoDiff results are specific to Intel’s Ponte Vecchio architecture and the Cray Slingshot-11 interconnect. NVIDIA GPU clusters using NVLink for intra-node and InfiniBand for inter-node communication have different latency and bandwidth profiles. TAPA’s topology-aware scheduling is conceptually portable, but the exact mappings and speedups will depend on the specific network topology and GPU tile layout of the target cluster.
The 8.4x peak speedup should not be used for ROI calculations. It is a peak observed under specific model and configuration combinations. The 3.6x average is a more defensible planning figure, but even that assumes an Aurora-like environment. Teams evaluating CoCoDiff for cloud deployments should expect to re-benchmark on their own interconnects before drawing conclusions about hardware procurement or cluster resizing.
Finally, V-Major’s selective communication introduces a quality-throughput tradeoff that the paper bounds but does not fully characterize across all DiT model families. Production teams will need to validate that the approximation preserves FID or other quality metrics for their specific use case before enabling it at scale.
Frequently Asked Questions
Does CoCoDiff apply to NVIDIA GPU clusters with NVLink and InfiniBand?
The mechanisms are conceptually portable, but the 3.6x average and 8.4x peak speedups were measured on Intel Ponte Vecchio tiles with Slingshot-11 interconnect. Teams should expect different absolute numbers and will need to re-benchmark on their own interconnects before making hardware decisions.
How does CoCoDiff differ from standard Ulysses sequence parallelism?
Standard Ulysses treats all devices as a flat pool and assumes constant communication volume. CoCoDiff adds Tile-Aware Parallel All-to-All to match physical topology, V-First scheduling to overlap V communication with Q/K preprocessing, and V-Major selective communication to filter redundant transfers across denoising steps.
What do teams need to change to adopt CoCoDiff in production?
As of April 2026, no production inference framework has adopted CoCoDiff. Teams will need to implement or port these optimizations themselves, including topology mapping for their specific interconnect and validation that V-Major’s approximation preserves generation quality for their use case.
Why is the 8.4x peak speedup not a reliable number for capacity planning?
The 8.4x figure is a best-case scenario under favorable topology and model conditions. The 3.6x average reflects expected improvement when communication patterns are irregular and temporal redundancy varies across denoising stages.
What makes DiT inference more vulnerable to communication overhead than LLM inference?
DiT inference requires the full sequence state at every attention layer with no equivalent of a draft model to hide communication behind computation. The spatial coherence requirements of latent patches also make sequence partitioning non-trivial to skip, so DiT clusters become communication-bound earlier in their scaling curve than LLM clusters of comparable size.