DMax Hits 1,338 Tokens/Sec on 2x H200: Parallel Decoding Pushes dLLM Serving Past the Autoregressive Bar

Chen et al.’s revised DMax paper, posted to arXiv on May 15, reports 1,338 tokens per second¹ on two H200 GPUs at batch=1 using Soft Parallel Decoding. The result requires retraining LLaDA-2.0-mini with On-Policy Uniform Training, which reformulates diffusion LLM (dLLM) inference as progressive embedding refinement rather than binary mask-to-token commitment. For platform teams, the significant claim is that parallel dLLM decoding may no longer force a speed-quality trade-off. That raises the question of whether serving stacks built around autoregressive left-to-right assumptions can accommodate span-wise iterative refinement.

What Changed on May 15: The v3 Revision

The third version of DMax¹ arrived on May 15 with finalized benchmark tables and expanded ablations, making the paper citable in a way the earlier preprints were not. Chen et al. at NUS xML Lab introduce two mechanisms: On-Policy Uniform Training (OPUT) and Soft Parallel Decoding (SPD). The authors also released training and evaluation code² along with 16B specialist models (DMax-Math-16B and DMax-Coder-16B).¹ Notably, the inference path depends on specific versions of existing serving engines: sglang==0.5.3.post1 and vllm==0.10.2 for the base LLaDA-2.0-mini architecture. This is not a drop-in speedup that applies to arbitrary diffusion models; it is a full-stack retraining recipe tied to a specific model family.

Traditional diffusion LLMs generate text by starting with masked positions and progressively unmasking tokens in discrete steps. Each step commits to specific tokens, and parallel decoding strategies attempt to predict multiple positions simultaneously, which risks ignoring inter-token dependencies. DMax’s Soft Parallel Decoding changes the unit of iteration from the token to the embedding. Instead of flipping masks to tokens, SPD treats the sequence as a continuous field that undergoes iterative self-refinement. The model gradually sharpens embedding representations across the full span until they converge to discrete tokens. This reframing is why DMax can decode in parallel without the quality collapse that ParallelBench³ documented in prior approaches.
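The idea of refining embeddings rather than flipping masks can be illustrated with a toy loop. This is a minimal sketch, not the authors' implementation: `toy_denoiser`, `soft_refine`, the `sharpness` constant, and the `commit_prob` stopping rule are all hypothetical stand-ins, and the "model" is a single matrix product rather than a trained dLLM. It only shows the control flow: every position keeps a soft distribution, the soft embedding is fed back in, and the whole span sharpens together.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def toy_denoiser(soft_emb, token_emb, prior_logits, sharpness=4.0):
    """Hypothetical stand-in for a forward pass: score each position
    against every token embedding and add a fixed per-position prior."""
    return sharpness * soft_emb @ token_emb.T + prior_logits

def soft_refine(token_emb, prior_logits, max_steps=20, commit_prob=0.9):
    """Refine a whole span in parallel until every position is confident."""
    span_len, vocab = prior_logits.shape
    # Start fully uncertain, playing the role of an all-mask span.
    probs = np.full((span_len, vocab), 1.0 / vocab)
    for step in range(1, max_steps + 1):
        soft_emb = probs @ token_emb  # continuous embedding per position
        probs = softmax(toy_denoiser(soft_emb, token_emb, prior_logits))
        if probs.max(axis=-1).min() >= commit_prob:
            break  # every position has sharpened enough to discretize
    return probs.argmax(axis=-1), step

if __name__ == "__main__":
    vocab, targets = 5, [1, 3, 0, 2]
    prior = np.zeros((len(targets), vocab))
    prior[np.arange(len(targets)), targets] = 2.0  # mild preference per position
    tokens, steps = soft_refine(np.eye(vocab), prior)
    print(tokens.tolist(), "after", steps, "refinement passes")
```

Note that no position is ever "committed" mid-loop: discrete tokens only appear when the final `argmax` is taken, which is the contrast with mask-to-token decoding that the paragraph above describes.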

The Numbers: Where DMax Wins and Where It Holds Steady

On LLaDA-2.0-mini with DMax applied, GSM8K tokens per forward pass (TPF) rise from 2.04 to 5.48 while accuracy holds at 92.1%, a 0.5-point drop from the 92.6% baseline.¹ MBPP improves from 2.71 to 5.86 TPF at 79.2% accuracy, down 1.4 points from 80.6%.¹ HumanEval-Instruct reaches 7.36 TPF at 1,557 TPS.¹ The headline 1,338 tok/s figure is not a peak:¹ the paper reports a range of 1,258 to 1,557 TPS depending on the benchmark,¹ and the headline number sits inside that range.
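The ratios behind these figures are easy to check by hand. The sketch below recomputes the TPF gain and accuracy change from the numbers quoted above. The helper names are illustrative, and `forward_passes_per_second` rests on an assumption about the metric: that TPF counts tokens produced per model forward pass, so that TPS equals TPF times forward passes per second.

```python
# Figures quoted in the article: (before, after) for TPF and accuracy.
REPORTED = {
    "GSM8K": {"tpf": (2.04, 5.48), "acc": (92.6, 92.1)},
    "MBPP": {"tpf": (2.71, 5.86), "acc": (80.6, 79.2)},
}

def summarize(reported):
    """Return {benchmark: (TPF gain ratio, accuracy change in points)}."""
    summary = {}
    for name, row in reported.items():
        tpf_before, tpf_after = row["tpf"]
        acc_before, acc_after = row["acc"]
        summary[name] = (
            round(tpf_after / tpf_before, 2),
            round(acc_after - acc_before, 1),
        )
    return summary

def forward_passes_per_second(tps, tpf):
    """Assumption: TPS = TPF x forward passes per second."""
    return tps / tpf

if __name__ == "__main__":
    for name, (gain, delta) in summarize(REPORTED).items():
        print(f"{name}: {gain}x TPF, {delta:+.1f} accuracy points")
    rate = forward_passes_per_second(1557, 7.36)
    print(f"HumanEval-Instruct: about {rate:.1f} forward passes per second")
```

Under that assumption, the gains are roughly 2.7x on GSM8K and 2.2x on MBPP, and the HumanEval-Instruct row implies a little over 200 forward passes per second on the two-GPU setup.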

With the decoding threshold τ_dec set to 0, baseline LLaDA-2.0-mini collapses to 15.2% on MATH500 and 2.3% on MBPP.¹ DMax maintains 71.6% and 79.2% respectively.¹

Why ParallelBench’s Critique Now Has a Concrete Counter-Example

ParallelBench,³ accepted at ICLR 2026, evaluated parallel decoding across 17 tasks and concluded that current strategies fail to achieve speedup without compromising quality. The authors argued that parallel dLLM decoding ignores token dependencies in ways that fundamentally limit its usefulness. DMax does not refute ParallelBench’s methodology, but it provides a specific counter-example: on LLaDA-2.0-mini, Soft Parallel Decoding achieves substantial TPF gains while keeping accuracy within about 1.5 percentage points of the original LLaDA-2.0-mini baseline on GSM8K and MBPP. The critique is not invalidated, but it is no longer the last word. The burden of proof has shifted to reproducing these results at larger scales and on independent benchmarks.

The Infrastructure Angle: Can Your Serving Stack Handle Span-Wise Decoding?

For teams running prefill-decode-disaggregated stacks on vLLM or SGLang, DMax raises an architectural question that is more urgent than the exact tok/s number. Autoregressive serving engines assume left-to-right token emission. Their schedulers, continuous batching logic, and KV-cache allocators are built around the invariant that each forward pass appends one token (or a small speculative bundle) to a growing prefix. DMax’s span-wise iterative refinement violates that invariant. A full sequence of embeddings is refined in parallel across multiple steps, with no clear analogue to the prefill-decode boundary that disaggregated systems use to separate prompt processing from token generation.
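The append-only invariant can be made concrete with a toy model of which sequence positions each forward pass writes. This is a sketch under stated assumptions, not code from vLLM or SGLang: the three function names are hypothetical, and real schedulers track far more state than a list of positions.

```python
def autoregressive_writes(prompt_len, new_tokens):
    """Positions written per forward pass: exactly one, at the end of the prefix."""
    for i in range(new_tokens):
        yield [prompt_len + i]

def span_refinement_writes(prompt_len, span_len, passes):
    """Positions written per pass when the whole span is revised every pass."""
    span = list(range(prompt_len, prompt_len + span_len))
    for _ in range(passes):
        yield span

def is_append_only(writes_per_pass, prompt_len):
    """True if every pass writes only the next unseen position,
    the invariant an autoregressive scheduler can rely on."""
    next_pos = prompt_len
    for written in writes_per_pass:
        if written != [next_pos]:
            return False
        next_pos += 1
    return True

if __name__ == "__main__":
    print(is_append_only(autoregressive_writes(8, 4), 8))      # True
    print(is_append_only(span_refinement_writes(8, 4, 3), 8))  # False
```

Anything in a serving engine that implicitly calls a check like `is_append_only`, such as a cache allocator that only ever extends a prefix, is a candidate for rework when the workload rewrites a span on every pass.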

SGLang⁴ currently offers the most mature dLLM serving path, having shipped block-wise dLLM support for LLaDA 2.0 with KV cache, CUDA graph optimization, and threshold-based parallel decoding. In a December 2025 demo,⁵ SGLang achieved 935 tok/s on quicksort and 500 tok/s sustained with CAP training, claiming up to 1.9× speedup over comparable autoregressive baselines. Its 2026 S1 roadmap lists non-block dLLMs, request early exit, and disaggregation via AFD as pending. vLLM, by contrast, lacks native non-block dLLM scheduling.

What to Watch: Larger Models, Batching, and Scheduler Redesign

Three uncertainties stand between DMax’s research result and production relevance. First, scaling: the v3 paper evaluates only LLaDA-2.0-mini. Whether OPUT stabilizes training and SPD maintains its speed-quality profile on 70B-parameter models or mixture-of-experts architectures is unverified.¹ Second, batching: the reported throughput is at batch=1.¹ Continuous batching and dynamic batching behavior for span-wise refinement are not characterized, and memory allocation patterns for iterative embedding updates differ from autoregressive KV-cache growth. Third, scheduler assumptions: prefill-decode disaggregation, now standard for large autoregressive deployments, may not translate directly to diffusion models that refine entire spans. SGLang’s roadmap acknowledges this by planning disaggregation via AFD rather than porting the prefill-decode split. Platform teams should expect that accommodating high-throughput dLLM serving will require re-examining scheduler design and memory allocators from first principles, not just swapping in a new kernel.

Frequently Asked Questions

How does DMax’s SPD differ from SGLang’s existing parallel decoding for LLaDA 2.0?

SGLang’s shipping implementation uses CAP training with a 0.95 confidence threshold decoder that commits tokens once they cross a probability boundary. DMax’s SPD replaces binary threshold gating with iterative refinement of continuous embeddings across the entire span, which requires full model retraining via OPUT rather than a serving-layer configuration change. The two approaches target the same bottleneck but at different abstraction levels.
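For contrast, threshold-gated commitment can be sketched in a few lines. This is an illustrative toy, not SGLang's code: `threshold_step` and the `MASK` sentinel are hypothetical names, and the fallback that commits the single most confident position when nothing clears the bar is an assumption added so the loop always makes progress.

```python
MASK = None  # hypothetical sentinel for a still-masked position

def threshold_step(tokens, predictions, confidences, threshold=0.95):
    """One decoding step: commit masked positions whose confidence
    meets the threshold, leaving the rest masked for a later step."""
    masked = [i for i, tok in enumerate(tokens) if tok is MASK]
    if not masked:
        return list(tokens)
    accept = [i for i in masked if confidences[i] >= threshold]
    if not accept:
        # Assumed fallback: commit the most confident position anyway.
        accept = [max(masked, key=lambda i: confidences[i])]
    out = list(tokens)
    for i in accept:
        out[i] = predictions[i]
    return out

if __name__ == "__main__":
    tokens = [MASK, "a", MASK, MASK]
    preds = ["x", "a", "y", "z"]
    conf = [0.97, 1.0, 0.50, 0.96]
    print(threshold_step(tokens, preds, conf))  # ['x', 'a', None, 'z']
```

Once a position is written here it is never revisited, which is the binary commitment that SPD's continuous refinement is described as replacing.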

What’s the practical barrier to running DMax on an existing vLLM deployment?

DMax’s inference path is pinned to sglang==0.5.3.post1 and vllm==0.10.2, and vLLM lacks native non-block dLLM scheduling, so DMax cannot simply be added as a backend to a current vLLM deployment. Teams running vLLM for autoregressive serving would need to operate a separate SGLang instance pinned to that specific version alongside their existing stack.
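A quick way to see whether an environment matches the pins named in this article is to compare installed package versions against them. The sketch below uses the standard library's `importlib.metadata`. The `PINS` dictionary copies the two versions quoted in the article, while `check_pins` and its `get_version` parameter are hypothetical helpers, not part of the DMax release.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in the article for the DMax inference path.
PINS = {"sglang": "0.5.3.post1", "vllm": "0.10.2"}

def check_pins(pins, get_version=version):
    """Return {package: installed version or None} for every mismatch."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems[name] = found
    return problems

if __name__ == "__main__":
    mismatches = check_pins(PINS)
    if mismatches:
        for name, found in mismatches.items():
            print(f"{name}: wanted {PINS[name]}, found {found}")
    else:
        print("environment matches the pinned versions")
```

Running this inside an existing serving image makes the isolation cost visible: any mismatch means the DMax path needs its own environment rather than sharing the one already in production.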

Do the DMax-Math-16B and DMax-Coder-16B specialist models carry the same throughput claims?

All benchmark tables and the 1,338 tok/s figure in the v3 paper are measured on LLaDA-2.0-mini exclusively. The 16B specialist models have publicly released weights and training code, but the paper does not report throughput or accuracy for them, leaving their production performance uncharacterized.

How does SPD change memory allocation compared with autoregressive decoding?

Autoregressive serving allocates KV-cache incrementally as each token is appended, so memory grows roughly linearly with generated sequence length. SPD allocates the full sequence embedding matrix upfront and refines it across multiple passes, meaning peak memory is committed before the first token is finalized. This changes both the memory ceiling and the fragmentation profile that continuous batching schedulers depend on.
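The difference in allocation shape can be sketched with back-of-the-envelope arithmetic. Everything here is an assumption for illustration: the layer and head counts are made up, the function names are hypothetical, and the per-token cost of refinement state is simply set equal to the KV-cache cost, which the paper does not establish.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def autoregressive_profile(prompt_len, new_tokens, per_token):
    """Reserved bytes after each decode step: grows by one token per step."""
    return [(prompt_len + t) * per_token for t in range(1, new_tokens + 1)]

def upfront_profile(prompt_len, span_len, passes, per_token):
    """Reserved bytes per refinement pass when the full span is held from pass one."""
    return [(prompt_len + span_len) * per_token] * passes

if __name__ == "__main__":
    per_token = kv_bytes_per_token(layers=2, kv_heads=2, head_dim=4)  # toy sizes
    ar = autoregressive_profile(prompt_len=10, new_tokens=3, per_token=per_token)
    up = upfront_profile(prompt_len=10, span_len=3, passes=2, per_token=per_token)
    print("autoregressive:", ar)
    print("upfront span:  ", up)
```

In this toy model both approaches reach the same peak, but the upfront profile sits at that peak from the first pass, so a batching scheduler cannot count on early steps of a request being cheap.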

Why does SGLang plan dLLM disaggregation via AFD instead of the standard prefill-decode split?

Diffusion models lack a clean prefill-decode boundary because every refinement step processes the full sequence simultaneously rather than transitioning from prompt processing to token-by-token generation. AFD, which in SGLang’s roadmap refers to attention-FFN disaggregation, splits a model’s attention and feed-forward computation across hardware instead of splitting a request into prompt and generation phases, so it does not depend on the sequential boundary that autoregressive models provide. Existing prefill-decode disaggregation infrastructure is therefore unlikely to be reusable for dLLM serving without architectural redesign.