DuQuant++ closes the W4A4 MXFP4 accuracy gap on LLaMA-3-8B to 0.74 WikiText2 perplexity points and 2 percentage points on downstream tasks relative to FP16 (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv abstract)) — not by changing the quantization grid, but by aligning the rotation preprocessing step to the MXFP4 group size. With Blackwell GPUs shipping native FP4 Tensor Cores, this matters: the objection to FP4 in production is shifting from “the accuracy loss is too large” to “which rotation scheme do inference frameworks standardize on.”
The MXFP4 Outlier Trap: Why a Single Bad Channel Ruins the Block
MXFP4 groups activations into blocks of 32 values and applies a shared scale per block. The format’s dynamic range is roughly 16x narrower than FP8. For smooth weight distributions, that tradeoff is manageable. For the activation outliers that large transformers reliably produce — channels where values sit an order of magnitude above their neighbors — it is destructive. A single outlier forces the group scale up, clipping everything else in the block.
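The failure mode is easy to reproduce. Below is a simplified sketch of MXFP4-style block quantization: the FP4 E2M1 value grid with one shared power-of-two scale per 32-value block. The scale-selection rule here is an illustrative assumption, not the exact MXFP4 specification.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the MXFP4 element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(block):
    """Quantize a 32-value block with one shared power-of-two scale
    (simplified sketch of MXFP4 microscaling)."""
    amax = np.abs(block).max()
    # Smallest power-of-two scale that fits the block max under 6.0,
    # the largest FP4 magnitude.
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    scaled = block / scale
    # Round each value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
smooth = rng.normal(0, 1, 32)
err_smooth = np.abs(mxfp4_quantize(smooth) - smooth).mean()

outlier = smooth.copy()
outlier[0] = 40.0  # one channel far above its neighbors
# Measure the error on the *other* 31 values: the outlier inflates the
# shared scale, so everything else in the block gets crushed toward zero.
err_outlier = np.abs(mxfp4_quantize(outlier) - outlier)[1:].mean()

print(err_smooth, err_outlier)
```

Running this shows the mean error on the 31 non-outlier values jumping by roughly an order of magnitude once a single large channel sets the block scale.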
The original DuQuant (NeurIPS 2024 Oral (DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs (arXiv abstract))) addressed this with dual rotations: one to redistribute outliers across channels, plus a zigzag permutation to spread residual spikes. The method worked, but it left a structural mismatch between the rotation granularity and the MXFP4 group size. DuQuant++ closes that gap.
How DuQuant++ Simplifies the Rotation Pipeline (B=32 Alignment)
The core change is setting the rotation block size B=32 to match the MXFP4 microscaling group size (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv abstract)). Because rotation and quantization now operate at the same granularity, a single outlier-aware rotation per block is sufficient — replacing DuQuant’s dual-rotation-plus-zigzag pipeline with one transformation.
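A toy demonstration of why rotating at the quantization group size helps. The rotation below is a random orthogonal matrix, not the paper's outlier-aware construction (which is built from calibration data), and the quantizer is a crude uniform stand-in for MXFP4; both are illustrative assumptions.

```python
import numpy as np

B = 32  # rotation block size, matched to the MXFP4 group size

def quantize_block(x, levels=8):
    """Crude stand-in for MXFP4: one shared scale per block, 8 magnitude levels."""
    scale = max(np.abs(x).max() / (levels - 1), 1e-12)
    return np.round(x / scale) * scale

rng = np.random.default_rng(1)
# Random orthogonal B x B rotation via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(B, B)))

x = rng.normal(0, 1, B)
x[0] = 40.0  # outlier channel

# Quantize directly: the outlier blows up the shared scale.
err_direct = np.linalg.norm(quantize_block(x) - x)

# Rotate, quantize, rotate back. Q is orthogonal, so Q.T inverts it exactly
# and the L2 error in the rotated space equals the error after undoing it.
err_rotated = np.linalg.norm(Q.T @ quantize_block(Q @ x) - x)

print(err_direct, err_rotated)
```

Because the rotation spreads the outlier's energy across all 32 channels in the group, the shared scale shrinks and the quantization error drops. When the rotation block and the quantization group are the same size, this redistribution lands exactly where the scale is computed, which is the alignment DuQuant++ exploits.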
That alignment halves the online rotation cost compared to the original method (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv abstract)). The reduction is in preprocessing compute, not in end-to-end inference throughput on Blackwell hardware; the paper does not report throughput benchmarks on actual silicon. But a lower per-block rotation cost matters at high batch sizes and long contexts, where the transformation runs over substantially more data.
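The cost intuition can be made concrete with a back-of-envelope FLOP count. The hidden size and the 2mn multiply-add convention are assumptions for illustration; the paper does not report these figures.

```python
d = 4096   # assumed hidden dimension (LLaMA-3-8B-like)
B = 32     # rotation block size, aligned to the MXFP4 group

# A block-diagonal rotation multiplies each B-sized group by a B x B matrix:
# (d / B) blocks, each costing 2 * B * B FLOPs, per token.
flops_single = (d // B) * 2 * B * B

# DuQuant's original pipeline applies two such rotations (the zigzag
# permutation in between is essentially free), so dropping one halves the cost.
flops_dual = 2 * flops_single

# For comparison, a dense d x d global rotation would cost:
flops_full = 2 * d * d

print(flops_single, flops_dual, flops_full)
```

The block-diagonal structure is why online rotation is affordable at all: it is d/B = 128 times cheaper per token than a dense rotation at this hidden size, and the single-rotation pipeline halves what remains.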
Fewer transformation steps also reduce the surface area for calibration bugs and simplify integration with weight correction methods like GPTQ. The cleaner pipeline is a practical argument for DuQuant++ beyond the accuracy numbers alone.
Benchmarks: LLaMA-3 at W4A4 (Perplexity and Downstream Accuracy)
The paper benchmarks MXFP4 W4A4 (4-bit weights, 4-bit activations) on LLaMA-3 family models against FP16 baselines and prior rotation methods (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv HTML with tables)).
| Model | Method | WikiText2 PPL | C4 PPL | Avg Accuracy |
|---|---|---|---|---|
| LLaMA-3-8B | FP16 | 6.14 | 9.46 | 69.1% |
| LLaMA-3-8B | DuQuant++* | 6.88 | 11.06 | 67.1% |
| LLaMA-3.2-3B | QuaRot | 17.95 | — | — |
| LLaMA-3.2-3B | DuQuant++* | 8.63 | — | — |
* DuQuant++* denotes the method combined with GPTQ weight compensation.
The asterisk matters. The headline perplexity numbers require GPTQ post-hoc weight correction at calibration time (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv HTML with tables)). GPTQ adds no inference overhead, but it requires a calibration dataset and corrected weight storage per model variant. Teams running rotation-only (no GPTQ) will see higher perplexity; those results exist in the paper but are not the headline figures.
The LLaMA-3.2-3B comparison against QuaRot is the sharper result: 8.63 vs. 17.95 on WikiText2 (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv HTML with tables)) is more than a 50% relative perplexity improvement, suggesting alignment-based rotation is particularly valuable at smaller model scales where fixed-granularity methods struggle more with outlier concentration.
One gap in the evaluation: the paper does not benchmark against FP8 directly. The inference that W4A4 MXFP4 accuracy is “close to FP8” follows from how small the FP16 gap is — not from a head-to-head comparison. That inference is plausible given the numbers, but it is not in the data.
What Blackwell’s Native FP4 Path Means for Hopper Shops
Blackwell’s FP4 Tensor Cores use micro-tensor scaling, which NVIDIA describes as doubling the performance and model capacity that memory can support while maintaining accuracy (NVIDIA Blackwell Architecture). Blackwell Ultra extends that further — 2x attention-layer acceleration and 1.5x more AI compute FLOPS versus standard Blackwell, according to NVIDIA’s architecture documentation (NVIDIA Blackwell Architecture). Those figures come from NVIDIA’s own materials and have not been independently replicated in peer-reviewed benchmarks as of April 2026.
For teams currently running INT8 or FP8 on Hopper, the native FP4 path becomes meaningful only on Blackwell. Hopper has no native MXFP4 support; running FP4 there means software emulation, which forfeits the compute speedup and much of the bandwidth advantage that justify the format. The migration is a hardware question as much as a software one.
But the calibration and validation work — establishing rotation pipelines, GPTQ calibration datasets, model-specific perplexity baselines — is hardware-independent and can start now. The paper’s April 20, 2026 submission (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv abstract)) lands as Blackwell begins shipping; teams waiting until the hardware is in production to begin validation work are starting late.
Migration Checklist: Rotation Overhead, GPTQ Compensation, and Model Coverage
Rotation overhead: The online rotation runs on every inference pass and is halved relative to DuQuant (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv abstract)), but it is not free. Profile it at your serving batch size: at small batch sizes, typical of interactive or low-concurrency inference, the rotation consumes a larger fraction of total compute than at large batch, and the cost-benefit calculation shifts accordingly.
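A minimal way to time the per-block rotation in isolation is sketched below. The shapes, and the use of NumPy on CPU rather than a fused GPU kernel, are illustrative assumptions; measure in your real serving stack before drawing conclusions.

```python
import time
import numpy as np

d, B = 4096, 32  # assumed hidden size; rotation block matched to the MXFP4 group

# One random orthogonal B x B rotation per 32-channel group (stand-in for
# the calibrated outlier-aware rotations).
blocks = np.stack([np.linalg.qr(np.random.normal(size=(B, B)))[0]
                   for _ in range(d // B)])

def rotate(x):
    # x: (batch, d) -> apply each block's rotation to its 32-channel group
    xb = x.reshape(x.shape[0], d // B, B)
    return np.einsum('gij,bgj->bgi', blocks, xb).reshape(x.shape[0], d)

for batch in (1, 64):
    x = np.random.normal(size=(batch, d))
    t0 = time.perf_counter()
    for _ in range(10):
        rotate(x)
    dt = (time.perf_counter() - t0) / 10
    print(f"batch={batch}: {dt * 1e6:.0f} us per call")
```

Comparing the per-call time against your end-to-end forward-pass latency at the same batch size gives the overhead fraction worth tracking.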
GPTQ compensation: If the headline accuracy numbers are the target, GPTQ is required (DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (arXiv HTML with tables)). That means a calibration run, a representative dataset, and corrected weights for each model variant. Running rotation-only is an option; the accuracy gap to FP16 will be wider, and teams should measure rather than estimate how much wider on their specific model.
Model coverage: LLaMA-3-8B and LLaMA-3.2-3B are covered in the paper. Larger models (LLaMA-3-70B) and MoE architectures are not. The B=32 alignment is theoretically architecture-agnostic, but outlier distributions differ enough across model families that perplexity gaps will not transfer directly from the paper’s numbers to an arbitrary target model.
Standardization risk: DuQuant++ is not the only rotation-based FP4 approach. QuaRot and SpinQuant occupy the same space, and the paper shows DuQuant++ ahead on LLaMA-3. Which method inference framework operators converge on is not determined by accuracy numbers alone — it follows from kernel availability, integration complexity, and licensing. A team standardizing on DuQuant++ today is betting on its framework trajectory, not just its current benchmark position.
Frequently Asked Questions
Does DuQuant++ work on MoE architectures like DeepSeek or Mixtral?
Untested. The paper evaluates only dense LLaMA-3 family models and does not cover MoE architectures. Whether B=32 alignment transfers cleanly to MoE models is an open question, since their activation patterns tend toward sparser but more extreme outliers.
How does DuQuant++ differ from the original DuQuant method?
DuQuant++ sets the rotation block size B=32 to match the MXFP4 microscaling group size, replacing the original DuQuant’s dual rotations and zigzag permutation with a single outlier-aware rotation per block. This alignment halves the online rotation cost.
What do teams need to implement to match the headline accuracy numbers?
The headline perplexity numbers require GPTQ post-hoc weight correction at calibration time, which needs a representative calibration dataset and corrected weights for each model variant. Running rotation-only without GPTQ yields higher perplexity.
Does the paper benchmark FP4 against FP8 directly?
No. The inference that W4A4 MXFP4 accuracy is close to FP8 follows from how small the gap to FP16 is, not from a head-to-head comparison. That extrapolation is plausible but not present in the data.
Can Hopper GPUs run DuQuant++ with FP4?
Hopper has no native MXFP4 support, so running FP4 there requires software emulation that eliminates the memory and bandwidth savings entirely. The native FP4 path is only meaningful on Blackwell hardware.