FP4 quantization on NVIDIA Blackwell hardware is fast in theory and fragile in practice: a handful of activation outliers can inflate the shared block scale factor enough to crush the effective dynamic range for every other element in that group. DuQuant++ (arXiv 2604.17789) attacks that fragility directly, adapting outlier-aware rotation to fit MXFP4’s fixed 32-element block structure and halving the online rotation cost relative to its predecessor in the process. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789)
Why MXFP4 Is Uniquely Sensitive to Outliers
MXFP4 — the microscaling 4-bit format with native support on Blackwell Tensor Cores — partitions tensors into blocks of 32 elements, each sharing a single E8M0 scaling factor. That shared scale is the vulnerability. When one or two activation channels carry values an order of magnitude larger than their neighbors, the block-level scale inflates to cover them, and the 4-bit mantissa can no longer resolve the smaller values with any precision. The result is not a graceful accuracy loss; on models like LLaMA 3.2-3B, naive quantization with QuaRot degrades WikiText2 perplexity from 7.81 to 17.95 — a 130% increase. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789)
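The mechanism is easy to see in a toy sketch. This is an assumption-laden simplification, not the hardware-exact MXFP4 algorithm: 32 elements share one power-of-two (E8M0-style) scale, and each element is rounded to the FP4 E2M1 magnitude grid, whose largest value is 6.0.

```python
import numpy as np

# FP4 E2M1 magnitudes (sign handled separately); max representable is 6.0.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    amax = np.abs(block).max()
    # Pick the smallest power-of-two scale with amax/scale <= 6.0,
    # mimicking a shared E8M0 block scale.
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = np.abs(block) / scale
    # Round each magnitude to the nearest FP4 grid point, restore signs.
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.1, 32)   # typical small activations
block[5] = 8.0                     # one outlier channel

deq = quantize_block(block)
others = np.delete(deq, 5)
# The outlier forces scale = 2.0, so the smallest nonzero representable
# magnitude becomes 0.5 * 2.0 = 1.0, and the small activations round to zero.
print("elements crushed to zero:", int((others == 0.0).sum()), "of 31")
```

The outlier itself survives intact; it is every other element in the block that loses its information.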
The standard counter-move — Hadamard or random orthogonal rotation — redistributes outlier magnitude across channels before quantization. The problem is that existing rotation methods are data-agnostic: they spread the energy uniformly rather than targeting the specific channels where outliers actually concentrate. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789) That imprecision is acceptable at 8-bit but not at 4-bit, where every bit of representable range is load-bearing.
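The data-agnostic version of that counter-move can be sketched directly: a normalized Hadamard matrix (Sylvester construction) spreads one channel's outlier energy evenly across all 32 channels, shrinking the per-channel peak that a shared block scale must cover.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)   # orthonormal: H @ H.T == I

x = np.full(32, 0.1)
x[0] = 10.0                 # one dominant outlier channel
y = hadamard(32) @ x

# Outlier magnitude drops from 10.0 to about 2.32 after rotation.
print("max before:", np.abs(x).max(), "max after:", round(np.abs(y).max(), 2))
# The rotation is orthogonal, so the vector's norm is unchanged;
# only the per-channel peaks move.
assert np.isclose(np.linalg.norm(x), np.linalg.norm(y))
```

Because the rotation is fixed and data-agnostic, it spreads energy the same way regardless of where the outliers actually sit — the imprecision the paper targets.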
How DuQuant++ Rethinks the Rotation
The original DuQuant used two sequential rotations plus a zigzag permutation to deal with cross-block variance: outlier channels that straddle block boundaries under one rotation scheme could be reordered and re-rotated to land inside a single block. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789) The mechanism works but compounds online cost — every forward pass that needs activation rotation pays for it twice.
DuQuant++ collapses this to a single rotation by exploiting the one structural fact that makes MXFP4 different from FP8 or INT8: each block has its own independent scaling factor. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789) Because each 32-element group rescales independently, the cross-block leakage problem that forced dual rotations in DuQuant is substantially reduced. Aligning the rotation block size to B=32 — MXFP4’s native group size — means a single calibrated rotation can suppress outliers within the granularity that actually matters for the quantizer.
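Structurally, a rotation aligned to B=32 acts block-diagonally: it redistributes values only among the 32 elements that already share a scale factor, so an outlier can never leak into a neighboring block. DuQuant++ learns its rotations from calibration data; the fixed per-block Hadamard used below is a stand-in assumption to show the blocking structure.

```python
import numpy as np

B = 32  # MXFP4's native group size

def hadamard(n):
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_per_block(x, R):
    # View the hidden dimension as (num_blocks, B) and rotate each
    # group independently -- no mixing across block boundaries.
    return (x.reshape(-1, B) @ R.T).reshape(x.shape)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.1, 4096)   # one activation row, 128 blocks
x[::128] = 5.0                   # scattered outlier channels
y = rotate_per_block(x, hadamard(B))

per_block_amax = np.abs(y).reshape(-1, B).max(axis=1)
print("worst per-block amax before:", np.abs(x).max(),
      "after:", per_block_amax.max())
```

The per-block maximum after rotation is what the E8M0 scale must cover, so lowering it is exactly what buys back dynamic range for the quantizer.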
The calibration loop uses a greedy search with up to 128 steps over 128 WikiText2 samples at sequence length 2048. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789) That is a one-time offline cost. The runtime overhead is a single matrix multiply per layer that needs activation rotation, not two.
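The shape of that loop can be sketched as follows. This reproduces only the greedy structure, not the paper's objective or candidate-generation moves, which are assumptions here: up to 128 steps, each keeping a candidate orthogonal rotation only if it lowers block-wise quantization error on calibration activations.

```python
import numpy as np

B = 32

def quant_error(acts, R):
    y = acts.reshape(-1, B) @ R.T
    amax = np.abs(y).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    # Crude uniform 0.5-step grid as a stand-in for the FP4 E2M1 grid.
    q = np.clip(np.round(y / scale * 2.0) / 2.0, -6.0, 6.0) * scale
    return float(((q - y) ** 2).mean())

def random_rotation(n, rng):
    # QR of a Gaussian matrix yields a random orthogonal matrix.
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return Q

rng = np.random.default_rng(0)
calib = rng.normal(size=(128, 128))  # toy stand-in for calibration activations
calib[:, 3] *= 20.0                  # a persistent outlier channel

best_R = np.eye(B)
best = quant_error(calib, best_R)
for _ in range(128):                 # greedy search, up to 128 steps
    cand = random_rotation(B, rng)
    err = quant_error(calib, cand)
    if err < best:
        best, best_R = err, cand

print("identity error:", quant_error(calib, np.eye(B)), "best found:", best)
```

Everything here runs once, offline; the learned `best_R` is what gets baked into the single per-layer rotation applied at inference.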
What the Numbers Show
On LLaMA 3-8B under MXFP4 W4A4, DuQuant++ reaches 6.88 WikiText2 perplexity and 67.1% average zero-shot accuracy across seven QA benchmarks (ARC-E, ARC-C, HellaSwag, WinoGrande, LAMBADA, PIQA, OpenBookQA). The FP16 baseline sits at 6.14 perplexity and 69.1% accuracy — a gap of 0.74 perplexity points and 2.0 percentage points that represents the irreducible cost of W4A4 at this model scale. Against the closest prior method, MR-GPTQ, DuQuant++ improves perplexity by 0.41 points and accuracy by 1.0 percentage point. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789)
The signal is clearer on the smaller model. LLaMA 3.2-3B is harder to quantize — smaller models tend to have less redundancy to absorb quantization error — and the outlier problem hits harder. DuQuant++ reaches 8.63 WikiText2 perplexity versus QuaRot’s 17.95, a roughly 50% relative improvement. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789) The FP16 baseline is 7.81, so DuQuant++ closes most of the gap that QuaRot leaves open.
Instruction-tuned variants follow a similar pattern: LLaMA 3-8B-Instruct reaches 8.75 perplexity and 65.9% average accuracy; LLaMA 3.1-8B-Instruct reaches 7.89 perplexity and 67.4%. The paper does not report results at 70B scale, so the inference that the approach extends to 70B+ models is reasonable but not directly benchmarked in this preprint. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789)
The Calibration Cost Tradeoff
DuQuant++ shifts cost from runtime to calibration. The 128-sample greedy search with up to 128 steps is not expensive by training standards, but it is not zero — and it is per-model, not per-deployment. Teams that want to run a quantized LLaMA 70B on Blackwell need to run calibration once per checkpoint, on a calibration dataset that is at least somewhat representative of their workload.
That is a different engineering problem than “can we fit this model in memory.” The memory question is resolved by the quantization itself: W4A4 at FP4 cuts activation memory roughly 4× relative to FP16. The calibration question is about tooling: does the team have a pipeline to produce and validate quantized checkpoints, and does the calibration dataset drift enough from production queries to make the perplexity numbers misleading?
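The memory arithmetic behind that "roughly 4×" is worth making explicit. For weights at 70B-parameter scale (activations scale similarly per tensor), MXFP4 pays 4 bits per element plus one 8-bit E8M0 scale amortized over every 32 elements:

```python
# Back-of-envelope weight memory for a 70B-parameter model.
params = 70e9
fp16_gb = params * 16 / 8 / 1e9           # 2 bytes per parameter
mxfp4_bits_per_elem = 4 + 8 / 32          # payload + amortized scale = 4.25 bits
mxfp4_gb = params * mxfp4_bits_per_elem / 8 / 1e9

# Prints: FP16: 140 GB, MXFP4: 37.2 GB, ratio: 3.76x
print(f"FP16: {fp16_gb:.0f} GB, MXFP4: {mxfp4_gb:.1f} GB, "
      f"ratio: {fp16_gb / mxfp4_gb:.2f}x")
```

The shared-scale overhead is why the ratio lands at 3.76× rather than a clean 4× — small, but worth budgeting for when the fit is tight.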
Implications for Blackwell Deployments
Blackwell’s FP4 Tensor Core support makes W4A4 a viable inference path for the first time in a production GPU generation. (DuQuant++: Outlier-Aware Fine-Grained Rotation for MXFP4 Quantization. arXiv 2604.17789) The question DuQuant++ answers is whether accuracy can be preserved well enough to make that path useful, not just theoretically possible. The LLaMA 3-8B results — 0.74 perplexity points above FP16, 2.0 percentage points below on zero-shot accuracy — suggest the answer is yes for general workloads, with caveats for smaller models and domain-specific applications where the calibration distribution matters more.
The second-order consequence is that the bottleneck for deploying large models on FP4 hardware shifts from memory to calibration engineering. Memory capacity is a hardware constant; calibration pipelines are a software and organizational problem. Teams that already have post-training quantization workflows can likely adapt them. Teams that have been deploying W8A8 or W4A8 models without calibration-aware tooling face a larger retooling cost before they can capture the memory savings that FP4 offers at 70B+ scale.
The preprint arrived on 20 April 2026, two days before this article. Independent replication of the benchmark numbers has not yet been reported.
Frequently Asked Questions
Can DuQuant++ be applied to non-LLaMA architectures?
The paper evaluates only the LLaMA 3 family. Because the method operates on activation distributions rather than model-specific structural features, it should generalize to other transformer architectures — but the calibration step would need to be rerun, and the quality of the learned rotation depends on how similar the outlier channel structure is to LLaMA’s. No cross-architecture benchmarks have been published.
What practical difference does halving the online rotation cost make?
At low batch sizes the rotation matrix multiply is a small fraction of per-token latency. The benefit appears at higher throughput or longer context lengths, where the original DuQuant’s dual rotation compounds across layers and tokens. A single rotation also simplifies kernel integration with the quantization step.
How long does calibration take, and what does a team need?
The paper uses 128 WikiText2 samples at sequence length 2048 with up to 128 greedy search steps — completing in minutes for an 8B model on a single GPU. The inputs are a pre-trained FP16 checkpoint and a calibration dataset; the output is a rotation matrix per layer. The workflow resembles existing post-training quantization pipelines like GPTQ rather than requiring new infrastructure.
When should teams choose FP8 over FP4 quantization instead?
FP8 with W8A8 is the safer default: no outlier-aware rotation is needed, accuracy loss is minimal, and the calibration complexity is avoided entirely. FP4 with DuQuant++ makes sense when memory pressure is the binding constraint — fitting a larger model on limited GPU memory — and the team can accept the 1–2 percentage point accuracy gap. For models under 8B parameters, the accuracy penalty from W4A4 may not justify the memory savings.
What does a team need for FP4 deployment that they don’t need for FP8?
FP8 quantization is largely plug-and-play on supported hardware. FP4 with DuQuant++ introduces a per-checkpoint calibration step that must be rerun for every model variant, including fine-tuned versions. Teams need a reproducible calibration workflow, a representative calibration dataset, and a validation step that checks perplexity and task accuracy against an FP16 baseline — effectively a post-training engineering function rather than a deployment toggle.