SOAR (Self-Correction for Optimal Alignment and Refinement), published by Tencent’s Hunyuan team on April 14, 2026, replaces the standard SFT post-training stage for diffusion models. It trains the denoiser to recover from the off-trajectory states it actually encounters at inference — no reward model, no preference labels, no detect-and-retry loop. On SD3.5-Medium, this yields an 11% relative GenEval improvement over a matched SFT baseline at identical training cost[1].
What Exposure Bias Actually Means for Diffusion Models
The problem SOAR addresses has a name that undersells its practical consequences: exposure bias. During supervised fine-tuning, a diffusion model is trained on clean, ground-truth forward-noised states — the ideal denoising trajectory you would get from a perfect solver. At inference, however, the model generates its own intermediate states, and those states inevitably diverge from the training distribution. The model then has to denoise inputs it was never trained on, at every single step[1].
In a 40-step denoising chain, a small deviation at step five compounds across the remaining 35 steps. Standard SFT has no mechanism to correct this drift, because during training it never sees the off-trajectory inputs the model actually produces.
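A back-of-the-envelope recurrence makes the compounding concrete. This toy model is our own construction, not the paper's: assume each step amplifies whatever drift is already present and adds a small fresh error of its own.

```python
def trajectory_gap(n_steps, step_err=0.02):
    """Toy drift model: each step amplifies existing drift by (1 + step_err)
    and adds step_err of fresh error. Closed form: (1 + step_err)**n - 1."""
    gap = 0.0
    for _ in range(n_steps):
        gap = gap * (1 + step_err) + step_err
    return gap

print(round(trajectory_gap(5), 3))   # → 0.104 (modest after a few steps)
print(round(trajectory_gap(40), 3))  # → 1.208 (roughly 12x larger at 40 steps)
```

The exact numbers are arbitrary; the point is the shape. Drift grows like (1 + e)^n − 1, so later steps inherit and magnify every earlier deviation, exactly the regime SFT never trains on.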
Prior approaches attacked exposure bias through reward-based reinforcement learning or preference-pair optimization. Both require either a trained reward model or human- and AI-labeled preference data — operational overhead that many teams cannot afford at the cadence image pipelines demand.
How SOAR Works: Stop-Gradient Rollout, Re-Noising, and Per-Timestep Correction
SOAR’s mechanism is compact. For each training example[1][2]:
- Rollout: Run a forward pass on the current model weights to generate an intermediate off-trajectory state. The rollout is stop-gradient — no backprop flows through it.
- Re-noise: Take that off-trajectory state and add noise, creating a corrupted version that reflects where the model actually lands mid-denoising.
- Supervise correction: Train the model to denoise this re-noised, off-trajectory state back toward the original clean target.
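The three steps above can be sketched in a few lines. This is a toy reconstruction under our own assumptions — `TinyDenoiser`, `rollout`, and `soar_step` are illustrative names, and the real method uses a flow-matching denoiser with six auxiliary points — not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in denoiser: predicts the clean sample from a noisy input."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 32), nn.SiLU(), nn.Linear(32, dim))

    def forward(self, x, sigma):
        # condition on the noise level by concatenating it as a feature
        s = sigma.expand(x.shape[0], 1)
        return self.net(torch.cat([x, s], dim=-1))

def rollout(model, x_start, sigmas):
    """Step 1: stop-gradient rollout to an off-trajectory intermediate state."""
    x = x_start
    with torch.no_grad():  # no backprop flows through the rollout
        for s_hi, s_lo in zip(sigmas[:-1], sigmas[1:]):
            x0_hat = model(x, s_hi)
            x = x0_hat + (s_lo / s_hi) * (x - x0_hat)  # Euler-style move toward x0_hat
    return x

def soar_step(model, x0, sigmas, aux_sigma):
    """Steps 2 and 3: re-noise the rolled-out state, supervise correction to x0."""
    x_start = x0 + sigmas[0] * torch.randn_like(x0)           # start of the chain
    x_off = rollout(model, x_start, sigmas)                   # where the model actually lands
    x_renoised = x_off + aux_sigma * torch.randn_like(x_off)  # step 2: re-noise
    pred = model(x_renoised, aux_sigma)                       # step 3: denoise the corrupted state
    return F.mse_loss(pred, x0)                               # back toward the clean target
```

A single update is then `loss = soar_step(model, batch, sigmas, aux_sigma)` followed by `loss.backward()`; gradients reach the model only through the correction pass, never through the rollout.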
The result is dense per-timestep supervision. Rather than optimizing a single endpoint loss (SFT) or a sparse reward signal (RL), SOAR corrects the denoiser at every auxiliary point along the trajectory — six auxiliary points in the default configuration, over 40 sampling steps[3].
Crucially, SOAR’s loss function mathematically subsumes the standard SFT objective. Switching from SFT to SOAR requires no architecture changes, no new data, and no reward infrastructure: it is a direct drop-in replacement for Stage 1 post-training[1].
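The paper states that the SOAR loss subsumes the SFT objective; in our own notation (not the paper's), one way to write such a combined objective is:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x_0,\sigma,\epsilon}\big[\lVert D_\theta(x_0 + \sigma\epsilon,\ \sigma) - x_0 \rVert^2\big] \;+\; \lambda \sum_{k=1}^{K} \mathbb{E}\big[\lVert D_\theta(\mathrm{sg}(\hat{x}_k) + \sigma_k\epsilon,\ \sigma_k) - x_0 \rVert^2\big]
$$

Here the first term is the standard SFT objective (denoise a forward-noised ground-truth state), the sum supervises corrections at $K$ auxiliary points, and $\mathrm{sg}(\hat{x}_k)$ denotes the stop-gradient off-trajectory rollout state. Setting $K = 0$ (or $\lambda = 0$) recovers plain SFT, which is the sense in which the switch is a drop-in replacement.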
The Numbers: GenEval, OCR, Aesthetic, and Preference Metrics on SD3.5-M
On SD3.5-Medium over 10,000 training steps, SOAR achieves the following over the SFT baseline[2]:
| Metric | SFT Baseline | SOAR | Relative Change |
|---|---|---|---|
| GenEval (overall) | 0.70 | 0.78 | +11.4% |
| OCR Score | 0.64 | 0.67 | +4.7% |
| PickScore | 22.71 | 22.86 | +0.7% |
| HPSv2.1 | 0.284 | 0.289 | +1.8% |
| Aesthetic Score | 5.35 | 5.46 | +2.1% |
| ImageReward | 1.04 | 1.09 | +4.8% |
GenEval is a composite multi-object prompt-adherence benchmark; the overall score is what SOAR reports. This is a different measurement from the compositional sub-task scores that Flow-GRPO reports. Those two numbers measure related but non-identical things and cannot be stacked into a single ranking without heavy qualification (see FAQ).
On a curated high-aesthetic subset of 3,725 image pairs (aesthetic score ≥ 6.8 at selection time), SOAR achieves an aesthetic score of 5.94 versus Flow-GRPO’s 5.87 and SFT’s 5.74[2]. In a separate experiment on a high-ClipScore subset of 6,857 pairs (ClipScore ≥ 0.40), SOAR scores 0.300 versus SFT’s 0.297 and Flow-GRPO’s 0.296[2]. Flow-GRPO uses explicit reward signals to optimize aesthetic quality; SOAR matches and slightly exceeds it on both subsets without any reward model.
SOAR vs. Flow-GRPO vs. LeapAlign: Where It Fits in the Post-Training Stack
Three methods now occupy distinct positions in the diffusion post-training space as of April 2026:
Flow-GRPO (NeurIPS 2025)[4] applies group-based policy gradient with ODE-to-SDE conversion. It requires a reward model and generates groups of candidate outputs to compute relative advantage — closest in spirit to RLHF in the LLM world. High ceiling on alignment quality, higher operational cost.
LeapAlign (April 2026)[5] backpropagates reward gradients through shortened two-step “leap trajectories” rather than full denoising chains. It achieves HPSv2.1 of 0.4092 and GenEval of 0.7420 according to its paper, with a different baseline set than SOAR’s experiments. Like Flow-GRPO, it requires a differentiable reward signal.
SOAR (April 2026)[1] requires neither a reward model nor preference pairs. Its loss subsumes SFT, making it the lowest-friction entry in the stack. The paper notes compatibility with subsequent reward-based alignment, positioning SOAR as Stage 1 — after which Flow-GRPO or a similar RL step can optionally be layered as Stage 2.
| Method | Reward Model Required | Preference Labels Required | Typical Stage |
|---|---|---|---|
| Standard SFT | No | No | Stage 1 |
| SOAR | No | No | Stage 1 (replaces SFT) |
| Flow-GRPO | Yes | No | Stage 2 |
| LeapAlign | Yes (differentiable) | No | Stage 2 |
| DPO variants | No | Yes | Stage 1/2 |
DPO variants appear for completeness. The brief does not include DPO benchmark data on SD3.5-M under equivalent conditions, so no numeric comparison is presented.
Practitioner Decision Guide: When to Use SOAR, Hardware Requirements, and Pipeline Integration
For teams running SD3.5-based image generation pipelines — ad creative, product photography, synthetic data — the concrete SOAR offer as of April 2026 is:
- No new data: SOAR uses existing SFT training samples. The 10k-step benchmark represents a realistic fine-tuning budget for most teams.
- No reward infrastructure: No reward model to train, serve, or maintain.
- Hardware: Single-node 8 GPUs, per-GPU batch size 4, global batch size 32[3]. The ODE-only rollout variant halves the auxiliary compute overhead relative to the full method.
- Training duration: 5,000–10,000 steps in the default configuration[3].
- Integration: Drop-in replacement for the Stage 1 SFT step. Training code is open-source under the Tencent-Hunyuan GitHub organization[3].
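To illustrate the drop-in claim, here is a generic trainer sketch in which the only SOAR-specific change would be the `loss_fn` argument. All names are ours, not the HY-SOAR repository's API, and `sft_loss` is a simplified stand-in for the real flow-matching objective:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, x0, sigma):
    """Stage 1 baseline objective: denoise a forward-noised ground-truth sample."""
    noisy = x0 + sigma * torch.randn_like(x0)
    return F.mse_loss(model(noisy, sigma), x0)

def train(model, batches, loss_fn, lr=1e-2):
    """Generic loop: moving from SFT to SOAR means passing a different loss_fn.
    The data pipeline, model, and optimizer are untouched."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x0 in batches:
        loss = loss_fn(model, x0, torch.tensor(0.8))  # fixed noise level for brevity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Swapping in a SOAR-style `loss_fn` that rolls out, re-noises, and supervises correction leaves every other line of the loop unchanged, which is what "no architecture changes, no new data" amounts to in practice.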
Teams that already operate a reward model for RL alignment can treat SOAR as a Stage 1 foundation, then apply reward-based alignment on top. The paper asserts this stacking remains viable[1].
What Isn’t Tested Yet: Video, Flux, Diversity Degradation, and the Road to Inference-Time Correction
The paper’s authors identify three limitations explicitly[2]:
Diversity narrowing: SOAR’s per-timestep correction supervision could narrow output diversity at high-noise stages, where the model is guided back toward a specific training target. The authors acknowledge this risk but include no quantitative evaluation — no LPIPS measurement, no Vendi Score. This is a self-reported concern, not a confirmed finding. Do not assume SOAR degrades diversity, and do not assume it does not; the data does not yet exist.
Fixed noise weighting: The noise weighting schedule w(σ) is inherited from base flow matching rather than optimized for the correction objective. This is a known limitation that future work may address.
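Schematically, and in our notation rather than the paper's, a noise-level weighting enters a flow-matching-style objective as:

$$
\mathcal{L}(\theta) = \mathbb{E}_{\sigma}\Big[ w(\sigma)\, \mathbb{E}_{x_0,\epsilon}\big[\lVert D_\theta(x_0 + \sigma\epsilon,\ \sigma) - x_0 \rVert^2\big] \Big]
$$

The limitation is that the same base-model $w(\sigma)$ is applied unchanged to SOAR's correction terms, even though re-noised off-trajectory inputs might warrant a different emphasis across noise levels.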
Evaluation scope: All benchmarks reported as of April 22, 2026 are on SD3.5-Medium at 512×512 resolution. There is no published assessment on SDXL, Flux, video generation, 3D synthesis, or model distillation. Whether the exposure-bias correction mechanism transfers cleanly to these settings is unknown.
The distinction from inference-time correction also matters for pipeline planning. Encoding correction into weights during training is efficient at inference but static: a SOAR-trained model cannot adapt its correction strategy to novel prompt distributions at serving time without retraining. Inference-time self-correction for diffusion models — generate, evaluate, revise — remains an open research problem distinct from what SOAR addresses.
FAQ
Does SOAR require preference-labeled data, as DPO methods do?
No. SOAR is entirely reward- and preference-label-free[1]. The correction supervision derives from the model’s own rollout, compared against the original clean training target. Multiple AI-generated secondary summaries of the paper incorrectly describe SOAR as using preference pairs; the actual paper contains no such mechanism.
Can SOAR be applied to Flux or SDXL today?
Not with published benchmarks. The paper evaluates exclusively on SD3.5-Medium at 512×512[2]. Training code is publicly available for experimentation, but results on other architectures have not been reported as of April 22, 2026.
Flow-GRPO’s compositional sub-task score went from 0.63 to 0.95; SOAR’s overall GenEval went from 0.70 to 0.78. Does that mean Flow-GRPO is better on composition?
These numbers are not directly comparable. Flow-GRPO’s 0.63→0.95 is a compositional sub-task accuracy, not the same metric as SOAR’s overall GenEval score[4]. A valid comparison would require both methods evaluated on the same benchmark variant under the same conditions. On the high-aesthetic subset where both are reported side-by-side, SOAR’s aesthetic score (5.94) narrowly exceeds Flow-GRPO’s (5.87)[2] — though that subset targets aesthetic quality, not compositional accuracy.
Footnotes

1. SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models, arXiv .12617, Tencent Hunyuan team, April 14, 2026.
2. SOAR full paper (HTML), arXiv .12617.
3. HY-SOAR GitHub repository, Tencent-Hunyuan organization, accessed April 22, 2026.
4. Flow-GRPO: Training Flow Matching Models via Online RL, arXiv .05470, NeurIPS 2025.
5. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories, arXiv .15311, April 2026.