SOAR (Self-Correction for Optimal Alignment and Refinement), published by Tencent’s Hunyuan team on April 14, 2026, replaces the standard SFT post-training stage for diffusion models. It trains the denoiser to recover from the off-trajectory states it actually encounters at inference — no reward model, no preference labels, no detect-and-retry loop. On SD3.5-Medium, this yields an 11% relative GenEval improvement over a matched SFT baseline at identical training cost[1].
What Exposure Bias Actually Means for Diffusion Models
The problem SOAR addresses has a name that undersells its practical consequences: exposure bias. During supervised fine-tuning, a diffusion model is trained on clean, ground-truth forward-noised states — the ideal denoising trajectory you would get from a perfect solver. At inference, however, the model generates its own intermediate states, and those states inevitably diverge from the training distribution. The model then has to denoise inputs it was never trained on, at every single step[1].
In a 40-step denoising chain, a small deviation at step five compounds across the remaining 35 steps. Standard SFT has no mechanism to correct this drift, because during training it never sees the off-trajectory inputs the model actually produces.
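A back-of-the-envelope recurrence makes the compounding concrete. This toy model is our own construction, not the paper's: assume each step amplifies whatever drift is already present and adds a small fresh error of its own.

```python
def trajectory_gap(n_steps, step_err=0.02):
    """Toy drift model: each step amplifies existing drift by (1 + step_err)
    and adds step_err of fresh error. Closed form: (1 + step_err)**n - 1."""
    gap = 0.0
    for _ in range(n_steps):
        gap = gap * (1 + step_err) + step_err
    return gap

print(round(trajectory_gap(5), 3))   # → 0.104 (modest after a few steps)
print(round(trajectory_gap(40), 3))  # → 1.208 (roughly 12x larger at 40 steps)
```

The exact numbers are arbitrary; the point is the shape. Drift grows like (1 + e)^n − 1, so later steps inherit and magnify every earlier deviation, exactly the regime SFT never trains on.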
Prior approaches attacked exposure bias through reward-based reinforcement learning or preference-pair optimization. Both require either a trained reward model or human- and AI-labeled preference data — operational overhead that many teams cannot afford at the cadence image pipelines demand.
How SOAR Works: Stop-Gradient Rollout, Re-Noising, and Per-Timestep Correction
SOAR’s mechanism is compact. For each training example[1][2]:
- Rollout: Run a forward pass on the current model weights to generate an intermediate off-trajectory state. The rollout is stop-gradient — no backprop flows through it.
- Re-noise: Take that off-trajectory state and add noise, creating a corrupted version that reflects where the model actually lands mid-denoising.
- Supervise correction: Train the model to denoise this re-noised, off-trajectory state back toward the original clean target.
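The three steps above can be sketched in a few lines. This is a toy reconstruction under our own assumptions — `TinyDenoiser`, `rollout`, and `soar_step` are illustrative names, and the real method uses a flow-matching denoiser with six auxiliary points — not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in denoiser: predicts the clean sample from a noisy input."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 32), nn.SiLU(), nn.Linear(32, dim))

    def forward(self, x, sigma):
        # condition on the noise level by concatenating it as a feature
        s = sigma.expand(x.shape[0], 1)
        return self.net(torch.cat([x, s], dim=-1))

def rollout(model, x_start, sigmas):
    """Step 1: stop-gradient rollout to an off-trajectory intermediate state."""
    x = x_start
    with torch.no_grad():  # no backprop flows through the rollout
        for s_hi, s_lo in zip(sigmas[:-1], sigmas[1:]):
            x0_hat = model(x, s_hi)
            x = x0_hat + (s_lo / s_hi) * (x - x0_hat)  # Euler-style move toward x0_hat
    return x

def soar_step(model, x0, sigmas, aux_sigma):
    """Steps 2 and 3: re-noise the rolled-out state, supervise correction to x0."""
    x_start = x0 + sigmas[0] * torch.randn_like(x0)           # start of the chain
    x_off = rollout(model, x_start, sigmas)                   # where the model actually lands
    x_renoised = x_off + aux_sigma * torch.randn_like(x_off)  # step 2: re-noise
    pred = model(x_renoised, aux_sigma)                       # step 3: denoise the corrupted state
    return F.mse_loss(pred, x0)                               # back toward the clean target
```

A single update is then `loss = soar_step(model, batch, sigmas, aux_sigma)` followed by `loss.backward()`; gradients reach the model only through the correction pass, never through the rollout.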
The result is dense per-timestep supervision. Rather than optimizing a single endpoint loss (SFT) or a sparse reward signal (RL), SOAR corrects the denoiser at every auxiliary point along the trajectory — six auxiliary points in the default configuration, over 40 sampling steps[3].
Crucially, SOAR’s loss function mathematically subsumes the standard SFT objective. Switching from SFT to SOAR requires no architecture changes, no new data, and no reward infrastructure: it is a direct drop-in replacement for Stage 1 post-training[1].
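The paper states that the SOAR loss subsumes the SFT objective; in our own notation (not the paper's), one way to write such a combined objective is:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x_0,\sigma,\epsilon}\big[\lVert D_\theta(x_0 + \sigma\epsilon,\ \sigma) - x_0 \rVert^2\big] \;+\; \lambda \sum_{k=1}^{K} \mathbb{E}\big[\lVert D_\theta(\mathrm{sg}(\hat{x}_k) + \sigma_k\epsilon,\ \sigma_k) - x_0 \rVert^2\big]
$$

Here the first term is the standard SFT objective (denoise a forward-noised ground-truth state), the sum supervises corrections at $K$ auxiliary points, and $\mathrm{sg}(\hat{x}_k)$ denotes the stop-gradient off-trajectory rollout state. Setting $K = 0$ (or $\lambda = 0$) recovers plain SFT, which is the sense in which the switch is a drop-in replacement.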
The Numbers: GenEval, OCR, Aesthetic, and Preference Metrics on SD3.5-M
On SD3.5-Medium over 10,000 training steps, SOAR achieves the following over the SFT baseline[2]:
| Metric | SFT Baseline | SOAR | Relative Change |
|---|---|---|---|
| GenEval (overall) | 0.70 | 0.78 | +11.4% |
| OCR Score | 0.64 | 0.67 | +4.7% |
| PickScore | 22.71 | 22.86 | +0.7% |
| HPSv2.1 | 0.284 | 0.289 | +1.8% |
| Aesthetic Score | 5.35 | 5.46 | +2.1% |
| ImageReward | 1.04 | 1.09 | +4.8% |
GenEval is a composite multi-object prompt-adherence benchmark; the overall score is what SOAR reports. This is a different measurement from the compositional sub-task scores that Flow-GRPO reports. Those two numbers measure related but non-identical things and cannot be stacked into a single ranking without heavy qualification (see FAQ).
On a curated high-aesthetic subset of 3,725 image pairs (aesthetic score ≥ 6.8 at selection time), SOAR achieves an aesthetic score of 5.94 versus Flow-GRPO’s 5.87 and SFT’s 5.74[2]. In a separate experiment on a high-ClipScore subset of 6,857 pairs (ClipScore ≥ 0.40), SOAR scores 0.300 versus SFT’s 0.297 and Flow-GRPO’s 0.296[2]. Flow-GRPO uses explicit reward signals to optimize aesthetic quality; SOAR matches and slightly exceeds it on both subsets without any reward model.
SOAR vs. Flow-GRPO vs. LeapAlign: Where It Fits in the Post-Training Stack
Three methods now occupy distinct positions in the diffusion post-training space as of April 2026:
Flow-GRPO (NeurIPS 2025)[4] applies group-based policy gradient with ODE-to-SDE conversion. It requires a reward model and generates groups of candidate outputs to compute relative advantage — closest in spirit to RLHF in the LLM world. High ceiling on alignment quality, higher operational cost.
LeapAlign (April 2026)[5] backpropagates reward gradients through shortened two-step “leap trajectories” rather than full denoising chains. It achieves HPSv2.1 of 0.4092 and GenEval of 0.7420 according to its paper, with a different baseline set than SOAR’s experiments. Like Flow-GRPO, it requires a differentiable reward signal.
SOAR (April 2026)[1] requires neither a reward model nor preference pairs. Its loss subsumes SFT, making it the lowest-friction entry in the stack. The paper notes compatibility with subsequent reward-based alignment, positioning SOAR as Stage 1 — after which Flow-GRPO or a similar RL step can optionally be layered as Stage 2.
| Method | Reward Model Required | Preference Labels Required | Typical Stage |
|---|---|---|---|
| Standard SFT | No | No | Stage 1 |
| SOAR | No | No | Stage 1 (replaces SFT) |
| Flow-GRPO | Yes | No | Stage 2 |
| LeapAlign | Yes (differentiable) | No | Stage 2 |
| DPO variants | No | Yes | Stage 1/2 |
DPO variants appear for completeness. The brief does not include DPO benchmark data on SD3.5-M under equivalent conditions, so no numeric comparison is presented.
Practitioner Decision Guide: When to Use SOAR, Hardware Requirements, and Pipeline Integration
For teams running SD3.5-based image generation pipelines — ad creative, product photography, synthetic data — the concrete SOAR offer as of April 2026 is:
- No new data: SOAR uses existing SFT training samples. The 10k-step benchmark represents a realistic fine-tuning budget for most teams.
- No reward infrastructure: No reward model to train, serve, or maintain.
- Hardware: Single-node 8 GPUs, per-GPU batch size 4, global batch size 32[3]. The ODE-only rollout variant halves the auxiliary compute overhead relative to the full method.
- Training duration: 5,000–10,000 steps in the default configuration[3].
- Integration: Drop-in replacement for the Stage 1 SFT step. Training code is open-source under the Tencent-Hunyuan GitHub organization[3].
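To illustrate the drop-in claim, here is a generic trainer sketch in which the only SOAR-specific change would be the `loss_fn` argument. All names are ours, not the HY-SOAR repository's API, and `sft_loss` is a simplified stand-in for the real flow-matching objective:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, x0, sigma):
    """Stage 1 baseline objective: denoise a forward-noised ground-truth sample."""
    noisy = x0 + sigma * torch.randn_like(x0)
    return F.mse_loss(model(noisy, sigma), x0)

def train(model, batches, loss_fn, lr=1e-2):
    """Generic loop: moving from SFT to SOAR means passing a different loss_fn.
    The data pipeline, model, and optimizer are untouched."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x0 in batches:
        loss = loss_fn(model, x0, torch.tensor(0.8))  # fixed noise level for brevity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Swapping in a SOAR-style `loss_fn` that rolls out, re-noises, and supervises correction leaves every other line of the loop unchanged, which is what "no architecture changes, no new data" amounts to in practice.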
Teams that already operate a reward model for RL alignment can treat SOAR as a Stage 1 foundation, then apply reward-based alignment on top. The paper asserts this stacking remains viable[1].
What Isn’t Tested Yet: Video, Flux, Diversity Degradation, and the Road to Inference-Time Correction
The paper’s authors identify three limitations explicitly[2]:
Diversity narrowing: SOAR’s per-timestep correction supervision could narrow output diversity at high-noise stages, where the model is guided back toward a specific training target. The authors acknowledge this risk but include no quantitative evaluation — no LPIPS measurement, no Vendi Score. This is a self-reported concern, not a confirmed finding. Do not assume SOAR degrades diversity, and do not assume it does not; the data does not yet exist.
Fixed noise weighting: The noise weighting schedule w(σ) is inherited from base flow matching rather than optimized for the correction objective. This is a known limitation that future work may address.
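Schematically, and in our notation rather than the paper's, a noise-level weighting enters a flow-matching-style objective as:

$$
\mathcal{L}(\theta) = \mathbb{E}_{\sigma}\Big[ w(\sigma)\, \mathbb{E}_{x_0,\epsilon}\big[\lVert D_\theta(x_0 + \sigma\epsilon,\ \sigma) - x_0 \rVert^2\big] \Big]
$$

The limitation is that the same base-model $w(\sigma)$ is applied unchanged to SOAR's correction terms, even though re-noised off-trajectory inputs might warrant a different emphasis across noise levels.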
Evaluation scope: All benchmarks reported as of April 22, 2026 are on SD3.5-Medium at 512×512 resolution. There is no published assessment on SDXL, Flux, video generation, 3D synthesis, or model distillation. Whether the exposure-bias correction mechanism transfers cleanly to these settings is unknown.
The distinction from inference-time correction also matters for pipeline planning. Encoding correction into weights during training is efficient at inference but static: a SOAR-trained model cannot adapt its correction strategy to novel prompt distributions at serving time without retraining. Inference-time self-correction for diffusion models — generate, evaluate, revise — remains an open research problem distinct from what SOAR addresses.
FAQ
Does SOAR require preference-labeled data, as DPO methods do?
No. SOAR is entirely reward- and preference-label-free[1]. The correction supervision derives from the model’s own rollout, compared against the original clean training target. Multiple AI-generated secondary summaries of the paper incorrectly describe SOAR as using preference pairs; the actual paper contains no such mechanism.
Can SOAR be applied to Flux or SDXL today?
Not with published benchmarks. The paper evaluates exclusively on SD3.5-Medium at 512×512[2]. Training code is publicly available for experimentation, but results on other architectures have not been reported as of April 22, 2026.
Flow-GRPO’s compositional sub-task score went from 0.63 to 0.95; SOAR’s overall GenEval went from 0.70 to 0.78. Does that mean Flow-GRPO is better on composition?
These numbers are not directly comparable. Flow-GRPO’s 0.63→0.95 is a compositional sub-task accuracy, not the same metric as SOAR’s overall GenEval score[4]. A valid comparison would require both methods evaluated on the same benchmark variant under the same conditions. On the high-aesthetic subset where both are reported side-by-side, SOAR’s aesthetic score (5.94) narrowly exceeds Flow-GRPO’s (5.87)[2] — though that subset targets aesthetic quality, not compositional accuracy.
Footnotes

1. SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models, arXiv .12617, Tencent Hunyuan team, April 14, 2026.
2. SOAR full paper (HTML), arXiv .12617.
3. HY-SOAR GitHub repository, Tencent-Hunyuan organization, accessed April 22, 2026.
4. Flow-GRPO: Training Flow Matching Models via Online RL, arXiv .05470, NeurIPS 2025.
5. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories, arXiv .15311, April 2026.