
Diffusion Model Safety: How Training-Schedule Poisoning Slips Past Prompt Filters

TEMPO-Diffusion gates its backdoor to a training-timestep window, so clean inference output no longer proves a clean checkpoint. Output-only audits miss the poisoning.


TEMPO-Diffusion, an arXiv preprint posted 24 June 2026 by William Aiken and colleagues at the University of Ottawa, slips past prompt filters by construction. It gates its malicious behavior to a training-time exposure window rather than an inference-time noise trigger, so the prompt-inspection and output-sampling layers most teams rely on never see the poisoned behavior activate. That breaks the assumption that a model passing clean validation at inference is itself clean.

What does TEMPO-Diffusion change about diffusion backdoors?

TEMPO-Diffusion moves the backdoor trigger off the inference-time noise seed that prior attacks require and onto a training-time exposure window the attacker schedules. The authors present this as the first diffusion backdoor to localize a malicious distribution shift to a temporal, in-distribution window during training, rather than at inference.

Prior diffusion backdoors all depend on attacker-controlled inference-time input: the attacker supplies a specific noise pattern or seed at sampling time to flip the model’s output. According to the paper’s mechanism description, TEMPO-Diffusion removes that requirement. Every class is exposed to the trigger during training, but only the designated victim class is associated with the malicious objective, leaving non-victim classes to behave cleanly.

That victim-class gating is the substantive novelty. The claim is the combination of in-distribution exposure, victim-class specificity, and training-time scheduling, which together produce clean outputs for every class the attacker did not target. The gating is also what makes the backdoor hard to stumble on at audit time. A sampling audit that does not know the victim class draws from the full class distribution, and the chance of landing on the victim class under the right condition is low enough that the poisoned behavior stays buried across a typical sample budget.
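The sample-budget argument can be made concrete with a little arithmetic. The sketch below assumes the audit draws classes uniformly and that a victim-class sample exposes the gated behavior only with some probability. The function name and `condition_rate` are hypothetical illustrations, not figures from the paper; the class count of 43 is GTSRB's.

```python
def audit_miss_probability(num_classes: int, sample_budget: int, condition_rate: float) -> float:
    """Chance that a uniform sampling audit never hits the victim class under the exposing condition.

    condition_rate is a hypothetical stand-in for how often a victim-class sample
    would meet the condition that exposes the gated behavior.
    """
    if num_classes < 1 or sample_budget < 0 or not 0.0 <= condition_rate <= 1.0:
        raise ValueError("invalid audit parameters")
    # One draw exposes the backdoor only if it lands on the victim class
    # (1 / num_classes) and also meets the exposing condition.
    per_sample_hit = condition_rate / num_classes
    return (1.0 - per_sample_hit) ** sample_budget


# Illustrative only: 43 classes, with an assumed 1% condition rate.
for budget in (100, 1_000, 10_000):
    print(budget, round(audit_miss_probability(43, budget, 0.01), 3))
```

Under these assumed numbers, a 1,000-sample audit still misses the behavior more often than not, which is the sense in which the backdoor "stays buried."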

The attack supports several modes: targeted attacks both against specific source classes and toward specific target classes; multiple sub-image backdoors that reconstruct specific features at different locations across several output images; and in-painting driven by time-conditioned triggers. The arXiv paper lists all three as within the framework’s scope.

Why don’t prompt filters and output-only audits catch timestep-gated poisoning?

A backdoor that activates only inside a training-timestep window is invisible to prompt filters and output-sampling audits, because both operate at inference time, where the poisoned behavior is gated off.

The safety stack most image-model teams have actually built has two inference-time layers. Prompt-side filters intercept banned concepts before generation. Output-side audits sample the checkpoint and look for anomalous or unsafe generations. Neither layer has any view of the training loop, the loss curve, or which timesteps saw the poisoned data. That is precisely the gap TEMPO-Diffusion is designed to fall through.

If the malicious distribution shift is associated only with a specific victim class and only during a scheduled training window, then sampling the model on a clean validation set produces clean outputs. The poisoned behavior is real and recoverable by an attacker who controls the relevant condition, but it does not appear in the slice of behavior that inference-time auditing inspects. A clean validation pass at inference is therefore consistent with a contaminated checkpoint.

The second-order consequence lands on teams whose safety posture is built around output audits. Content-policy checks scoped to banned or unsafe outputs will pass a model whose contamination is about provenance rather than content; the audit answers “does this model generate bad images?” when the relevant question has become “did this model see bad training?”

The claim generalizes past this one paper: inference-time validation cannot certify a model whose contamination is gated to training-time exposure. The audit has to move upstream of inference to find it, and most pipelines have nothing upstream to move to.

How does this become a synthetic-data supply-chain problem?

On CIFAR10, GTSRB, and a new balanced Canadian and U.S. traffic-sign dataset the authors call CALISA, TEMPO-Diffusion reliably poisoned class-specific synthetic-data generation and drove high attack success rates in downstream classifiers trained on that synthetic data. That makes it a direct synthetic-data supply-chain risk rather than a single-image generation defect.

The threat model that makes this bite is supply-chain injection: the attacker ships a poisoned checkpoint or a poisoned synthetic dataset into someone else’s training pipeline. Once the synthetic data is class-specific and the contamination is gated, a classifier trained on that synthetic data inherits the backdoor, and the arXiv paper reports high attack success rates in exactly that downstream setting. The same logic extends to distillation: a poisoned teacher transfers its gated behavior to the student, and the student’s training run looks clean because the contamination was injected before either model existed.

The choice of datasets sharpens the point. GTSRB is a traffic-sign recognition benchmark, and CALISA extends that domain to a balanced Canadian and U.S. traffic-sign set. Traffic-sign recognition is a domain where a downstream misclassification has concrete rather than cosmetic consequences. The abstract describes the attack as reliably poisoning class-specific generation across all three datasets but does not publish a single headline attack-success-rate figure, so the honest characterization is “high and consistent,” not a specific number.

What do existing diffusion-backdoor defenses actually inspect?

Existing diffusion-backdoor defenses operate at inference time, and none inspect the training schedule or timestep exposure that TEMPO-Diffusion weaponizes. The TEMPO paper frames prior work as assuming attackers control the inference-time noise seed, the assumption its training-time exposure window discards.

The structural reason repeats across inference-time approaches. A defense that looks for the trigger in input noise assumes the trigger appears at inference; TEMPO exposes it during training. A defense that reverse-engineers the trigger assumes there is an inference-time trigger to invert; here the attacker supplies none. A defense that scans outputs for anomalous generations assumes the backdoor surfaces on the classes an auditor samples; victim-class gating keeps it off that slice. A defense that could catch TEMPO would need access to the training run itself, which is exactly what a downstream consumer of a public checkpoint does not have.

TNC-Defense, also published as Backdoor Sentinel, is the closest existing defense on TEMPO’s threat axis. It exploits what the authors call “temporal noise unconsistency,” the disrupted noise predictions between adjacent timesteps when a trigger fires, and reports an 11% improvement in average detection accuracy alongside invalidation of 98.5% of triggered samples. But TNC-Defense still operates at inference: the auditor queries noise predictions across timesteps at sampling time. It inspects the model’s behavior under trigger conditions, not the training schedule that produced the model. That is the exact distinction TEMPO-Diffusion relies on.

Diff-Cleanse takes a different route: a two-stage trigger-inversion plus structural-pruning pipeline that reports near-100% detection across hundreds of backdoored diffusion models. Its detection signal is the reconstructed attacker trigger, which the second stage then prunes away. The method assumes there is an inference-time trigger to invert; TEMPO-Diffusion supplies no such trigger, so the inversion has nothing to reconstruct.

What should fine-tuning and distillation teams add to checkpoint audits?

If a checkpoint’s poisoning is gated to training timesteps, an audit has to inspect the training schedule and per-timestep loss, not just sampled outputs, because an inference-time clean bill of health is no longer sufficient evidence that the model is clean.

For teams fine-tuning or distilling from a public image-model checkpoint, the addition is a per-timestep-loss audit. If the loss for a specific class diverges on a subset of timesteps that other classes do not share, that is the signature a TEMPO-style backdoor would leave behind. Sampling-based audits miss it because they never run the training loop; they only observe finished behavior.
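One way to look for that signature, as a minimal sketch rather than a method from the paper: take a table of mean loss per class and per timestep bucket, compare each class against the median of the other classes, and flag buckets where one class departs from its own usual gap. The function name, the `threshold` and `min_gap` defaults, and the synthetic data are all assumptions for illustration.

```python
from statistics import median


def flag_timestep_anomalies(loss_by_class, threshold=6.0, min_gap=0.05):
    """Return (class_id, bucket, deviation) where one class's loss departs from the rest.

    loss_by_class maps class_id -> list of mean losses, one per timestep bucket.
    threshold and min_gap are illustrative defaults, not values from the paper.
    """
    classes = sorted(loss_by_class)
    if len(classes) < 2:
        raise ValueError("need at least two classes to compare")
    n_buckets = len(loss_by_class[classes[0]])
    if any(len(loss_by_class[c]) != n_buckets for c in classes):
        raise ValueError("every class needs the same number of buckets")

    flagged = []
    for c in classes:
        # Gap between this class and the median of all other classes, per bucket.
        gaps = [
            loss_by_class[c][t] - median([loss_by_class[o][t] for o in classes if o != c])
            for t in range(n_buckets)
        ]
        baseline = median(gaps)                              # the class's usual offset
        spread = median([abs(g - baseline) for g in gaps])   # median absolute deviation
        for t, gap in enumerate(gaps):
            deviation = gap - baseline
            # Require both a robust-statistics outlier and an absolute floor.
            if abs(deviation) > max(min_gap, threshold * spread):
                flagged.append((c, t, deviation))
    return flagged


# Synthetic example: four classes, ten buckets, class 2 spikes in buckets 3 and 4.
losses = {c: [0.5 - 0.03 * t + 0.01 * c for t in range(10)] for c in range(4)}
losses[2][3] += 0.2
losses[2][4] += 0.2
print([(c, t) for c, t, _ in flag_timestep_anomalies(losses)])
```

On this synthetic table only class 2 is flagged, at buckets 3 and 4. The absolute floor matters: a spike in one class shifts the cross-class median slightly for its neighbors, and without `min_gap` those small shifts would be reported as anomalies too. Real loss tables are noisier, so both thresholds would need tuning per run.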

The structural problem is that the audit most likely to catch this is the one a public-checkpoint consumer cannot run. A downloaded checkpoint rarely arrives with its training schedule, per-timestep logs, or data provenance attached. The per-timestep-loss audit is only available on runs you control. The implication is a sourcing rule: prefer checkpoints whose training is reproducible or logged, and treat unlogged public checkpoints as uncertifiable against this class of attack regardless of how clean their sampled outputs look.

A practical audit path, drawn from the gap the paper exposes:

  • Log and review per-timestep loss, broken out by class, for any fine-tune or distillation run you consume or produce.
  • Treat class-specific, timestep-localized loss anomalies as a backdoor signal worth investigating, not a numerical curiosity.
  • Sample the victim class under conditions that would expose the gated behavior, not only on a clean validation set.
  • Where the checkpoint is a black box with no training schedule attached, assume the inference-time audit cannot certify it. That is the honest state of the tooling today.
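The first item in the list above could be sketched as a small accumulator called from inside a training loop you control. This is an illustrative helper, not tooling from the paper: the class name, the bucket count, and the 1,000-step default (a common diffusion training setting) are assumptions, and the loss values would come from whatever per-sample loss your trainer already computes.

```python
from collections import defaultdict


class TimestepLossLog:
    """Accumulates mean training loss per (class, timestep bucket)."""

    def __init__(self, total_timesteps=1000, num_buckets=10):
        if total_timesteps < 1 or not 1 <= num_buckets <= total_timesteps:
            raise ValueError("invalid bucket configuration")
        self.total_timesteps = total_timesteps
        self.num_buckets = num_buckets
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def record(self, class_id, timestep, loss):
        """Add one per-sample loss observed at a given diffusion timestep."""
        if not 0 <= timestep < self.total_timesteps:
            raise ValueError("timestep out of range")
        # Integer bucketing: timesteps 0..99 -> bucket 0, 100..199 -> bucket 1, etc.
        bucket = timestep * self.num_buckets // self.total_timesteps
        self._sum[(class_id, bucket)] += float(loss)
        self._count[(class_id, bucket)] += 1

    def mean_table(self):
        """Return {class_id: [mean loss per bucket]}, with None for empty buckets."""
        classes = sorted({class_id for class_id, _ in self._count})
        table = {}
        for class_id in classes:
            row = []
            for bucket in range(self.num_buckets):
                n = self._count.get((class_id, bucket), 0)
                row.append(self._sum.get((class_id, bucket), 0.0) / n if n else None)
            table[class_id] = row
        return table


# Usage inside a training step (values here are placeholders):
log = TimestepLossLog()
log.record(class_id=0, timestep=5, loss=1.0)
log.record(class_id=0, timestep=99, loss=3.0)
log.record(class_id=0, timestep=999, loss=2.0)
print(log.mean_table()[0])
```

Bucketing keeps the table small enough to review by eye; finer buckets localize an anomaly more precisely at the cost of noisier means.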

The broader read is that diffusion-backdoor defense organized itself around inference time because that is where the prior attacks lived. TEMPO-Diffusion is a reminder that a clean inference pass is only evidence about inference behavior. The poisoned model and the clean model look identical from the sampling side; the difference is in the schedule, which the audit has to be given access to in the first place.

Frequently Asked Questions

TEMPO-Diffusion and EMPDiffusion both involve timesteps. What is the actual difference?

EMPDiffusion already scales its trigger across timesteps, so scaling alone is not TEMPO’s contribution. TEMPO’s claimed novelty is the combination of in-distribution exposure, victim-class gating, and training-time scheduling, which together let non-victim classes behave cleanly. EMPDiffusion still relies on an attacker-controlled inference-time input that TEMPO discards.

Does TEMPO-Diffusion extend to text-to-image models like Stable Diffusion?

The published results cover class-conditional datasets only: CIFAR10, GTSRB, and the authors’ new CALISA traffic-sign set. The paper does not report results on text-to-image diffusion, so whether the training-time, victim-class-gated mechanism survives a text-conditional conditioning path is unverified. Practitioners should not assume Stable Diffusion-style checkpoints are exempt.

What is still unknown about TEMPO-Diffusion’s mechanism?

The public preprint does not pin down which timesteps carry the trigger, what fraction of training exposure is poisoned, or the exact scheduler. The HTML version cuts off before the full threat model, so verify any scheduling parameter against the PDF before quoting it. The abstract reports only high attack success rates with no single figure attached to TEMPO itself.

Which defense family is furthest from catching a TEMPO-style attack?

The input-noise family, including DisDet and the LITeN and SpecDet variants, looks for the trigger inside the noise the attacker feeds at sampling time. TEMPO supplies no inference-time noise trigger, so that family has nothing to detect. ELIJAH, which reverse-engineers an attacker trigger, fails for the same reason: there is no inference-time trigger to invert.

What schedule-side audit can a checkpoint consumer run without the original training logs?

A team that cannot inspect the original training schedule can still log per-timestep, per-class loss on a short fine-tune of the downloaded checkpoint. A TEMPO-style backdoor that re-activates on the victim class would surface as a class-specific loss divergence on a subset of timesteps during that probe run. The signal is indirect and assumes the gated behavior survives the fine-tune, but it is the only schedule-side evidence available without the original logs.
