Why AI Red-Teaming Rediscovers the Same Jailbreaks and Misses the Rest

When an AI vendor publishes a safety report claiming their model was “red-teamed,” the implicit promise is that someone checked the attack surface thoroughly. Stable-GFlowNet (arXiv:2605.00553), an ICML 2026 Spotlight paper by Minchan Kwon, exposes a structural flaw in that promise: the automated attackers most teams use tend to find the same narrow band of jailbreaks over and over, leaving wide regions of vulnerability invisible to the audit.

What “We Red-Teamed It” Actually Means

Red-teaming for LLMs is the practice of proactively probing a model for harmful outputs before deployment. The idea borrows from cybersecurity, where a red team simulates adversarial behavior to find weaknesses. In the LLM context, the goal is to discover prompt-based attacks that bypass safety training and elicit prohibited content.

The problem is measurement. A red-team exercise produces two outputs: a list of discovered attacks and a count of attacks that succeeded. Vendors report the latter, often framed as an attack success rate (ASR) that declined after a safety training round. What goes unreported is how much of the actual attack space the red-team explored. A low ASR from a narrow search tells you less than a higher ASR from a broad one.

The Mode-Collapse Problem

Gradient-based and reinforcement-learning-based attackers optimize for attack success. That is their job. But optimization for success rate alone produces a well-known failure mode: the attacker converges on a small number of high-probability jailbreaks and stops exploring. The paper identifies this as mode collapse, where the automated red-teamer returns variants of the same attack rather than sampling the full distribution of possible exploits.

This is not a new observation in machine learning. Generative models of all kinds collapse toward high-density modes. In red-teaming, the consequence is concrete: the safety report lists a handful of attack categories, the vendor patches those specific categories, and the report reads as clean. The attacks the red-teamer never found remain unpatched.

Generative Flow Networks (GFNs) offer a theoretical fix. GFNs sample from distributions proportional to a reward function, which means they can, in principle, generate diverse attacks weighted by success. But GFNs have their own well-documented instability problems. The reward signal in red-teaming is noisy: a prompt that jailbreaks the model on one run may fail on the next. That variance destabilizes GFN training and accelerates the same mode collapse the method was supposed to prevent.
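To make the contrast concrete, here is a toy sketch of reward maximization versus reward-proportional sampling. The attack-family names and reward values are invented for illustration and do not come from the paper.

```python
def greedy_choice(rewards):
    """Reward maximization: always return the single best-scoring family."""
    return max(rewards, key=rewards.get)


def reward_proportional(rewards):
    """GFN-style target: each family's probability is reward / total reward."""
    total = sum(rewards.values())
    return {name: r / total for name, r in rewards.items()}


# Hypothetical attack families with made-up average success rewards.
rewards = {"roleplay": 0.5, "encoding": 0.3, "translation": 0.2}

print(greedy_choice(rewards))        # one family, every time
print(reward_proportional(rewards))  # every family keeps nonzero probability
```

A pure maximizer would report only the top family; a sampler that hits the proportional target would still surface the other two at a rate matching their reward.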

How Stable-GFN Works

Stable-GFN (S-GFN) makes two structural changes to the standard GFN framework.

First, it removes the partition function Z from training entirely. Instead of estimating Z directly, S-GFN uses pairwise comparisons between trajectories. This trades an unstable global estimate for a series of local comparisons, each of which is more tolerant of reward noise. The paper shows this preserves the optimal GFN policy while reducing the training variance that drives collapse.
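One way to see why pairwise comparisons can drop Z: under the standard trajectory-balance condition, the quantity log P_F(trajectory) - log R(x) equals -log Z for every trajectory when the policy is optimal, so the difference between any two trajectories is zero and Z cancels. The sketch below illustrates that idea only. It assumes an autoregressive generator whose backward policy is trivially 1, and it is not the paper's exact loss; the function names are hypothetical.

```python
import math
from itertools import combinations


def tb_residual(log_pf, log_reward):
    """Trajectory-balance residual with log Z omitted (backward policy assumed 1)."""
    return log_pf - log_reward


def pairwise_loss(log_pfs, log_rewards):
    """Mean squared difference of residuals over all trajectory pairs.

    At the optimum every residual equals the same constant (-log Z),
    so the loss is zero without Z ever being estimated.
    """
    residuals = [tb_residual(p, r) for p, r in zip(log_pfs, log_rewards)]
    pairs = list(combinations(residuals, 2))
    if not pairs:
        return 0.0
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


# Toy check: three trajectories with made-up rewards 1, 2, 5.
toy_rewards = [1.0, 2.0, 5.0]
log_r = [math.log(r) for r in toy_rewards]
optimal = [math.log(r / sum(toy_rewards)) for r in toy_rewards]  # P ∝ R
uniform = [math.log(1 / 3)] * 3                                  # ignores R

print(pairwise_loss(optimal, log_r))  # approximately zero
print(pairwise_loss(uniform, log_r))  # clearly positive
```

The reward-proportional policy scores near zero and the uniform policy does not, even though the code never computes or learns a value for Z.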

Second, S-GFN introduces a robust masking methodology that filters noisy or unreliable reward signals during training, and a fluency stabilizer that prevents the model from drifting into local optima that produce incoherent outputs. The latter matters because a jailbreak that only works on gibberish prompts is not operationally useful, and a red-teamer that optimizes for such prompts inflates its own success rate without finding real vulnerabilities.
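The article does not spell out how the paper's masking works, so the following is only one plausible reading of "filter unreliable reward signals": score each prompt several times and exclude it from the training signal when the repeated scores disagree too much. The function name, the `max_std` threshold, and the placeholder prompts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def masked_rewards(samples_per_prompt, max_std=0.3):
    """Map each prompt to its mean reward, or None if its scores are too noisy.

    samples_per_prompt: {prompt: [reward from each repeated evaluation]}
    max_std: hypothetical cutoff on the population standard deviation.
    """
    out = {}
    for prompt, scores in samples_per_prompt.items():
        out[prompt] = mean(scores) if pstdev(scores) <= max_std else None
    return out


# Placeholder prompts with made-up repeated judge scores.
samples = {
    "prompt_a": [1.0, 1.0, 1.0],       # consistent: kept
    "prompt_b": [1.0, 0.0, 1.0, 0.0],  # flips between runs: masked
}
print(masked_rewards(samples))
```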

The pairwise-comparison approach has precedent in preference-based RL and reward modeling, but using it to sidestep partition-function estimation in GFNs appears to be new. The paper’s contribution is less a novel architecture and more a targeted engineering fix for a known instability bottleneck.

What the Results Show

The paper reports “overwhelming attack performance and diversity” for S-GFN across multiple evaluation settings compared to prior automated red-teaming methods.

Two caveats are worth stating directly. First, the paper addresses automated red-teaming, specifically GFN-based approaches. It does not evaluate whether its findings generalize to human red-teams, which already use manual diversification strategies, or to other automated methods like prompt mutation or genetic search. Second, the paper measures attack diversity in the space of generated prompts, not in the space of real-world misuse patterns. A diverse set of synthetic jailbreaks may or may not correspond to the attacks actual users would attempt.

The Certification Gap

Current safety evaluations report attack success rate. Some report the number of distinct attack categories discovered. As of mid-2026, almost none report a formal measure of attack diversity across the feasible attack space. Without a diversity metric, a clean safety report is ambiguous: it could mean the model is robust, or it could mean the red-teamer’s search was narrow.
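As a rough illustration of what a diversity number could look like, the sketch below computes the mean pairwise Jaccard distance between the word sets of discovered prompts. This is a deliberately crude lexical proxy chosen for brevity, not the metric the paper uses; an embedding-based distance would be a more realistic choice. The prompt strings are neutral placeholders.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |shared words| / |all words| for two whitespace-tokenized strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def mean_pairwise_diversity(prompts):
    """Average Jaccard distance over all prompt pairs (0 = identical, 1 = disjoint)."""
    pairs = list(combinations(prompts, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


collapsed = ["alpha beta gamma", "alpha beta gamma", "alpha beta gamma"]
spread = ["alpha beta", "gamma delta", "epsilon zeta"]

print(mean_pairwise_diversity(collapsed))  # 0.0
print(mean_pairwise_diversity(spread))     # 1.0
```

Two attackers with the same ASR could land at opposite ends of this scale, which is the ambiguity a diversity figure in the report would resolve.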

This matters for policy. The EU AI Act and comparable regulatory frameworks require risk assessment for high-risk AI systems, but they do not specify how thorough the adversarial testing must be. A vendor can satisfy the letter of the requirement with a narrow automated sweep. The paper suggests that such sweeps may systematically undercount the attack surface, not by a small margin but by leaving entire categories of jailbreaks undiscovered.

The S-GFN paper is a methods contribution that points to a governance problem. Automated red-teaming tools will keep improving, and attacks that one generation misses will be found by the next. But the structural gap, the absence of a diversity requirement in safety certification, will persist until either regulators mandate it or vendors adopt it voluntarily. The paper does not claim to solve that governance problem. What it does is make the gap harder to ignore. If a Spotlight paper at one of the field’s top conferences can show that standard attackers systematically miss large regions of the attack space, then every safety report that relies on those attackers inherits that blind spot.

The fix is not a better algorithm alone. It is a better standard for what “we red-teamed it” requires.

Frequently Asked Questions

Does S-GFN improve human red-team exercises too?

The paper does not test this. It evaluates only automated attackers. Human red-teams already use manual diversification strategies like persona-based prompt variation and multilingual probing. The mode-collapse finding applies specifically to gradient-based and RL-based optimization loops that converge on high-reward modes. Human teams have their own coverage gaps, but they stem from cognitive bias and time constraints rather than mathematical convergence.

How do S-GFN’s pairwise comparisons differ from reward modeling in RLHF?

Both replace a scalar reward with relative judgments, but the targets differ. RLHF reward models compare two outputs of the target model to train a preference predictor. S-GFN compares two trajectories through the attack-generation process to estimate which path yields higher-reward jailbreaks without computing a global partition function. The structural similarity is deliberate: both trade an unstable global estimate for noise-tolerant local comparisons.

What types of jailbreaks do GFN-based methods still fail to find?

The paper measures diversity in the space of generated single-turn text prompts, not in the space of real-world misuse. A jailbreak that works via multi-turn social engineering may never appear in a GFN sample because the method generates single prompts in isolation. Jailbreaks that exploit multimodal inputs (images, audio) or tool-use interfaces fall outside the text-only threat model the paper evaluates.

What concrete metric should a safety report include alongside ASR?

A coverage estimate: the ratio of distinct attack categories discovered to an upper bound on feasible categories, similar to how fuzzing reports include edge coverage as a percentage of total edges. Without it, a 2% ASR from 5 attack categories and a 2% ASR from 50 categories produce identical safety reports. The same caveat applies to any safety score: it is only as informative as the diversity of adversarial inputs the model was tested against.
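A minimal sketch of such a report line, assuming a fixed taxonomy exists to serve as the upper bound. The 100-category taxonomy, the category labels, and the function name are all hypothetical; the 2% ASR and the 5-versus-50 category counts mirror the example above.

```python
def coverage_report(successes, attempts, found_categories, feasible_categories):
    """Return ASR alongside category coverage against an assumed taxonomy."""
    feasible = set(feasible_categories)
    found = set(found_categories) & feasible  # ignore labels outside the taxonomy
    return {
        "asr": successes / attempts if attempts else 0.0,
        "coverage": len(found) / len(feasible) if feasible else 0.0,
    }


# Hypothetical taxonomy of 100 attack categories.
taxonomy = [f"cat_{i}" for i in range(100)]

narrow = coverage_report(20, 1000, taxonomy[:5], taxonomy)
broad = coverage_report(20, 1000, taxonomy[:50], taxonomy)

print(narrow)  # same ASR, low coverage
print(broad)   # same ASR, ten times the coverage
```

Both runs report the same ASR, but the coverage field separates the narrow sweep from the broad one.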

Could regulators mandate attack diversity without stalling model releases?

The EU AI Act requires risk assessment for high-risk systems but specifies no minimum adversarial testing coverage, so a single narrow automated sweep satisfies the letter of the law. Mandating a diversity floor would require agreeing on a taxonomy of attack categories and a threshold for adequate coverage, neither of which exists as of mid-2026. A premature standard risks locking in a narrow taxonomy that misses novel attack vectors, making the certification counterproductive.