The assumption that more safety training makes a model harder to jailbreak is wrong, at least under the evaluation framework presented in arXiv:2606.05614. Published June 4 by researchers at Singapore University of Technology and Design, the paper introduces the “Safety Paradox”: a single-query attack that converts a model’s own safety awareness into an exploit vector. Under the authors’ evaluation across 30 open-source models and several frontier systems, models with sharper refusal reasoning were consistently easier to break.
What Is the Safety Paradox
Most jailbreak research targets the model’s input processing: adversarial suffixes, prompt injections, encoding tricks. The Safety Paradox paper takes a different route. Its central observation is that safety alignment trains a model to recognize harmful content, and that recognition capability is itself exploitable.
The paper frames this in Bayesian terms. Standard jailbreaks attack the prior, P(Y=1|X=x), trying to craft an input x that the model doesn’t flag. The authors’ “Posterior Attack” instead targets P(Y=1|X=x,Z=1): given that the model has already identified content as unsafe (Z=1), what does that content look like? The attack asks the model to generate exactly that.
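The two quantities can be set side by side. The expansion below is only Bayes' rule applied to the notation above, not a formula quoted from the paper. Reading Y=1 as "the content is harmful" and Z=1 as "the model flags it as unsafe" is an assumption drawn from the description here.

```latex
\begin{align*}
\text{standard jailbreak:}\quad & P(Y=1 \mid X=x) \\
\text{Posterior Attack:}\quad & P(Y=1 \mid X=x, Z=1)
  = \frac{P(Z=1 \mid X=x, Y=1)\, P(Y=1 \mid X=x)}{P(Z=1 \mid X=x)}
\end{align*}
```

Under that reading, the factor P(Z=1|X=x,Y=1) is how reliably the model flags harmful content, which is one way to see why sharper safety judgment could raise the posterior rather than lower it.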
How Posterior Attack Works
The mechanism is straightforward. Rather than optimizing adversarial tokens or searching for jailbreak suffixes, the Posterior Attack constructs a single prompt that asks the model to describe or generate the content its internal safety classifier would flag. Because safety-aligned models have been trained to recognize harmful patterns, they possess a latent representation of what “harmful” looks like. The attack inverts that representation: instead of refusing, the model produces a detailed characterization of the thing it would refuse.
No gradient access is required. No search over token space. No multi-turn scaffolding. A single query suffices, which makes the attack both fast and difficult to detect at the request level. The attack code is publicly available.
The Numbers: Correlation Across 30+ Models
Under the authors’ evaluation, the correlation between a model’s safety-judgment accuracy and its susceptibility to Posterior Attack is strong: Pearson r = 0.80, Spearman ρ = 0.78, measured across 30 open-source LLMs up to 35B parameters and frontier models including GPT-5-Chat and Claude Sonnet 4.6.
The clustering tells the story. Older and smaller architectures, specifically the Llama 2 family, Llama 3.2 1B and 3B, and Qwen 3 0.6B, cluster below 65% safety-judgment accuracy and maintain near-zero attack success rates. These models don’t understand harmful content well enough to describe it on demand. Newer models show the opposite pattern: Qwen 3.6 and Gemma 4 exceed 85% safety-judgment accuracy under the authors’ measurement but suffer attack success rates above 80%, approaching complete exploitability.
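For readers who want to run the same kind of check on their own model set, here is a minimal sketch of the two statistics named above, written in plain Python. The function names are hypothetical and the numbers are synthetic placeholders chosen only to make the script run; they are not figures from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Synthetic placeholder values, NOT data from the paper.
judgment_accuracy = [0.55, 0.60, 0.62, 0.75, 0.86, 0.90]
attack_success = [0.01, 0.03, 0.02, 0.40, 0.82, 0.85]

print(f"Pearson r = {pearson(judgment_accuracy, attack_success):.2f}")
print(f"Spearman rho = {spearman(judgment_accuracy, attack_success):.2f}")
```

Pearson measures how linear the relationship is, while Spearman only asks whether the ordering agrees, so reporting both guards against a result driven by a few extreme models.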
Why Newer, Safer Models Are Softer Targets
The paper traces the vulnerability to a functional side effect of alignment. Safety training, whether through RLHF, constitutional AI methods, or similar approaches, cultivates what the authors describe as a latent capacity to recognize unsafe content. Prior mechanistic work has linked this capacity to linearly separable internal representations and localized “safety neurons” or “safety layers” within the model.
The Posterior Attack exploits the gap between suppression and understanding. A model that merely recites refusal templates without understanding why a request is dangerous is, paradoxically, more resistant. A model that genuinely understands the harmful category can be prompted to enumerate it.
The GRPO Causal Experiment
Correlation alone doesn’t establish causation. The authors address this with an intervention using GRPO (Group Relative Policy Optimization). By artificially degrading a model’s safety-judgment accuracy, they reduced its vulnerability to Posterior Attack. By enhancing judgment accuracy, they increased it.
This is the paper’s strongest claim to a causal mechanism rather than mere correlation. Because the intervention changes judgment accuracy within a given model, the effect is hard to attribute to model scale or architecture family. The direction of the intervention is what matters: worse safety judgment, better resistance; better safety judgment, worse resistance.
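For readers unfamiliar with GRPO, its defining step is scoring each sampled response relative to the others in its group. The sketch below shows only that group-relative normalization as it is commonly described. The reward values (1 for a correct safe/unsafe judgment, 0 otherwise) are an assumed stand-in, since the authors' actual reward design is not described here, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; implementations differ on
    this detail, so treat the choice as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled judgments of one prompt.
rewards = [1.0, 0.0, 1.0, 1.0]
for r, a in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={r:.1f} advantage={a:+.2f}")
```

Responses that beat their group's average get a positive advantage and are reinforced; the rest are pushed down. Which behavior counts as "better" is set entirely by the reward, which is what lets the same procedure move judgment accuracy in either direction.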
What This Means for LLM Deployers
The practical takeaway is direct: safety alignment is a behavior adjustment, not a security boundary. Treating a safety-RLHF pass as the primary defense against misuse assumes that alignment monotonically reduces risk. Under the authors’ evaluation conditions, that assumption does not hold.
For production deployments, three consequences follow:
Layer external guardrails. Input/output classifiers, content filters, and monitoring systems that operate independently of the model’s internal safety training provide defense in depth. Relying solely on the model’s refusal behavior is insufficient.
Evaluate your specific model, not the alignment tier. A “more aligned” checkpoint may carry higher posterior risk than a less-aligned predecessor. The paper’s correlation data suggests this is particularly true for recent-generation models with strong safety-judgment accuracy.
Monitor for posterior-style prompts. The attack does not require adversarial token optimization or unusual encoding. Queries that ask a model to describe or enumerate harmful content, even in abstract terms, should be treated as suspicious.
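The layering described in these three points can be sketched as a thin wrapper that puts independent checks on both sides of the model call. Everything here is hypothetical: `guarded_generate`, the check functions, and the toy model are illustrative stand-ins, and a real deployment would plug in separately trained classifiers and an actual model client.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "Request declined by policy."

@dataclass
class GuardedResult:
    text: str
    blocked_by: Optional[str]  # "input", "output", or None if nothing fired

def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_check: Callable[[str], bool],
    output_check: Callable[[str], bool],
    log: Callable[[str, str], None] = lambda stage, text: None,
) -> GuardedResult:
    """Run the model between two checks that do not rely on its own refusals."""
    if input_check(prompt):
        log("input", prompt)  # monitoring hook for suspicious requests
        return GuardedResult(REFUSAL, "input")
    completion = model(prompt)
    if output_check(completion):
        log("output", completion)  # catches what the input check missed
        return GuardedResult(REFUSAL, "output")
    return GuardedResult(completion, None)

# Toy stand-ins so the sketch runs end to end (assumptions, not real filters).
def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"

def toy_input_check(prompt: str) -> bool:
    return "forbidden" in prompt.lower()

def toy_output_check(text: str) -> bool:
    return "secret" in text.lower()

if __name__ == "__main__":
    for p in ["hello", "a FORBIDDEN request", "tell me the secret"]:
        result = guarded_generate(p, toy_model, toy_input_check, toy_output_check)
        print(p, "->", result)
```

The design point is that neither check consults the model's own safety judgment, so a prompt that turns that judgment against the model still has to get past components it cannot influence.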
Open Questions and Limitations
Several boundaries on the paper’s conclusions deserve attention.
Attack success rates are measured by an automated classifier (HarmBench’s LLaMA-2-13B), not by human evaluators. Whether the “successful” attacks produce genuinely harmful content in practice, or content that a human would flag as dangerous, is a separate question. Automated evaluation may over- or under-count real harm.
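One way to size that gap is to have humans label a sample of the same outputs and compare verdicts. The sketch below is a hypothetical helper, not part of HarmBench or the paper, and the labels are invented purely to make it run.

```python
def judge_error_rates(judge_labels, human_labels):
    """Compare automated-judge verdicts with human verdicts (True = harmful)."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(judge_labels, human_labels))
    n = len(pairs)
    over = sum(j and not h for j, h in pairs)   # judge says harmful, human disagrees
    under = sum(h and not j for j, h in pairs)  # human says harmful, judge misses it
    return {
        "judge_asr": sum(j for j, _ in pairs) / n,
        "human_asr": sum(h for _, h in pairs) / n,
        "over_count_rate": over / n,
        "under_count_rate": under / n,
    }

# Hypothetical labels for eight attack outputs; NOT data from the paper.
judge = [True, True, False, True, False, True, False, False]
human = [True, False, False, True, True, True, False, False]
print(judge_error_rates(judge, human))
```

Note that the two headline success rates can match even when the judge and the humans disagree on individual outputs, so the over- and under-count rates are worth reporting alongside them.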
The causal mechanism (GRPO interventions) is demonstrated on open-source models. The frontier-model correlations for GPT-5-Chat and Claude Sonnet 4.6 are observational. Vendors could reasonably argue that production deployments include additional runtime safeguards, input sanitization, or output filtering not captured in the paper’s evaluation. The paper does not test those layers.
The paper also does not propose a structural fix. The authors note that current alignment paradigms have structural flaws and that defense mechanisms require further refinement, but the path from observation to remedy is not mapped. Degrading safety judgment to resist attack is a proof of mechanism, not a deployment recommendation. A model that cannot recognize harmful content cannot refuse it reliably either.
The Safety Paradox is a finding about the architecture of alignment itself, not a specific exploit to patch. Until alignment methods can decouple the ability to recognize harmful content from the ability to generate it, the tension the paper identifies will persist.
Frequently Asked Questions
How does Posterior Attack differ from GCG or AutoDAN jailbreaks?
GCG and AutoDAN are optimization-based: they rely on white-box signals from the target model (gradients, in GCG’s case) and iterate over many candidate sequences to find an adversarial suffix or prompt. Posterior Attack uses neither gradients nor search: it is a single natural-language query. The tradeoff is that GCG-style methods can target models with weak alignment, while Posterior Attack specifically exploits strong safety training, making it ineffective against older models with poor harm-recognition capability.
Would the same mechanism apply to multimodal or agentic systems?
The evaluation covers text-only LLMs: open-source models up to 35B parameters plus frontier APIs. Multimodal models align image and audio understanding alongside text, and each aligned modality could carry its own posterior vulnerability if the model internalizes what harmful content looks like in that modality. Agentic systems that chain tool calls introduce additional surfaces: a posterior-style prompt could target the planning layer rather than the generation layer. Neither case has been tested.
Does using a safety-aligned model as the attack judge introduce circularity?
The HarmBench LLaMA-2-13B judge is a safety-aligned model, so it is likely to carry the same kind of internal safety representations as the targets it evaluates. This creates a potential blind spot: attacks that produce content this specific classifier does not flag get counted as failures, even if a human would judge the output harmful. The 520-behavior AdvBench benchmark also predates contemporary harm categories such as AI-generated CSAM or election misinformation, so measured attack success rates may not generalize to the full threat landscape.
What would a structural fix that avoids this vulnerability look like?
Mechanistic interpretability work has shown that safety behavior correlates with linearly separable activations in specific layers. If alignment training could produce refusal behavior without forming these separable representations, the model would lack the internal map that Posterior Attack exploits. Potential approaches include adversarial perturbation of the representation space during training, or decoupling the recognition pathway from the generation pathway at the architecture level. The paper demonstrates the problem but does not evaluate any of these remedies.