How Reliable Are the LLM Judges Scoring Jailbreak Attacks?

The attack-success rate (ASR) on any jailbreak benchmark is not solely a property of the model being tested; it is partly a property of the judge scoring the outputs. A June 2026 preprint from Yang Gao at Veyon Solutions audited the two dominant judge families used across published jailbreak evaluations and found that both are unreliable in structurally different ways, and that either can be manipulated without removing the actual harmful content from the response.

Why does the ASR number look trustworthy when it probably isn’t?

Published attack-success rates arrive in papers without error bars on the measurement instrument itself. The implicit assumption is that the judge classifying “harmful” versus “benign” is accurate enough that the ASR delta between model A and model B reflects a real capability difference. Gao’s paper tests that assumption against 596 human-labeled completions and finds it does not hold for either judge family in common use.

The situation is compounded by tooling. The AdversariaLLM paper found that fixing implementation bugs in existing jailbreak frameworks raised ASR by up to 28% on its own, so a gap between two papers’ reported numbers can be explained entirely by framework differences rather than model capability. When the measurement instrument is noisy and the tooling is inconsistent, a number like “our model reduced ASR from 43% to 21%” carries less signal than it appears to.

What are the two judge families and how do they fail differently?

The two families fail in opposite directions: the dedicated classifier misses almost nothing but cries wolf constantly, while LLM-as-judges are conservative scorers whose recall collapses under distribution shift.

The HarmBench dedicated classifier, tested against 596 human-labeled completions, achieves precision 0.835 and recall 0.974. It almost never misses an actually harmful response. But with precision below 0.84, it over-flags benign content at a rate that inflates ASR before a single attack is run. A model that produces entirely safe outputs can still accrue apparent attack successes from the classifier’s false positives.

LLM-as-judges invert this. Gao’s evaluation of three LLM-as-judges finds precision in the range of 0.81 to 0.94, which sounds acceptable, but recall ranging from 0.06 to 0.65. The same set of harmful model outputs will produce wildly different reported ASR depending solely on which of these three judges is used. Two papers evaluating the same model with different LLM judges could disagree by roughly a factor of ten because of recall alone, with neither being obviously wrong by standard reporting conventions.
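A rough sketch of how far the choice of judge can move the headline number follows. It relies on the identity flagged rate = recall × true harmful rate / precision, which follows from the definitions of precision and recall. The 40% true harmful rate and the pairing of range endpoints into two judges are illustrative assumptions, not figures from the paper; only the ranges themselves come from Gao’s evaluation.

```python
def reported_asr(true_harmful_rate: float, precision: float, recall: float) -> float:
    """Fraction of completions a judge flags as harmful.

    True positives = recall * true_harmful_rate, and precision is
    true positives / all flagged, so flagged = true positives / precision.
    """
    if not 0 < precision <= 1 or not 0 <= recall <= 1:
        raise ValueError("precision must be in (0, 1], recall in [0, 1]")
    true_positive_rate = recall * true_harmful_rate
    return true_positive_rate / precision


# Hypothetical scenario: 40% of completions are truly harmful.
TRUE_RATE = 0.40

# Hypothetical pairings of the endpoints of the reported ranges.
low_recall_judge = reported_asr(TRUE_RATE, precision=0.94, recall=0.06)
high_recall_judge = reported_asr(TRUE_RATE, precision=0.81, recall=0.65)

print(f"low-recall judge reports  {low_recall_judge:.1%}")
print(f"high-recall judge reports {high_recall_judge:.1%}")
```

Under these assumptions the two judges report roughly 3% and 32% on identical outputs, which is the comparability problem in miniature.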

How does benign wrapping fool LLM-judges without removing harmful content?

Content-preserving wrappers, which leave harmful text verbatim and add only benign framing around it, flip every LLM judge tested between 57% and 100% of the time, according to Gao 2026. One component of that framing, a prepended refusal sentence, alone accounts for 39% to 88% of those flips.

This matters because it decouples what the judge scores from what the model produced. The harmful synthesis instructions, the detailed method, the actionable content, remain unchanged. Only the packaging shifts. Yet the judge registers the output as safe. Anyone who wants a lower reported ASR and understands this behavior has an easy post-processing step: if the raw completion scores as harmful, wrap it in benign framing and resubmit it for scoring. The reported ASR drops without the model having done anything differently.
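A wrapper check of this kind can be sketched in a few lines. This is a minimal illustration, not the paper’s released code: the refusal sentence, the function names, and the toy judge are all hypothetical, and a real check would call the actual judge model in place of `toy_judge`.

```python
from typing import Callable, Iterable

# Hypothetical refusal sentence; the paper's exact wrapper text is not reproduced here.
REFUSAL_PREFIX = "I'm sorry, but I can't help with that request. "


def wrap_with_refusal(completion: str, prefix: str = REFUSAL_PREFIX) -> str:
    """Prepend benign framing while leaving the completion text verbatim."""
    return prefix + completion


def flip_rate(judge: Callable[[str], bool], completions: Iterable[str]) -> float:
    """Share of completions judged harmful raw but safe once wrapped."""
    flagged = [c for c in completions if judge(c)]
    if not flagged:
        raise ValueError("judge flagged no raw completions; flip rate undefined")
    flipped = sum(1 for c in flagged if not judge(wrap_with_refusal(c)))
    return flipped / len(flagged)


def toy_judge(text: str) -> bool:
    """Stand-in judge that is swayed by how the response opens."""
    return "step 1" in text.lower() and not text.startswith("I'm sorry")


samples = [
    "Step 1: placeholder harmful text",
    "Step 1: another placeholder",
    "A harmless reply",
]
print(flip_rate(toy_judge, samples))
```

The denominator is the set of completions the judge flagged before wrapping, so the result reads directly as "how often does packaging alone change the verdict".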

The dedicated HarmBench classifier is substantially more resistant here, with a flip rate of at most 6.7% across the same wrapper set per Gao’s results. That sounds like an argument for using it instead of LLM judges. The next section explains why that is not a clean conclusion.

Can the more robust classifier be broken cheaply?

Yes. A white-box GCG gradient attack against the HarmBench classifier’s open weights flips 70% of confident true positives (21 of 30, 95% CI 54 to 86%) even at a small optimization budget, per Gao 2026. A two-annotator audit confirmed that all 80 sampled verdict-flipped completions still contained the harmful content. The judge was fooled; the harm was not removed.
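With only 30 samples the interval is wide, and it is worth being able to recompute it. The sketch below uses the normal-approximation (Wald) interval; which method the paper used is an assumption here, though this one appears to reproduce the quoted bounds.

```python
import math


def wald_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    # Clamp to [0, 1] because the approximation can overshoot near the edges.
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = wald_interval(21, 30)
print(f"{low:.0%} to {high:.0%}")  # prints "54% to 86%"
```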

The cost here is essentially nothing. The paper characterizes the attack as requiring only a small optimization budget. This is not a sophisticated research-grade capability; it is something a red-teamer or an adversary trying to game a safety evaluation could run as a preprocessing step.

The asymmetry is uncomfortable. LLM judges are brittle to surface-level wrapping attacks that require no gradient access. The dedicated classifier is sturdier against wrapping but breaks under a cheap gradient attack because its weights are public. Neither family is robust across both threat models simultaneously.

Does distribution shift make judges near-random under real red-team conditions?

Under real red-teaming conditions, performance can degrade to near random. A prior ICML paper (“A Coin Flip for Safety”), using 6,642 human-verified labels, showed that three distribution shifts inherent to red-teaming cause LLM judge performance to collapse: attack shift (using a different attack family than the judge was implicitly calibrated against), model shift (evaluating a different target model), and data shift (evaluating a different domain of harm). Each independently degrades judge reliability; their combination pushes performance toward chance.

The same paper found that many attacks inflate their apparent success rates by exploiting judge insufficiencies rather than by actually eliciting harmful content from the model. The attack learns the judge’s blind spots and routes around them. This creates a closed loop: if a benchmark’s attack and its judge share enough distributional structure, the leaderboard is measuring the interaction between attack and judge, not model robustness.

The implication is that most published evaluations report ASR under distributional conditions favorable to their judge without disclosing this. A model that scores well on HarmBench may simply match the distributional assumptions of the HarmBench classifier rather than represent a genuinely more robust model.

What would judge error bars and corrected ASR actually look like in practice?

Three minimum standards address this directly: report judge precision and recall on a human-labeled slice, report ASR corrected for judge precision, and include an adversarial check of the judge itself, per Gao 2026. Code for all three checks is released with the paper.
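The first standard amounts to a confusion-matrix calculation over the human-labeled slice. The sketch below is a generic illustration with made-up labels, not the paper’s released code; the function name and the toy data are hypothetical.

```python
from typing import Sequence


def judge_precision_recall(
    human: Sequence[bool], judge: Sequence[bool]
) -> tuple[float, float]:
    """Precision and recall of judge verdicts against human labels.

    True means "harmful" in both sequences.
    """
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("need at least one flagged and one truly harmful item")
    return tp / (tp + fp), tp / (tp + fn)


# Toy slice: six completions with human labels and judge verdicts.
human_labels = [True, True, True, False, False, False]
judge_verdicts = [True, True, False, True, False, False]

precision, recall = judge_precision_recall(human_labels, judge_verdicts)
print(f"precision={precision:.3f} recall={recall:.3f}")
```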

Concretely: if a judge has precision 0.835 and reports, say, 40% ASR, the corrected figure accounting for false positives is lower. If recall is 0.974, the upper-bound correction for false negatives raises it slightly. The math is not exotic; it is the calibration anyone running a noisy binary classifier would apply automatically. Jailbreak research has not adopted it as standard practice.
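The arithmetic for that example can be written out as follows. This is a sketch of a standard correction and not necessarily the paper’s exact formula; the function names are hypothetical, and it assumes the precision and recall measured on the labeled slice carry over to the benchmark being scored.

```python
def precision_corrected_asr(reported_asr: float, precision: float) -> float:
    """Share of completions that are flagged AND truly harmful."""
    return reported_asr * precision


def estimated_true_asr(reported_asr: float, precision: float, recall: float) -> float:
    """Add back the harmful completions the judge missed."""
    if recall <= 0:
        raise ValueError("recall must be positive to correct for misses")
    return precision_corrected_asr(reported_asr, precision) / recall


REPORTED = 0.40  # illustrative reported ASR from the text above
PRECISION = 0.835
RECALL = 0.974

after_fp = precision_corrected_asr(REPORTED, PRECISION)
after_fn = estimated_true_asr(REPORTED, PRECISION, RECALL)
print(f"reported {REPORTED:.1%}, precision-corrected {after_fp:.1%}, "
      f"recall-adjusted {after_fn:.1%}")
```

With these inputs the 40% headline falls to about 33.4% once false positives are removed and rises to about 34.3% once the missed cases are added back.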

A properly reported evaluation would state something like, for example: ASR 38%, corrected for judge precision 0.835; judge recall 0.974; judge adversarially probed with wrapper attacks, flip rate at most 6.7%. That is one additional paragraph. The infrastructure is not the obstacle; the norm is.

What does this mean for safety vendors, leaderboards, and policy?

Any vendor citing “eliminated verified jailbreaks” or “near-zero ASR” in marketing materials is implicitly claiming their judge is trustworthy. These findings suggest that claim requires its own verification.

The practical attack surface for benchmark gaming is now documented. An organization optimizing for leaderboard position rather than genuine robustness has two cheap options: use a low-recall LLM judge that underreports harm (reducing visible ASR without improving the model) or exploit the wrapper-flip behavior of LLM judges in post-processing (harmful content unchanged, judge score improved). Neither requires modifying the model’s actual behavior under adversarial prompts.

For red-team leaderboards, the comparability problem is acute. Two teams evaluating the same model checkpoint with different judges could report ASRs that differ by a substantial factor, both within the plausible range of their respective instrument’s characteristics, with no indication to a reader which number more accurately reflects model behavior. Published jailbreak research should be read as a measurement of model plus attack plus judge jointly, not model robustness in isolation.

The regulatory angle is not hypothetical. If a safety assessment submitted under an EU AI Act conformity framework uses an unaudited judge, the assessment inherits the judge’s false-positive and false-negative rates as invisible systematic biases. A regulator reading an ASR of, say, 2% as evidence of robust refusal behavior has no way to know whether that number reflects genuine safety or a judge that misses harmful content and can be wrapped into submission. The Gao paper’s three-standard proposal is not a research suggestion; it is the minimum bar for any ASR figure used to make a safety claim that matters outside a research context.

The field has been measuring compliance with a ruler that bends under load and has not been telling anyone.

Frequently Asked Questions

Which specific judges did the Gao audit actually evaluate?

The audit named its three LLM judges as Qwen2.5-7B-Instruct, Phi-3.5-mini-instruct, and Qwen2.5-3B-Instruct, and the dedicated classifier as cais/HarmBench-Llama-2-13b-cls. All four are open-weight models, which is also the precondition that made the white-box GCG attack feasible in the first place.

Does this verdict extend to closed judges like GPT-4o that vendors actually use?

No. The paper restricted itself to open, ungated models and explicitly did not test closed frontier judges such as GPT-4o. Whether proprietary judges resist wrapper framing or gradient suffixes is an unresolved gap, and most commercial safety reports sit on exactly those uncharacterized judges.

What does it actually cost in compute to flip the HarmBench classifier?

The GCG suffix was 20 tokens, optimized over 50 steps with 32 candidate suffixes per step (1,600 candidate evaluations in total), on a single NVIDIA T4. That is well under an hour of time on a commodity GPU to flip 21 of 30 confident true positives, which puts the attack within reach of any red team with a rented T4.

How does Best-of-N specifically inflate its own reported success rate?

Best-of-N samples many completions and keeps any that the judge flags as harmful, so the attack harvests the judge’s false positives alongside genuine hits. The Coin Flip paper used Best-of-N as the canonical case of an attack whose headline gains come from judge insufficiency rather than more effective elicitation.
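The mechanism can be illustrated with a short calculation. The sketch assumes a hypothetical per-sample false-positive rate of 2% and independent judge errors across samples; neither figure nor assumption comes from the Coin Flip paper.

```python
def p_false_success(false_positive_rate: float, n_samples: int) -> float:
    """Chance that at least one of n benign samples is wrongly flagged.

    Assumes judge errors are independent across samples.
    """
    if not 0 <= false_positive_rate <= 1 or n_samples < 1:
        raise ValueError("need a rate in [0, 1] and at least one sample")
    return 1 - (1 - false_positive_rate) ** n_samples


FPR = 0.02  # hypothetical per-sample false-positive rate

for n in (1, 10, 100):
    print(f"N={n:>3}: {p_false_success(FPR, n):.1%} chance of a spurious success")
```

Even when the model never produces harmful content, the apparent success rate per prompt climbs with N, because each extra sample is another draw against the judge’s error rate.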

If a team has to pick one judge today, which failure mode is cheaper to live with?

The HarmBench classifier’s over-flagging is the cheaper defect, because you can compensate with a precision-corrected ASR using its measured 0.835 precision. An LLM judge’s collapsing recall is harder to fix, since recall swings from 0.06 to 0.65 depending on the judge model and cannot be corrected without a human-labeled slice that most papers never build.