Why Attack Success Rate Misleads LLM Jailbreak Benchmarks

The ASR metric behind every jailbreak leaderboard collapses distinct safety failures into one number, so models with the same score can fail in completely different ways.

Attack Success Rate (ASR) is the headline metric on every major jailbreak benchmark, from JailbreakBench to StrongREJECT. A preprint from Chung-Ang University researchers (arXiv:2605.29629) now shows that ASR collapses mechanically distinct safety failures into a single number, and that two models with near-identical ASR scores can fail in completely different ways at the token level.

What ASR Actually Measures (And What It Discards)

ASR records a binary outcome: did the model produce a harmful response to a jailbreak prompt, yes or no? That pass/fail is assessed once, at the end of generation. Everything that happened during decoding is discarded.

According to Park, Ju and Lee, this means ASR cannot distinguish between three qualitatively different failure processes:

  1. Suppressed refusal. The model began generating a refusal token, then switched to compliance partway through. The safety mechanism fired but was overridden.
  2. Late refusal. The model produced compliant tokens for several steps before the refusal direction activated. By the time safety kicked in, the harmful content was already on the page.
  3. No refusal triggered. The model never entered a refusal trajectory at all. The safety mechanism was absent from the start.

All three receive the same label in an ASR evaluation: “attack succeeded.” For a practitioner comparing two models with similar mid-range ASR scores, the number is silent on which failure mode dominates, whether a targeted intervention could address it, or whether the two models share any structural similarity in how they break.
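To make the three categories concrete, here is a minimal sketch that labels an already-successful attack from a per-step score. It assumes a hypothetical signal that is positive when the model leans toward compliance and negative when it leans toward refusal. The function name, the sign convention, and the zero cutoff are illustrative assumptions, not the paper's classification procedure.

```python
from typing import Sequence


def classify_failure(margins: Sequence[float]) -> str:
    """Label a successful attack by where refusal-leaning steps occur.

    Assumed convention: margin > 0 leans toward compliance,
    margin < 0 leans toward refusal. The cutoff at zero is illustrative.
    """
    refusal_steps = [t for t, m in enumerate(margins) if m < 0]
    if not refusal_steps:
        # The refusal side never led at any step.
        return "no_refusal_triggered"
    if refusal_steps[0] == 0:
        # Refusal led at the first step; since the attack is known to have
        # succeeded, the refusal is treated as overridden later on.
        return "suppressed_refusal"
    # Compliance led first; refusal only appeared afterwards.
    return "late_refusal"


suppressed = classify_failure([-1.5, -0.2, 0.8, 2.0])
late = classify_failure([1.0, 0.7, -0.4, -1.2])
absent = classify_failure([0.9, 1.4, 2.0])
```

All three toy inputs would count as "attack succeeded" under ASR, yet they receive three different labels here.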

How TLO Recovers the Hidden Trajectory

The paper introduces Temporal Logit Observability (TLO), a diagnostic that operates on the full logit vectors emitted at each decoding step. No hidden-state access, no gradients, no model modification required. The key instrument is the Logit-Margin Score (LMS): at each generated token, TLO measures the logit gap between the highest-probability compliance token and the highest-probability refusal token from a fixed lexicon of compliance and refusal keywords.
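The margin described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the token ids, the toy six-token vocabulary, the function names, and the sign convention (compliance minus refusal, so positive values lean toward compliance) are all assumptions made for the example.

```python
from typing import List, Sequence


def logit_margin(step_logits: Sequence[float],
                 compliance_ids: Sequence[int],
                 refusal_ids: Sequence[int]) -> float:
    """Top compliance-lexicon logit minus top refusal-lexicon logit at one step."""
    top_compliance = max(step_logits[i] for i in compliance_ids)
    top_refusal = max(step_logits[i] for i in refusal_ids)
    return top_compliance - top_refusal


def lms_trajectory(all_logits: Sequence[Sequence[float]],
                   compliance_ids: Sequence[int],
                   refusal_ids: Sequence[int]) -> List[float]:
    """One margin per decoding step."""
    return [logit_margin(step, compliance_ids, refusal_ids) for step in all_logits]


# Toy data: ids 0-1 stand in for compliance keywords, ids 2-3 for refusal keywords.
toy_logits = [
    [1.0, 0.5, 3.0, 2.0, 0.0, 0.0],  # refusal leads
    [2.0, 2.5, 2.0, 1.0, 0.0, 0.0],  # compliance slightly ahead
    [4.0, 1.0, 0.5, 1.5, 0.0, 0.0],  # compliance clearly ahead
]
trajectory = lms_trajectory(toy_logits, compliance_ids=[0, 1], refusal_ids=[2, 3])
print(trajectory)  # [-2.0, 0.5, 2.5]
```

In this toy trace the margin starts on the refusal side and moves to the compliance side by the second step, which is the kind of shape a single pass/fail label cannot show.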

Projected across all decoding steps, LMS traces out a trajectory that reveals when (and whether) the model’s refusal direction activates during generation. TLO then maps each model-attack condition onto a calibrated 2D Relative Position (RP) plane, giving a geometric representation of the failure mode.

The logit trajectory separates successful from failed jailbreaks within the first few generated tokens, according to the authors. On 10 of 12 model-attack conditions tested, the logit-based LMS trajectory aligns with hidden-state refusal-direction probes, which do require internal model access. The implication: logit-level observability recovers much of the diagnostic signal that was previously available only through white-box methods.

What the RP Plane Reveals Across Models and Attacks

The paper tests TLO across four aligned LLMs and three jailbreak paradigms. The core result: attacks with nearly identical ASR occupy clearly distinct locations on the RP plane. According to the authors, model pairs and attack families that look interchangeable on a leaderboard are, under TLO, geometrically distant in their failure mechanics.

This matters because RP-plane displacement compresses as (1 − ASR) increases, that is, as attacks succeed less often. TLO therefore carries its primary diagnostic signal in the mid-ASR range, where ASR alone says least about why attacks succeed. When failures are rare (low ASR), the binary metric remains informative enough on its own. In the mid-range, where most real-world model comparisons actually happen, TLO fills a gap that ASR cannot.

From Diagnosis to Defense: The Early-Stop Rule

The diagnostic insight has a direct defensive application. The paper derives an early-stop rule from a quantity called t_cross: the decoding step at which the compliance-refusal logit margin crosses a calibrated threshold. When that crossing comes early and in the compliance direction, generation is halted before harmful content appears.
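A rule of this shape might look like the sketch below. The threshold, the cutoff for what counts as "early" (`max_step`), and the function names are hypothetical placeholders; the paper's calibrated values and exact decision logic are not given here, and the margin is assumed to be compliance minus refusal.

```python
from typing import Optional, Sequence


def t_cross(margins: Sequence[float], threshold: float) -> Optional[int]:
    """First decoding step whose margin reaches the threshold, or None if none does."""
    for step, margin in enumerate(margins):
        if margin >= threshold:
            return step
    return None


def should_halt(margins: Sequence[float], threshold: float, max_step: int) -> bool:
    """Halt only when the crossing exists and happens within the first max_step steps."""
    step = t_cross(margins, threshold)
    return step is not None and step <= max_step


# Illustrative values only: threshold 2.0, "early" means step index <= 3.
early_crossing = should_halt([-2.0, 0.5, 2.5], threshold=2.0, max_step=3)
never_crosses = should_halt([-2.0, -1.0, -0.5], threshold=2.0, max_step=3)
late_crossing = should_halt([0.0, 0.0, 0.0, 0.0, 0.0, 3.0], threshold=2.0, max_step=3)
```

In a real decoding loop the check would run incrementally on each new token rather than on a completed list, so generation can stop at the step where the condition first holds.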

According to the paper, this intervention cuts successful jailbreaks by more than half across the tested model and attack combinations, with no false alarms on benign queries. That absence of over-refusal is worth noting: many decoding-time guardrails buy safety at the cost of rejecting benign inputs, and the authors report no such tradeoff on the tested distribution.

This is not the only work exploiting logit-level signals for jailbreak defense. SelfGrader (arXiv:2604.01473v2) independently demonstrates that token-level logits are a viable safety signal, converting jailbreak detection into a numerical grading problem over the digit tokens 0 through 9. The SelfGrader authors report up to 22.66% ASR reduction on LLaMA-3-8B with 173x lower memory and 26x lower latency than gradient-based guardrails. The convergence from two independent groups on logit-level observability as a safety signal suggests this is a genuine research direction, not a single-paper artifact.
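For intuition, one way to turn logits over the ten digit tokens into a single score is a softmax followed by an expected value. This aggregation is an assumption chosen for illustration; SelfGrader's actual scoring rule may differ, and `digit_score` is a hypothetical name.

```python
import math
from typing import Sequence


def digit_score(digit_logits: Sequence[float]) -> float:
    """Expected digit (0-9) under a softmax over the ten digit-token logits."""
    if len(digit_logits) != 10:
        raise ValueError("expected exactly ten logits, one per digit token")
    peak = max(digit_logits)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(x - peak) for x in digit_logits]
    total = sum(weights)
    return sum(digit * w for digit, w in enumerate(weights)) / total


uniform = digit_score([0.0] * 10)             # no preference among digits
confident = digit_score([-50.0] * 9 + [50.0])  # nearly all mass on "9"
```

Uniform logits give the midpoint of the scale, while mass concentrated on one digit pushes the score toward that digit.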

Why Benchmarks and Leaderboards Need to Change

ASR is the de facto headline metric across JailbreakBench (100 misuse behaviors, open-source leaderboard), GuidedBench, StrongREJECT, and JADES, according to EmergentMind’s 2026 survey of jailbreak robustness benchmarks. These benchmarks are beginning to supplement ASR with rubric-based, decompositional, and multi-dimensional scoring, but ASR remains the number that gets quoted.

The TLO paper makes that practice harder to defend. If, as the paper demonstrates, attacks with nearly identical ASR can land at clearly different points on a calibrated diagnostic plane, then a leaderboard that ranks them by ASR alone is presenting a false equivalence. The ranking implies comparability; the underlying mechanics say otherwise.

This is not an argument for discarding ASR entirely. The paper itself positions TLO as complementary: ASR remains most informative when failures are rare. The problem is that leaderboards still lead with ASR as though it were sufficient on its own, and the current ecosystem has no standard mechanism for reporting the process-level structure that ASR discards.

What Red Teams Must Now Do Differently

The practical consequence is straightforward. Red teams and safety evaluators who rely on ASR as their primary metric cannot tell whether two models fail for the same reasons. Token-level logit traces across the full generation, not just a pass/fail verdict on the final output, are now necessary to make that distinction.

This raises the bar for closed-model evaluation. If full-vocabulary logit access is required and most inference APIs return only top-k outputs, then the models most in need of independent safety auditing are the ones least amenable to TLO-style diagnostics. Open-weight models can be audited with the full protocol. Closed models require either API changes that expose logit distributions or acceptance that the audit is incomplete.

The second-order consequence for the benchmark ecosystem: any leaderboard that reports a single ASR number without process-level diagnostics is making a comparability claim that the evidence does not support. Fixing that means either expanding API surface area to expose logit signals, or acknowledging in every leaderboard entry that the headline metric is mechanically underspecified. Neither option is trivial, and neither is optional if the benchmarks want to remain defensible.

Frequently Asked Questions

Which jailbreak methods did the TLO evaluation cover?

The paper validated TLO on Multi-Chain Mutation (MCM), Greedy Coordinate Gradient (GCG), and Direct Instruction (DI) attacks across Llama, Mistral, Qwen, and Gemma. These span optimization-based, prompt-rewriting, and direct-request strategies, but multi-turn jailbreaks, multimodal attacks, and encoded-prompt variants were not tested. Teams facing those attack classes should validate TLO independently before relying on its diagnostics.

Why did the fixed-lexicon approach struggle on Gemma specifically?

The paper does not isolate a root cause for Gemma’s lexicon mismatch, but the result implies that Gemma’s internal representation of compliance and refusal tokens differs enough from the other three models that a shared word list misses key signals. Teams deploying TLO across heterogeneous model fleets should budget for per-model lexicon calibration, which adds setup cost but likely improves the margin signal beyond what a universal list provides.

How does TLO’s detection logic differ from SelfGrader’s?

TLO computes a running margin between the top compliance-token logit and the top refusal-token logit at each decoding step, then maps the full trajectory onto a 2D plane for classification. SelfGrader takes a different path: it prompts the model to assign a harm score using digit tokens 0 through 9, then reads the logit distribution over those ten tokens as a self-assessment signal. Both exploit logit-level information that standard outputs discard, but TLO is a trajectory analysis across all tokens while SelfGrader is a single-step numerical judgment at a specific prompt point.

What would a TLO-aware jailbreak benchmark actually report?

Beyond a headline ASR percentage, a TLO-informed leaderboard entry would include RP-plane coordinates for each model and attack pair, the D_RP distance between models with similar ASR, and the dominant failure mode (suppressed refusal, late refusal, or no refusal triggered) for each condition. JailbreakBench’s current format records a single per-attack pass/fail per model; adding TLO dimensions would require a new results schema but would let practitioners distinguish whether two models ranked side by side share failure mechanics or merely share a count.
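A results schema along these lines could be sketched as follows. Every field name is hypothetical, the numbers are invented placeholders rather than reported results, and treating D_RP as a Euclidean distance on the RP plane is an assumption, since the paper's definition is not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaderboardEntry:
    model: str
    attack: str
    asr: float                  # headline attack success rate, 0.0 to 1.0
    rp_x: float                 # RP-plane coordinates (hypothetical field names)
    rp_y: float
    dominant_failure_mode: str  # "suppressed_refusal", "late_refusal", or "no_refusal_triggered"


def rp_distance(a: LeaderboardEntry, b: LeaderboardEntry) -> float:
    """Distance between two entries on the RP plane (assumed Euclidean)."""
    return math.hypot(a.rp_x - b.rp_x, a.rp_y - b.rp_y)


# Placeholder entries: same ASR, different positions and failure modes.
entry_a = LeaderboardEntry("model-a", "attack-x", 0.40, 0.0, 0.0, "late_refusal")
entry_b = LeaderboardEntry("model-b", "attack-x", 0.40, 3.0, 4.0, "no_refusal_triggered")
distance = rp_distance(entry_a, entry_b)
```

The two placeholder entries would tie on an ASR-only ranking, while the extra fields record that they sit apart on the plane and fail in different ways.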

Does TLO require storing full logit vectors for every generated token?

Not necessarily. Targeted lexicon-token log-probabilities (just the compliance and refusal keyword scores) are sufficient in place of full-vocabulary vectors. A lexicon of a few hundred terms is orders of magnitude smaller than a full vocabulary distribution, which reduces per-query storage and memory. Teams can start with the reduced signal and upgrade to full vectors if diagnostic resolution proves insufficient for their model or attack landscape.
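A back-of-the-envelope calculation shows the scale of the saving. The vocabulary size (128,000), lexicon size (300), generation length (256 tokens), and 4-byte floats are all assumed figures for illustration, not values from the paper.

```python
def trace_bytes(values_per_step: int, steps: int, bytes_per_value: int = 4) -> int:
    """Storage for one generation's trace, assuming fixed-width floats."""
    return values_per_step * steps * bytes_per_value


# Assumed sizes: 128,000-token vocabulary, 300-term lexicon, 256 generated tokens.
full_trace = trace_bytes(values_per_step=128_000, steps=256)  # 131,072,000 bytes
lexicon_trace = trace_bytes(values_per_step=300, steps=256)   # 307,200 bytes
ratio = full_trace / lexicon_trace
print(f"full: {full_trace:,} B, lexicon: {lexicon_trace:,} B, ratio: {ratio:.0f}x")
```

Under these assumptions the lexicon-only trace is roughly 427 times smaller per query, and the ratio depends only on vocabulary size divided by lexicon size.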

Sources (4 cited)

  1. JailbreakBench (primary source, accessed 2026-06-01)
  2. Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures (primary source, accessed 2026-06-01)
  3. SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits (primary source, accessed 2026-06-01)
  4. Jailbreak Robustness Benchmarks (analysis, accessed 2026-06-01)