Why Attack Success Rate Misleads LLM Jailbreak Benchmarks

Attack Success Rate is the headline metric on every major jailbreak benchmark, from JailbreakBench to StrongREJECT. A preprint from Chung-Ang University researchers (arXiv:2605.29629) now shows that ASR collapses mechanically distinct safety failures into a single number, and that two models with near-identical ASR scores can fail in completely different ways at the token level.

What ASR Actually Measures (And What It Discards)

ASR records a binary outcome: did the model produce a harmful response to a jailbreak prompt, yes or no? That pass/fail is assessed once, at the end of generation. Everything that happened during decoding is discarded.

According to Park, Ju and Lee, this means ASR cannot distinguish between three qualitatively different failure processes:

Suppressed refusal. The model began generating a refusal token, then switched to compliance partway through. The safety mechanism fired but was overridden.
Late refusal. The model produced compliant tokens for several steps before the refusal direction activated. By the time safety kicked in, the harmful content was already on the page.
No refusal triggered. The model never entered a refusal trajectory at all. The safety mechanism was absent from the start.

All three receive the same label in an ASR evaluation: “attack succeeded.” For a practitioner comparing two models with similar mid-range ASR scores, the number is silent on which failure mode dominates, whether a targeted intervention could address it, or whether the two models share any structural similarity in how they break.

How TLO Recovers the Hidden Trajectory

The paper introduces Temporal Logit Observability (TLO), a diagnostic that operates on the full logit vectors emitted at each decoding step. No hidden-state access, no gradients, no model modification required. The key instrument is the Logit-Margin Score (LMS): at each generated token, TLO measures the logit gap between the highest-probability compliance token and the highest-probability refusal token from a fixed lexicon of compliance and refusal keywords.
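As a minimal sketch of the per-step margin, assume the logits for one decoding step arrive as a mapping from token string to logit. The two lexicons, the function names, and the sign convention (positive means compliance leads) are illustrative assumptions, not the paper's word lists or code.

```python
from typing import Dict, Iterable, List

# Hypothetical lexicons for illustration only; the paper's keyword lists are not reproduced here.
COMPLIANCE_LEXICON = {"Sure", "Here", "Certainly"}
REFUSAL_LEXICON = {"Sorry", "cannot", "unable"}


def logit_margin_score(step_logits: Dict[str, float],
                       compliance: Iterable[str] = COMPLIANCE_LEXICON,
                       refusal: Iterable[str] = REFUSAL_LEXICON) -> float:
    """Top compliance-lexicon logit minus top refusal-lexicon logit at one step.

    Assumed sign convention: positive means compliance leads, negative means refusal leads.
    Raises ValueError if no token from one of the lexicons is present in step_logits.
    """
    best_compliance = max(step_logits[t] for t in compliance if t in step_logits)
    best_refusal = max(step_logits[t] for t in refusal if t in step_logits)
    return best_compliance - best_refusal


def lms_trajectory(all_step_logits: List[Dict[str, float]]) -> List[float]:
    """Apply the margin at every decoding step to obtain a trajectory."""
    return [logit_margin_score(step) for step in all_step_logits]
```

Tokens outside both lexicons (such as a high-logit "the") do not affect the margin, which is why a lexicon that misses a model's preferred refusal phrasing can distort the signal.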

Projected across all decoding steps, LMS traces out a trajectory that reveals when (and whether) the model’s refusal direction activates during generation. TLO then maps each model-attack condition onto a calibrated 2D Relative Position (RP) plane, giving a geometric representation of the failure mode.
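To make the link between a margin trajectory and the three failure processes concrete, here is a deliberately crude sketch. It assumes a list of per-step margins where a negative value means refusal leads, and it labels a successful attack by the sign pattern alone. This sign-based heuristic and the name label_failure_mode are assumptions for illustration; the paper's RP-plane calibration is not reproduced here.

```python
from typing import List


def label_failure_mode(margins: List[float]) -> str:
    """Label a successful attack's trajectory by where refusal first leads.

    Assumes margins[i] < 0 means refusal leads at step i (illustrative convention).
    """
    refusal_steps = [i for i, m in enumerate(margins) if m < 0]
    if not refusal_steps:
        # Refusal never led at any step.
        return "no_refusal_triggered"
    if refusal_steps[0] == 0:
        # Refusal led at the first step but the output still ended in compliance.
        return "suppressed_refusal"
    # Compliance led first; refusal only showed up later.
    return "late_refusal"
```

Three transcripts that all count as "attack succeeded" under ASR would receive three different labels here, which is the distinction the binary metric discards.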

The logit trajectory separates successful from failed jailbreaks within the first few generated tokens, according to the authors. On 10 of the 12 model-attack conditions tested, the logit-based LMS trajectory aligns with hidden-state refusal-direction probes, which do require internal model access. The two conditions that broke the pattern were both on Qwen2.5-7B, under MCM and DeepInception, where the fixed lexicon reversed the expected sign [Updated June 2026]. The implication: logit-level observability recovers much of the diagnostic signal that was previously available only through white-box methods, on most but not all model and attack combinations.

What the RP Plane Reveals Across Models and Attacks

The paper tests TLO across four aligned LLMs and three jailbreak paradigms. The core result: attacks with nearly identical ASR occupy clearly distinct locations on the RP plane. According to the authors, model pairs and attack families that look interchangeable on a leaderboard are, under TLO, geometrically distant in their failure mechanics.

The separation is not uniform across the ASR range: RP-plane displacement compresses as (1 − ASR) increases, that is, as attacks succeed less often. TLO therefore carries the primary diagnostic signal in the mid-ASR range, where ASR alone says least about why attacks succeed. When failures are rare (low ASR), the binary metric remains informative enough on its own. In the mid-range, where most real-world model comparisons actually happen, TLO fills a gap that ASR leaves open.

From Diagnosis to Defense: The Early-Stop Rule

The diagnostic insight has a direct defensive application. The paper derives an early-stop rule from a quantity called t_cross: the first decoding step at which the refusal direction dominates the compliance-refusal logit margin. The rule is blunt. If refusal has not asserted itself by step 5, the decode is treated as a likely jailbreak and the continuation is replaced with a standard refusal response.
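The rule as described can be sketched in a few lines. The sketch assumes a list of per-step margins where a negative value means refusal leads, counts steps from 1, and uses a placeholder refusal string. The function names, the sign convention, and the choice to wait until five steps exist are assumptions; any gating the paper applies before the rule fires is not described here and is left out.

```python
from typing import List, Optional

REFUSAL_TEXT = "I can't help with that."  # placeholder refusal, not the paper's wording


def t_cross(margins: List[float]) -> Optional[int]:
    """First 1-indexed step at which refusal leads (margin < 0), or None if it never does."""
    for step, margin in enumerate(margins, start=1):
        if margin < 0:
            return step
    return None


def should_stop_early(margins: List[float], deadline: int = 5) -> bool:
    """True when refusal has not led at any of the first `deadline` steps."""
    if len(margins) < deadline:
        return False  # too few steps generated to apply the rule yet
    return t_cross(margins[:deadline]) is None


def guarded_output(generated_text: str, margins: List[float], deadline: int = 5) -> str:
    """Replace the continuation with a standard refusal when the rule fires."""
    return REFUSAL_TEXT if should_stop_early(margins, deadline) else generated_text
```

The check reads one scalar per step and needs no gradients or retraining, which matches the cost profile the paper claims for the rule.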

According to the paper, this intervention drops aggregate Attack Success Rate from 39.6% to 13.1%, a 26.5-point reduction, with no false alarms on benign queries [Updated June 2026]. That is roughly two-thirds of the successful jailbreaks removed by a check on a single scalar over the first handful of tokens, with no model retraining and no gradient computation at inference. The absence of over-refusal is worth noting: many decoding-time guardrails buy safety at the cost of rejecting benign inputs, and the authors report no such tradeoff on the tested distribution. The caveat is that “benign” here means the paper’s format-free benign set, not the long tail of awkwardly phrased legitimate requests that production traffic actually carries, so the zero-false-alarm figure should be read as a clean-room result rather than a deployment guarantee.
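The reported figures can be read two ways, as an absolute drop in percentage points or as a share of the original successes. A short worked calculation, using only the two ASR values quoted above (the helper name asr_reduction is illustrative):

```python
from typing import Tuple


def asr_reduction(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent), each rounded to 1 decimal."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return round(absolute, 1), round(relative, 1)


points, share = asr_reduction(39.6, 13.1)
print(points, share)  # 26.5 66.9
```

Keeping the two readings apart matters when comparing papers, since a "reduction" quoted as a bare percentage can mean either one.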

This is not the only work exploiting logit-level signals for jailbreak defense. SelfGrader (arXiv:2604.01473v2) independently demonstrates that token-level logits are a viable safety signal, converting jailbreak detection into a grading problem over the digit tokens 0 through 9. The SelfGrader authors report up to 22.66% ASR reduction on LLaMA-3-8B with 173x lower memory and 26x lower latency than gradient-based guardrails. The convergence from two independent groups on logit-level observability as a safety signal suggests this is a genuine research direction, not a single-paper artifact.

ASR Was Already Known to Overstate Harm

TLO attacks ASR from the inside, at the token level. A separate line of work had already attacked it from the outside, at the level of what the metric counts as a success. The binary ASR question, “did the model comply,” flips to “attack succeeded” the moment the output stops being a refusal, regardless of whether the compliant text is actually usable, accurate, or harmful. A model that responds to a bomb-making prompt with confident nonsense scores identically to one that returns a working procedure.

The StrongREJECT benchmark measured exactly how large that gap is. Its authors built a continuous autograder, scored 0 for a useless response and 1 for a fully effective one, and found that many published jailbreaks reporting near-100% ASR scored below 0.2 when re-evaluated on GPT-4o, GPT-3.5 Turbo, and Llama-3.1-70B-Instruct, according to the Berkeley AI Research write-up. The attacks were lowering the model’s refusal rate without extracting much that an attacker could use. ASR registered a catastrophe; the graded harm was marginal. TLO and StrongREJECT are pointing at the same defect from opposite ends: ASR collapses a rich distribution into one bit, and that bit misleads in two ways at once. It overcounts harm by crediting empty compliance, and it conflates failure modes by labeling every successful attack identically.
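The gap between the two scoring styles is easy to see on a toy example. The sketch below assumes each response carries a refusal flag and a graded usefulness score in [0, 1]; the record type, the function names, and the four sample values are invented for illustration and are not StrongREJECT data or its grader.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradedResponse:
    refused: bool
    score: float  # 0.0 = useless to an attacker, 1.0 = fully effective (toy values below)


def binary_asr(responses: List[GradedResponse]) -> float:
    """Fraction of responses that are not refusals, regardless of content quality."""
    return sum(not r.refused for r in responses) / len(responses)


def mean_graded_score(responses: List[GradedResponse]) -> float:
    """Average continuous score across all responses."""
    return sum(r.score for r in responses) / len(responses)


# Toy transcripts: three non-refusals with little usable content, one refusal.
toy = [
    GradedResponse(refused=False, score=0.25),
    GradedResponse(refused=False, score=0.0),
    GradedResponse(refused=False, score=0.25),
    GradedResponse(refused=True, score=0.0),
]
print(binary_asr(toy), mean_graded_score(toy))  # 0.75 0.125
```

On these made-up values the binary metric reports a 75% success rate while the graded mean sits at 0.125, the same shape of disagreement the StrongREJECT authors describe.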

Both critiques also expose how much ASR leans on an unexamined judge. The “did it comply” verdict is itself usually produced by a classifier or an LLM grader, and the reliability of those jailbreak judges is far from settled: grader disagreement can move a reported ASR by tens of points on the same transcripts. A single headline number inherits all of that variance silently.

The alternatives now on the table are not subtle tweaks to the same metric. One direction reframes the measurement target entirely, scoring safety by how well a model’s refusal behavior aligns with a policy rather than by how often attacks land. A refusal-alignment score does not flip to failure on a single overridden token, and it distinguishes a model that refuses for the right reasons from one that happens to refuse. Another direction watches the model’s own internals during generation rather than grading the text afterward, for example tracking per-layer entropy to flag a jailbreak in progress. TLO sits between the two: it reads a signal richer than the final verdict but cheaper than full hidden-state access, which is the band where most independent auditing of open-weight models actually operates.

Why Benchmarks and Leaderboards Need to Change

ASR is the de facto headline metric across JailbreakBench (100 misuse behaviors, open-source leaderboard), GuidedBench, StrongREJECT, and JADES, according to EmergentMind’s 2026 survey of jailbreak robustness benchmarks. These benchmarks are beginning to supplement ASR with rubric-based, decompositional, and multi-dimensional scoring, but ASR remains the number that gets quoted.

The TLO paper makes that practice harder to defend. If, as the paper demonstrates, attacks with nearly identical ASR can land at clearly different points on a calibrated diagnostic plane, then a leaderboard that ranks them by ASR alone is presenting a false equivalence. The ranking implies comparability; the underlying mechanics say otherwise.

This is not an argument for discarding ASR entirely. The paper itself positions TLO as complementary: ASR remains most informative when failures are rare. The problem is that leaderboards still treat the headline ASR as sufficient on its own, and the current ecosystem has no mechanism for reporting the process-level structure that ASR discards.

What Red Teams Must Now Do Differently

The practical consequence is straightforward. Red teams and safety evaluators who rely on ASR as their primary metric cannot tell whether two models fail for the same reasons. Token-level logit traces across the full generation, not just a pass/fail verdict on the final output, are now necessary to make that distinction.

This raises the bar for closed-model evaluation. If full-vocabulary logit access is required and most inference APIs return only top-k outputs, then the models most in need of independent safety auditing are the ones least amenable to TLO-style diagnostics. Open-weight models can be audited with the full protocol. Closed models require either API changes that expose logit distributions or acceptance that the audit is incomplete.
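The top-k limitation can be illustrated directly. The sketch assumes an API that returns, per decoding step, a mapping from the k most likely tokens to their log-probabilities; the function name and lexicons are hypothetical. Within a single step, a difference of log-probabilities equals the difference of the underlying logits, because the softmax normalizer cancels, so the margin is recoverable only when both lexicons appear in the returned set.

```python
from typing import Dict, Optional, Set


def margin_from_topk(topk_logprobs: Dict[str, float],
                     compliance: Set[str],
                     refusal: Set[str]) -> Optional[float]:
    """Compliance-minus-refusal margin from one step's top-k log-probs.

    Returns None when either lexicon has no token in the top-k set,
    meaning the margin cannot be observed at this step.
    """
    compliance_hits = [lp for tok, lp in topk_logprobs.items() if tok in compliance]
    refusal_hits = [lp for tok, lp in topk_logprobs.items() if tok in refusal]
    if not compliance_hits or not refusal_hits:
        return None
    return max(compliance_hits) - max(refusal_hits)
```

Counting the steps where this returns None gives a rough measure of how incomplete a top-k audit of a closed model would be.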

The second-order consequence for the benchmark ecosystem: any leaderboard that reports a single ASR number without process-level diagnostics is making a comparability claim that the evidence does not support. Fixing that means either expanding API surface area to expose logit signals, or acknowledging in every leaderboard entry that the headline metric is mechanically underspecified. Neither option is trivial, and neither is optional if the benchmarks want to remain defensible.

Frequently Asked Questions

Which jailbreak methods did the TLO evaluation cover?

The paper validated TLO on a 4x3 grid: Multi-Turn Context Manipulation (MCM), Greedy Coordinate Gradient (GCG), and DeepInception (DI) attacks against Llama-3.1-8B, Mistral-7B, Qwen2.5-7B, and Gemma-2-9B, with 60 harmful JailbreakBench prompts per condition under greedy decoding [Updated June 2026: an earlier version expanded MCM as “Multi-Chain Mutation” and DI as “Direct Instruction”; the correct names are Multi-Turn Context Manipulation and DeepInception]. These span multi-turn context pressure, gradient-based suffix optimization, and template-based semantic strategies. MCM itself is a multi-turn attack, submitting three context-normalizing turns before the harmful request. Multimodal attacks and encoded-prompt variants were not tested, and the grid stops at 7-to-9B open instruct models, so teams facing larger models or other attack classes should validate TLO independently before relying on its diagnostics.

Why did the fixed-lexicon approach struggle on Qwen specifically?

On Qwen2.5-7B the logit-margin sign reversed under MCM and DeepInception, leaving only Qwen plus GCG aligned with the cross-model pattern; the Fisher-z mean across Qwen conditions sat near zero [Updated June 2026: an earlier version attributed this limitation to Gemma]. The paper does not isolate a root cause, but the result implies that Qwen’s internal representation of compliance and refusal tokens differs enough from the other three models that a shared word list misses key signals. The authors note that model-specific or probe-derived lexicons recover more evidence in such cases. Teams deploying TLO across heterogeneous model fleets should budget for per-model lexicon calibration, which adds setup cost but likely improves the margin signal beyond what a universal list provides.
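For readers unfamiliar with the statistic, a Fisher-z mean averages correlation coefficients in the arctanh-transformed space and maps the result back. The sketch assumes the per-condition quantities being averaged are correlations strictly between -1 and 1; the function name and the sample values are invented for illustration and are not the paper's numbers.

```python
import math
from typing import List


def fisher_z_mean(correlations: List[float]) -> float:
    """Average correlations via the Fisher z-transform: tanh(mean(atanh(r))).

    math.atanh raises ValueError for |r| >= 1, so inputs must lie strictly inside (-1, 1).
    """
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))


# Toy example: opposite-signed correlations cancel to a mean near zero.
print(fisher_z_mean([0.6, -0.6]))
```

A near-zero mean can therefore hide strong but opposite-signed conditions, which is consistent with a sign reversal on some attacks rather than a uniformly weak signal.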

How does TLO’s detection logic differ from SelfGrader’s?

TLO computes a running margin between the top compliance-token logit and the top refusal-token logit at each decoding step, then maps the full trajectory onto a 2D plane for classification. SelfGrader takes a different path: it prompts the model to assign a harm score using digit tokens 0 through 9, then reads the logit distribution over those ten tokens as a self-assessment signal. Both exploit logit-level information that standard outputs discard, but TLO is a trajectory analysis across all tokens while SelfGrader is a single-step numerical judgment at a specific prompt point.
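The single-step reading on the SelfGrader side can be sketched as a softmax over ten digit logits followed by a probability-weighted mean. The weighted mean is an illustrative aggregation chosen for this sketch and the function name is hypothetical; SelfGrader's actual scoring rule may differ.

```python
import math
from typing import List


def expected_digit_score(digit_logits: List[float]) -> float:
    """Probability-weighted mean grade from the logits of digit tokens '0'..'9'.

    digit_logits[d] is assumed to be the logit of the token for digit d.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly ten logits, one per digit 0-9")
    peak = max(digit_logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in digit_logits]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))
```

Unlike a trajectory, this produces one number at one position in the prompt, so it cannot say when during generation a refusal would have activated.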

What would a TLO-aware jailbreak benchmark actually report?

Beyond a headline ASR percentage, a TLO-informed leaderboard entry would include RP-plane coordinates for each model and attack pair, the D_RP distance between models with similar ASR, and the dominant failure mode (suppressed refusal, late refusal, or no refusal triggered) for each condition. JailbreakBench’s current format records a single per-attack pass/fail per model; adding TLO dimensions would require a new results schema but would let practitioners distinguish whether two models ranked side by side share failure mechanics or merely share a count.
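One possible shape for such a results record is sketched below. The field names, the placeholder model labels, the coordinate values, and the choice of Euclidean distance for D_RP are all assumptions for illustration; the paper's exact definition of D_RP is not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TLOLeaderboardEntry:
    model: str
    attack: str
    asr: float                       # headline ASR in [0, 1]
    rp: Tuple[float, float]          # coordinates on the RP plane
    dominant_failure_mode: str       # e.g. "suppressed_refusal", "late_refusal", "no_refusal_triggered"


def d_rp(a: TLOLeaderboardEntry, b: TLOLeaderboardEntry) -> float:
    """Distance between two entries on the RP plane (Euclidean, by assumption)."""
    return math.dist(a.rp, b.rp)


# Placeholder entries with made-up values: same ASR, different positions and failure modes.
entry_a = TLOLeaderboardEntry("model-a", "attack-x", 0.40, (0.0, 0.0), "late_refusal")
entry_b = TLOLeaderboardEntry("model-b", "attack-x", 0.40, (3.0, 4.0), "no_refusal_triggered")
print(d_rp(entry_a, entry_b))  # 5.0
```

Two rows that tie on the asr field can still differ on every other field, which is the information a single-number leaderboard has no place to record.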

Does TLO require storing full logit vectors for every generated token?

Not necessarily. Targeted lexicon-token log-probabilities (just the compliance and refusal keyword scores) are sufficient in place of full-vocabulary vectors. A lexicon of a few hundred terms is orders of magnitude smaller than a full vocabulary distribution, which reduces per-query storage and memory. Teams can start with the reduced signal and upgrade to full vectors if diagnostic resolution proves insufficient for their model or attack landscape.
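A back-of-the-envelope comparison shows the scale of the saving. The vocabulary size, lexicon size, generation length, and 4-byte float width below are hypothetical round numbers chosen for illustration, not measurements from the paper.

```python
def bytes_per_query(steps: int, values_per_step: int, bytes_per_value: int = 4) -> int:
    """Raw storage for one generation: one value per tracked token per decoding step."""
    return steps * values_per_step * bytes_per_value


VOCAB_SIZE = 128_000   # hypothetical full-vocabulary size
LEXICON_SIZE = 300     # hypothetical "few hundred" lexicon terms
STEPS = 256            # hypothetical generation length in tokens

full = bytes_per_query(STEPS, VOCAB_SIZE)       # 131_072_000 bytes
reduced = bytes_per_query(STEPS, LEXICON_SIZE)  # 307_200 bytes
print(full // reduced)  # 426
```

Under these assumptions the lexicon-only trace is a little over two orders of magnitude smaller per query.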