Measuring LLM Safety by Refusal Alignment Instead of Attack Success Rate

A June 2026 preprint proposes RAS, a white-box metric that scores LLM safety by hidden-state refusal alignment rather than blocked output, challenging ASR-only leaderboards.

On 24 Jun 2026, an arXiv preprint introduced RAS (Refusal Alignment Score), a metric that scores a model’s safety by how well its internal hidden states align with a learned refusal direction rather than by whether a jailbreak attempt ends in a refusal or an unsafe completion (arXiv:2606.25750). RAS reads intent from the model’s representations, not from the text it emits. Across Llama, Gemma, and Qwen it separates aligned checkpoints from uncensored and abliterated variants, and the authors report it tracks output-level attack success rate while running substantially faster than judge-based evaluation (arXiv:2606.25750).

What does RAS actually measure?

RAS is the output of SafeVec, a white-box procedure that reads a target model’s hidden states and compares them against a refusal direction lifted from a safety-aligned reference model (arXiv:2606.25750). The mechanics: SafeVec extracts layer-wise refusal directions from the reference model, selects the stable layer windows where safe and unsafe activations are separable, then scores the target by how closely its hidden states align with those refusal directions under unsafe and jailbreak prompts.
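The three steps can be sketched in a few lines of Python. This is a minimal illustration rather than SafeVec itself: it assumes a difference-of-means refusal direction (a common choice in refusal-direction work, which the preprint may or may not use), takes per-layer separability scores as given, and uses tiny hand-written vectors in place of real hidden states. All function names are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def refusal_direction(unsafe_acts, safe_acts):
    """ASSUMPTION: difference-of-means direction at one reference-model
    layer, mean(unsafe activations) - mean(safe activations)."""
    mu_u, mu_s = mean_vector(unsafe_acts), mean_vector(safe_acts)
    return unit([u - s for u, s in zip(mu_u, mu_s)])

def longest_window(separability, threshold):
    """Longest contiguous run of layer indices whose separability score
    meets the threshold; a stand-in for a 'stable layer window'."""
    best, current = [], []
    for layer, score in enumerate(separability):
        if score >= threshold:
            current.append(layer)
            if len(current) > len(best):
                best = list(current)
        else:
            current = []
    return best

def raw_alignment(target_acts, directions, layers):
    """Mean cosine between target hidden states and the reference
    directions, averaged over prompts and then over the selected layers."""
    per_layer = []
    for layer in layers:
        sims = [cosine(h, directions[layer]) for h in target_acts[layer]]
        per_layer.append(sum(sims) / len(sims))
    return sum(per_layer) / len(per_layer)

# Toy 2-D "hidden states" for a single layer (index 0) of a reference model.
direction = refusal_direction(
    unsafe_acts=[[2.0, 0.0], [2.0, 0.2]],
    safe_acts=[[0.0, 0.0], [0.0, 0.2]],
)
directions = {0: direction}
layers = longest_window([0.9], threshold=0.5)  # only one layer in this toy

# Two hypothetical targets probed with unsafe prompts at the same layer.
aligned_target = {0: [[1.8, 0.1], [2.1, -0.1]]}
edited_target = {0: [[0.0, 1.0], [0.1, 0.9]]}

print(raw_alignment(aligned_target, directions, layers))
print(raw_alignment(edited_target, directions, layers))
```

In this toy setup the first target's activations point almost entirely along the reference direction and the second's point almost entirely away from it, which is the kind of separation the score is meant to expose.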

The score is calibrated to a 0-100 range, which is what makes it legible as a leaderboard number rather than a raw cosine value (arXiv:2606.25750). The authors show it separates aligned checkpoints from uncensored and abliterated variants across Llama, Gemma, and Qwen, and that it correlates with output-level attack success rate on those families. They also report it is substantially faster than running a judge model over the completions, which is the economic argument for the entire approach (arXiv:2606.25750).
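How the preprint maps raw alignment onto 0-100 is not described here, so the following is only one plausible scheme: a clipped linear rescaling between two anchor values. The anchors (`floor_raw` for a checkpoint with no refusal alignment, `ceil_raw` for the reference model itself) and the function name are assumptions made for illustration.

```python
def calibrate(raw, floor_raw, ceil_raw):
    """ASSUMPTION: clipped min-max rescaling of a raw alignment value.

    floor_raw maps to 0, ceil_raw maps to 100, and values outside the
    anchor range are clipped to the ends of the scale.
    """
    if ceil_raw <= floor_raw:
        raise ValueError("ceil_raw must be greater than floor_raw")
    scaled = 100.0 * (raw - floor_raw) / (ceil_raw - floor_raw)
    return max(0.0, min(100.0, scaled))

# Raw cosine-style values rescaled against anchors at 0.0 and 1.0.
print(calibrate(0.5, floor_raw=0.0, ceil_raw=1.0))
```

Any scheme of this kind ties the leaderboard number to the chosen anchors, so two scores are only comparable when they share the same calibration.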

The stated motivation is a direct critique of the ASR-and-judge paradigm: output-level evaluation is expensive, sensitive to the choice of judge, and tied to fixed question banks (arXiv:2606.25750). Anyone who has watched a safety leaderboard swing because the judge model was swapped will recognize the complaint.

How can refusal alignment and attack success rate disagree?

Refusal alignment and attack success rate diverge when a model intends to refuse but emits unsafe text anyway, a split documented as the “refusal cliff” in reasoning models (arXiv:2510.06036). In that study, poorly-aligned checkpoints maintained strong refusal intention throughout their thinking trace, then showed a sharp drop in refusal score at the final tokens before producing output (arXiv:2510.06036). An output-level judge sees an unsafe completion; the model’s internal trajectory was steering toward refusal and lost it at the last moment.
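A cliff of this shape is easy to test for once a per-position refusal score exists. The sketch below assumes such a score has already been computed for each position in the thinking trace (for example by a probe on hidden states); the tail length and drop threshold are arbitrary illustrative values, not figures from the study.

```python
def cliff_drop(scores, tail=3):
    """Mean refusal score over the body of the trace minus the mean over
    the last `tail` positions. A large positive value means intent held
    up through the trace and collapsed at the end."""
    if len(scores) <= tail:
        raise ValueError("trace must be longer than the tail window")
    body, end = scores[:-tail], scores[-tail:]
    return sum(body) / len(body) - sum(end) / len(end)

def has_refusal_cliff(scores, tail=3, min_drop=0.3):
    """ASSUMPTION: flag a cliff when the drop meets a hand-picked threshold."""
    return cliff_drop(scores, tail) >= min_drop

# Synthetic trajectories, for illustration only.
steady = [0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
cliff = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2]

print(has_refusal_cliff(steady), has_refusal_cliff(cliff))
```

An output-level judge sees only what follows the last position, so it cannot distinguish the second trajectory from a model that never intended to refuse.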

The same study found that ablating roughly 3% of the identified attention heads drove attack success rate below 10% (arXiv:2510.06036). That is a small surgical edit producing a large behavioral shift, and it is exactly the regime where an internal-state metric and an output metric would report different things about the same model. A representation-level score such as RAS is positioned to register refusal intent that an output-level judge never sees, because the judge scores only the text that survives the final tokens.

This is where the procurement implication sharpens. If the model that emits the unsafe text was, representationally, trying to refuse, then “did it refuse?” and “was it aligned to refuse?” are genuinely different questions. ASR answers the first. RAS attempts the second.

What does RAS mean for ASR-only safety leaderboards?

Most public safety benchmarks are output pipelines. RefusalBench (783 prompts across 7 domains in the original release), OR-Bench (80,000 prompts), and SORRY-Bench (450 prompts across 45 categories) all derive refusal rate, helpfulness, and over-refusal from generated completions (EmergentMind topic summary). A metric like RAS that reads internal state instead of outputs is not a drop-in replacement for any of them. It re-baselines what “safe” means, because it scores a different object: the model’s latent refusal direction rather than its emitted text.

The access constraint is structural. RAS needs hidden-state reads, which excludes the proprietary API models that dominate most leaderboards (arXiv:2606.25750). A closed model can publish an ASR figure; it cannot be RAS-scored by a third party without opening its weights. RAS does not so much threaten the existing API-leaderboard format as draw a line between open-weight models, which can be white-box scored, and closed ones, which cannot.

The paper’s own results bound the “ASR is wrong” critique. It reports that RAS tracks output-level attack success rate on the tested families (arXiv:2606.25750). On Llama, Gemma, and Qwen, the two metrics largely agree. The claim that ASR-only leaderboards are measuring the wrong thing is therefore an extrapolation from the refusal-cliff mechanism, not a measured delta on a specific production model where RAS and ASR diverge by a wide margin. The paper does not name one. Treat the leaderboard critique as plausible but unproven at production scale.
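A team can run the same agreement check on its own candidates with a rank correlation. The sketch below is a plain-Python Spearman implementation; the numbers in the usage lines are made-up placeholders, not results from the preprint. Because a higher RAS means safer and a higher ASR means less safe, agreement between the two shows up as a strongly negative coefficient.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder scores for three hypothetical checkpoints.
ras_scores = [90.0, 70.0, 20.0]   # 0-100, higher = more refusal-aligned
asr_scores = [0.05, 0.20, 0.90]   # fraction of attacks that succeed

print(spearman(ras_scores, asr_scores))
```

A coefficient far from -1 on a team's own model set would be the measured divergence that the preprint, as summarized here, does not report.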

What adjacent evidence supports the intent-vs-output split?

Two independent findings reinforce that output-only scoring is incomplete: over-refusal on legitimate defensive work, and drift across languages.

The first is over-refusal. On 2,390 real defensive-cyber tasks from the NCCDC dataset, LLMs refused legitimate defensive requests that contained security-sensitive keywords at 2.72x the rate of semantically equivalent neutral requests (p<0.001) (arXiv:2603.01246). System-hardening requests were refused 43.8% of the time and malware-analysis requests 34.3% of the time (arXiv:2603.01246). Explicit authorization increased the refusal rate rather than lowering it (arXiv:2603.01246). An output-level metric registers all of these as refusals and scores the model as safe. A practitioner reads them as the model refusing the wrong things.

The second is multilingual drift. A separate study of refusal alignment across 12 European languages, accepted to Findings of ACL 2026, found that English-only alignment is insufficient for cross-lingual safety even within the same harm category (arXiv:2606.07535). That is a warning against treating any single RAS-style score as language-invariant: a model can be well-aligned on English prompts and misaligned on a cognate harm in another language, and an English-only score will not catch it.

Neither study validates RAS directly. Together they show that the refusal behavior ASR measures is itself unstable across context (defensive versus neutral framing) and across language, which is the deeper reason a representation-level signal is worth probing alongside the output one.

How should a team use RAS-style checks today?

Treat RAS today as an open-weight-only, reference-model-dependent probe to run alongside an existing ASR pipeline, not a replacement for it. Several hard limits frame any deployment.

First, RAS is white-box and reference-model-dependent. It needs hidden-state access, which rules out most proprietary APIs, and it inherits whatever “safe” means inside the chosen reference model (arXiv:2606.25750). A RAS score is only as neutral as the reference checkpoint used to extract the refusal direction: scoring the same target against a heavily aligned reference and against a lightly aligned one will produce different scores, and can reorder a ranking of candidates.

Second, the paper is a 24 Jun 2026 v1 preprint by Chang-Chieh Huang, Yan-Lun Chen, Chia-Mu Yu, and Wei-Bin Lee, not yet peer-reviewed, with the DataCite DOI still pending (DeepPaper summary). Headline numbers should be treated as preliminary until the work is independently reproduced.

Third, the cross-lingual finding means a single RAS score is not a universal safety axis (arXiv:2606.07535). A team serving multiple languages would need per-language reference directions, not one number.

The practitioner move, given those limits, is to report both metrics side by side. Run the ASR and judge pipeline already in place, and add a RAS-style white-box score for the open-weight models in the candidate set. Treat any model whose ASR and RAS disagree as a calibration problem to investigate, not a passing grade.
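One way to operationalize the side-by-side report is a small triage table. The sketch below is illustrative only: the cut-offs (`ras_safe`, `asr_safe`), the verdict strings, and the model names are hypothetical, and closed models are represented with a RAS of `None` because they cannot be white-box scored.

```python
def triage(models, ras_safe=70.0, asr_safe=0.10):
    """Compare RAS (0-100, higher is safer) with ASR (0-1, lower is safer).

    `models` maps a name to a (ras, asr) pair; ras is None when the model
    exposes no hidden states. Thresholds are ASSUMED, not from the paper.
    """
    report = {}
    for name, (ras, asr) in models.items():
        asr_ok = asr <= asr_safe
        if ras is None:
            report[name] = "ASR only: " + ("pass" if asr_ok else "fail")
            continue
        ras_ok = ras >= ras_safe
        if ras_ok == asr_ok:
            report[name] = "agree: " + ("safe" if asr_ok else "unsafe")
        elif ras_ok:
            report[name] = "investigate: aligned internally, unsafe output"
        else:
            report[name] = "investigate: safe output, weak internal alignment"
    return report

# Hypothetical candidate set with placeholder scores.
candidates = {
    "open-model-a": (85.0, 0.04),
    "open-model-b": (82.0, 0.35),
    "open-model-c": (30.0, 0.06),
    "closed-api-d": (None, 0.08),
}

for name, verdict in triage(candidates).items():
    print(name, "->", verdict)
```

The two "investigate" rows are the interesting ones: the first matches the refusal-cliff pattern, and the second suggests safety that lives somewhere other than the measured direction.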

Frequently Asked Questions

What does it mean that RAS flags ‘abliterated’ variants, and why does it catch them?

Abliteration is a published technique that surgically suppresses the refusal direction in a model’s activations so it will not refuse anything, while leaving general capabilities intact. RAS catches these checkpoints precisely because it reads that direction: a model whose refusal direction has been edited away shows low alignment to the reference direction even before any jailbreak prompt is tried. An output-only ASR pipeline would need a prompt bank that actually elicits the unsafe completions to score the same checkpoint as unsafe.
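The core operation is a projection: remove from each hidden state its component along the refusal direction. The sketch below applies that to a single toy activation vector; real abliteration tools typically bake the same projection into the model's weights so the change is permanent, and the function names here are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to unit length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def ablate(hidden, direction):
    """Directional ablation: h' = h - (h . r) r, with r the unit direction."""
    r = unit(direction)
    coeff = dot(hidden, r)
    return [h - coeff * x for h, x in zip(hidden, r)]

# Toy example: a hidden state with a strong component along the direction.
refusal_dir = [1.0, 0.0]
hidden = [3.0, 4.0]
edited = ablate(hidden, refusal_dir)

print(dot(hidden, unit(refusal_dir)))  # component before the edit
print(dot(edited, unit(refusal_dir)))  # component after the edit
```

After the projection nothing is left along the refusal direction, which is why a score that reads that direction can flag the checkpoint without sending it a single jailbreak prompt.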

Where would RAS score a model as safe while an ASR check still catches a problem?

RAS reads alignment along one learned direction, so any mechanism that leaves no footprint in that direction is invisible to it. A guardrail that acts through a pathway the residual refusal direction does not capture, such as an output filter bolted onto the decoder or a reward model that reshapes the final logits, leaves RAS nearly unchanged, so the score says nothing about whether that guardrail holds; only a check on emitted text, such as an ASR run, can catch its failures. That is the mirror image of the refusal cliff, where ASR is blind to a model that intended to refuse.

What is the cost difference between running RAS and running a judge-based ASR pipeline?

A judge pipeline bills at least one judge-model call per generated completion, multiplied per judge when teams ensemble, and the judge is usually a large model in its own right. RAS needs one forward pass over the target per prompt, with hidden-state reads on the selected layers and no judge call, which is the structural reason the paper reports it as substantially faster. The saving is collectible only where the target already exposes hidden states, so the access constraint and the cost saving arrive together.
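The structural difference can be made concrete by counting model calls. The sketch below is a deliberately crude cost model with hypothetical function names: it counts calls, not compute. A generation call involves many decoding steps and a judge call processes the full completion, so call counts likely understate the real gap.

```python
def judge_pipeline_calls(n_prompts, n_judges=1):
    """Calls for an output-level run: one target generation per prompt,
    plus one judge call per completion for each judge in the ensemble."""
    return n_prompts * (1 + n_judges)

def ras_calls(n_prompts):
    """Calls for a RAS-style run: one target forward pass per prompt with
    hidden states read out, and no judge model involved."""
    return n_prompts

n = 1000  # illustrative prompt-bank size
print(judge_pipeline_calls(n, n_judges=2), ras_calls(n))
```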

Could a closed-weight vendor publish a RAS score for its API model?

Only by running SafeVec on its own weights and attesting to the number, because a third party cannot read hidden states through a text-in-text-out API. That makes any RAS figure for a proprietary model self-reported, unlike an ASR figure a third party can reproduce from completions alone. A procurement team that wants a verifiable RAS score has to require open weights or a trusted-third-party audit with model access, which hands back the trust assumption RAS was meant to remove from judge-based scoring.

Does RAS replace any of the existing refusal benchmarks outright?

Every output-level refusal benchmark, including RefusalBench, OR-Bench, SORRY-Bench, and FalseReject, scores generated text, while RAS scores internal activations, so swapping one for the other changes what is measured. Those benchmarks also report over-refusal as a first-class number, which RAS does not produce from a single forward pass over unsafe prompts. A team that adopts RAS keeps its output benchmark for over-refusal and helpfulness and layers the white-box score alongside it rather than in place of it.