Catching LLM Jailbreaks by Watching Per-Layer Entropy, Not Outputs

A June 2026 paper reports that jailbreaks perturb the per-layer entropy of frozen LLMs before any harmful token is emitted, but adaptive attackers will likely follow the defense one layer deeper.


A June 2026 preprint reports that jailbreak attempts leave a measurable trace in the entropy dynamics of an LLM’s intermediate layers, visible before any harmful token is generated. If the claim holds under pressure, it relocates jailbreak defense one layer deeper than output-side classifiers and refusal tuning, and reframes the open question from whether the signal can be detected to whether it can be detected against an attacker who already knows it is being watched.

Do jailbreaks leave a detectable signature before any harmful token is emitted?

The paper’s central claim is that they do, and that the signature lives in how predictive entropy evolves across token positions rather than in the entropy values themselves. arXiv:2606.25182, “What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics,” submitted 23 June 2026 and accepted at ECML PKDD 2026 with a short version at EIML@ICML 2026, applies the logit lens to a frozen LLM, projecting each intermediate layer’s hidden state through the unembedding matrix, and tracks token-level predictive entropy trajectories across layers. The authors report that jailbreak prompts show up as structured dynamics in that intermediate-layer uncertainty.
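To make the measured quantity concrete, here is a minimal sketch of a logit-lens entropy trajectory in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the names `logit_lens_entropy`, `entropy_trajectories`, `UNEMBED`, and `HIDDEN` are hypothetical, the matrices are toy values, and the final normalization step that real logit-lens code typically applies before unembedding is omitted.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability tokens contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def logit_lens_entropy(hidden, unembed):
    # Project one hidden state through the unembedding matrix
    # (one row per vocabulary token), then measure predictive entropy.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return entropy(softmax(logits))

def entropy_trajectories(hidden_states, unembed):
    # hidden_states[layer][position] is a hidden vector.
    # Returns one entropy-per-position trajectory for each layer.
    return [[logit_lens_entropy(h, unembed) for h in layer]
            for layer in hidden_states]

# Toy data (hypothetical): 3-token vocabulary, 2-dimensional hidden states.
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
HIDDEN = [
    [[0.0, 0.0], [0.1, 0.0]],  # layer 0, token positions 0 and 1
    [[2.0, 0.0], [4.0, 0.0]],  # layer 1, token positions 0 and 1
]
TRAJ = entropy_trajectories(HIDDEN, UNEMBED)
```

In this toy setup a zero hidden state yields a uniform distribution, so `TRAJ[0][0]` equals `math.log(3)`, and entropy falls as the hidden state commits more strongly to one token.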

One caveat: the public abstract describes the result qualitatively and gives no detection-rate numbers. That matters because the headline, “jailbreaks detectable from internal entropy,” is exactly the kind of claim that gets paraphrased into a number in the trade press. The honest summary is that the paper establishes a separation and a mechanism, not a deployable detection rate.

Why do intermediate layers see what the output layer misses?

The jailbreak-discriminating signal concentrates in mid-network representations and degrades by the final layer, according to the paper’s own measurements. The output head, where the model commits to surface tokens, is the least informative place to look.

This fits a pattern already established in mechanistic interpretability. Arditi et al. (arXiv:2406.11717) found that refusal is mediated by a single one-dimensional subspace across 13 open-source chat models up to 72B parameters, and demonstrated a white-box jailbreak that surgically disables refusal by editing that direction. That result is double-edged: internal geometry is both an attack surface and a detection surface. The entropy-dynamics paper leans on the defensive edge, arguing the trajectory is a passive fingerprint readable off a frozen model without any weight edits.
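The double-edged point can be illustrated with basic linear algebra. The sketch below assumes a refusal direction vector is already known (how such a direction is obtained is outside this example), and the helper names `projection` and `ablate` are hypothetical. Reading the projection is the passive, detection-side use; subtracting it is the editing, attack-side use.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    # Normalize a vector to unit length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def projection(hidden, direction):
    # Detection side: how strongly the hidden state points along the direction.
    return dot(hidden, unit(direction))

def ablate(hidden, direction):
    # Editing side: remove the component of the hidden state along the direction.
    r = unit(direction)
    coeff = dot(hidden, r)
    return [h - coeff * ri for h, ri in zip(hidden, r)]

# Toy vectors (hypothetical values, 2 dimensions for readability).
H = [3.0, 4.0]
R = [0.0, 2.0]
H_ABLATED = ablate(H, R)
```

After ablation the hidden state has zero component along the direction, which is why the same geometry that a monitor could read is also something a white-box attacker can erase.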

A natural reading of why the mid-layers carry the signal while the output head does not: refusal training pushes the output distribution toward safe completions, so a jailbreak that survives to emission has already reconciled itself with the refusal machinery at the top. The mid-layers, where the model is still representing intent rather than committing to tokens, are where the perturbation has not yet been smoothed away. That is an interpretation consistent with the paper’s finding, not a claim the abstract makes explicitly.

Why don’t average entropy statistics work?

Aggregate prompt-level entropy statistics, mean and variance, carry little discriminative signal, according to the paper; the informative features are rank-based trend scores that capture how entropy evolves across token positions.

This is the methodologically interesting move. A jailbreak prompt and a benign one can share nearly identical average uncertainty. What separates them is the shape of the trajectory: whether entropy climbs monotonically as the model leans into a harmful continuation, or stays flat. Monotonic rank-based trend scores capture that shape in a way moments of the distribution cannot. The choice of a rank-based rather than raw-valued feature is also a robustness choice: rank statistics depend on the ordering of uncertainty across positions, not on the absolute calibration of a layer’s logits, which varies across model families.
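The contrast between moments and trajectory shape can be sketched in a few lines. The paper's exact trend score is not specified here, so this example substitutes a Kendall-tau-style statistic against token position as a stand-in; `trend_score`, `RISING`, and `SHUFFLED` are hypothetical names and the entropy values are made up.

```python
import math
import statistics

def trend_score(values):
    # Kendall-tau-style score against position: +1 for a strictly rising
    # sequence, -1 for strictly falling, near 0 for no monotonic trend.
    n = len(values)
    if n < 2:
        return 0.0
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of each pairwise comparison
    return s / (n * (n - 1) / 2)

# Same entropy values in two different orders (hypothetical numbers).
RISING = [1.0, 1.5, 2.0, 2.5, 3.0]
SHUFFLED = [2.0, 3.0, 1.0, 2.5, 1.5]

same_mean = statistics.mean(RISING) == statistics.mean(SHUFFLED)
same_var = statistics.pvariance(RISING) == statistics.pvariance(SHUFFLED)

# Rank statistics ignore monotone rescaling of the raw values.
rescaled = [math.exp(v) for v in RISING]
scale_invariant = trend_score(rescaled) == trend_score(RISING)
```

The two sequences have identical mean and variance because they contain the same values, yet only `RISING` scores 1.0. The rescaling check shows why a rank-based feature does not depend on a layer's absolute calibration.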

A related paper takes a different structural route to the same problem. Kadali & Papalexakis (arXiv:2510.06594) analyze the internal representations of GPT-J and Mamba2, presenting preliminary findings on distinct layer-wise behaviors when hidden layers respond to jailbreak versus benign prompts. The authors frame it as an early exploration rather than a conclusive solution. The two papers agree on the headline, that internals carry the signal, and differ on the feature: structural analysis of hidden-layer representations versus entropy trajectories read through the logit lens.

Does the signal hold across different model families?

The paper reports the separation is consistent across Llama, Qwen, and Gemma, across multiple adversarial benchmarks, and without additional training. The training-free, architecture-consistent framing is the strongest part of the claim. A detector that needs no fine-tuning and transfers across three model families is a different proposition from a per-model probe that has to be refit for every checkpoint.

What “consistent” means in practice is the kind of detail the abstract does not settle. It could mean a single trend-score threshold separates jailbreak from benign traffic across all three families, or it could mean a family-specific threshold separates cleanly within each. The first is a deployable result; the second is still useful but imposes a per-family calibration step. The full tables are where that distinction lives.
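The two readings of “consistent” can be stated as a check. This sketch assumes jailbreak prompts score higher than benign ones and that scores are already computed; the function names, family labels, and numbers are all hypothetical.

```python
def separating_interval(benign, jailbreak):
    # Range of thresholds that perfectly splits the two groups, or None.
    lo, hi = max(benign), min(jailbreak)
    return (lo, hi) if lo < hi else None

def shared_interval(families):
    # Intersect the per-family intervals; None means no single threshold works.
    intervals = [separating_interval(b, j) for b, j in families.values()]
    if any(iv is None for iv in intervals):
        return None
    lo = max(iv[0] for iv in intervals)
    hi = min(iv[1] for iv in intervals)
    return (lo, hi) if lo < hi else None

# Hypothetical trend scores as (benign, jailbreak) per model family.
AGREE = {
    "family_a": ([0.10, 0.20], [0.60, 0.80]),
    "family_b": ([0.50, 0.55], [0.90, 0.95]),
}
DISAGREE = {
    "family_a": ([0.10, 0.20], [0.60, 0.80]),
    "family_b": ([0.50, 0.65], [0.90, 0.95]),
}
```

In `DISAGREE` each family separates cleanly on its own, but `shared_interval` returns `None`, which is the per-family calibration case described above.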

What happened to the last wave of internals-based defenses?

Every prior internals-based defense has degraded under attackers who optimize against the probe, and there is no published reason yet to assume entropy dynamics are exempt.

The clearest recent case is activation steering. A Tencent Cloud write-up summarizing arXiv:2605.24535 (“Steering Beyond the Support”) reports that supervised activation steering trained on GCG collapses on attacks it was not trained on: with GCG-only training, PAIR still reaches a reported 38.47% attack success rate and AutoDAN 35.78%. The same write-up reports that an unsupervised, adversarially trained steering field, when attacked by a steering-aware adaptive GCG that minimizes the defense’s own gradient norm, still leaves attack success at 15.54% on Mistral-v2-7B, against an undefended baseline of 63.34%. (Those figures come from the third-party summary of 2605.24535, not from the entropy-dynamics authors.)

The defense still helps: 15.54% is roughly a quarter of the undefended 63.34% baseline on Mistral-v2-7B. The contest does not end under adaptive pressure; it shifts to a new internal feature.
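Worked out from the two figures quoted above, the ratio and the implied relative reduction are:

```latex
\frac{15.54}{63.34} \approx 0.245,
\qquad
1 - \frac{15.54}{63.34} \approx 0.755
```

That is roughly a 75% relative reduction in attack success, with about one in four of the baseline's successful attacks still getting through.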

The related internals-based defenses are not interchangeable, and the differences matter for which caveat applies.

| Approach | What it probes | Setup | Reported result |
| --- | --- | --- | --- |
| Entropy dynamics | Per-layer predictive entropy trajectories via logit lens | Training-free | Mid-layer separation across Llama/Qwen/Gemma; qualitative, no public detection-rate numbers |
| Refusal direction | Single 1D refusal subspace | White-box edit | Disables refusal across 13 models up to 72B |
| Hidden-layer analysis | Internal representations of hidden layers | Preliminary study | Distinct layer-wise behaviors on GPT-J and Mamba2; no published detection metrics |
| TSSF | Harmfulness feature, layer-wise logit fusion | Attention realignment + layer-wise fusion | Restores the harmfulness feature that discriminates unsafe from safe inputs |
| AlphaSteer / ULDD | Activation steering field | Supervised, then adversarial | Collapses on unseen attacks; adaptive GCG still leaves ASR at 15.54% vs 63.34% baseline on Mistral-v2-7B |

Does this replace refusal training, or sit alongside it?

It sits alongside it, and downstream of it in the defense stack.

Refusal training through RLHF and DPO, plus output-side guardrail classifiers, is the current default. An entropy-trajectory monitor is a runtime check on the model’s own activations, closer to an intrusion-detection signal than a firewall rule. The ECNU TSSF framework (arXiv:2511.14423) makes a related argument. TSSF identifies a “harmfulness feature” that discriminates between unsafe and safe inputs, restores it via safety-aware attention realignment, and aggregates safety cues across layers via layer-wise logit fusion. The claim that the harmfulness feature is the more stable detection signal, because refusal can be trained around while the underlying harmfulness representation persists, is this article’s interpretation; TSSF’s abstract names the harmfulness feature but does not draw the refusal contrast explicitly. The entropy-dynamics paper and TSSF share the thesis that the output head is the wrong place to look, and differ on which internal feature to read.

The practical implication is that internals-based detection is additive. It catches the case where a jailbreak has been tuned to produce a benign-looking surface form that nonetheless required a perturbed internal trajectory to generate. It does not replace the cheaper, broader defenses; it targets the attacks that specifically defeat them.

What should practitioners do with this today?

Instrument it, do not ship it as a shield.

The training-free claim makes the instrumentation cost low. Log per-layer entropy trajectories on a sample of production traffic, fold in known-jailbreak prompts from public adversarial corpora, and check whether rank-based trend scores separate them from benign traffic on your specific model and deployment. The hard part is not collecting the signal; it is establishing your own false-positive baseline, which the public abstract does not provide and which will be specific to your traffic mix.
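Establishing that baseline can be sketched as follows, assuming trend scores have already been logged for benign traffic and for known jailbreak prompts. The function names and all score values are hypothetical, and a real deployment would need far more samples than this toy list.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    # Pick an observed benign score as the threshold so that the fraction
    # of benign scores strictly above it does not exceed target_fpr.
    ranked = sorted(benign_scores)
    allowed = min(int(target_fpr * len(ranked)), len(ranked) - 1)
    return ranked[len(ranked) - allowed - 1]

def rate_above(scores, threshold):
    # Fraction of scores that would be flagged at this threshold.
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical logged trend scores.
BENIGN = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.70]
KNOWN_JAILBREAKS = [0.30, 0.55, 0.60, 0.90]

THRESHOLD = threshold_at_fpr(BENIGN, target_fpr=0.10)
FALSE_POSITIVE_RATE = rate_above(BENIGN, THRESHOLD)
RECALL = rate_above(KNOWN_JAILBREAKS, THRESHOLD)
```

Fixing the false-positive budget first and reading recall second keeps the evaluation anchored to your own traffic mix rather than to the adversarial corpus.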

Until the adaptive-attacker question has an answer, the honest posture is to treat entropy-trajectory monitoring as a signal worth collecting and a research bet worth tracking, not a control you can gate production on. The most likely outcome, based on the track record of every internals-based defense before it, is that detection relocates the arms race one layer deeper rather than ending it. That is still a useful outcome. It is not the outcome the headline implies.

Frequently Asked Questions

Can entropy-dynamics monitoring protect closed-weight API models like GPT-4 or Claude?

No. The method applies the logit lens to a frozen model’s per-layer logits, which requires white-box access to intermediate activations. OpenAI and Anthropic expose only output tokens through their APIs, so the detector is restricted to open-weight families like Llama, Qwen, and Gemma where the full forward pass is observable.

How does the entropy-trajectory method differ from the earlier tensor-decomposition jailbreak work?

Kadali and Papalexakis (arXiv:2510.06594) apply CP tensor decomposition to GPT-J and Mamba2 hidden layers and report 5-fold cross-validated F1 scores, finding that Multi-Head Attention outputs separate jailbreak from benign traffic better than aggregated layer outputs. The entropy paper instead reads rank-based trend scores off logit-lens trajectories, a lighter-weight feature that ports across Llama, Qwen, and Gemma without a per-model tensor fit.

How advanced are current attacks at exploiting internal model signals?

Far enough to worry any internals-based defense. arXiv:2502.01633 (“Adversarial Reasoning at Jailbreaking Time”) reports 56% jailbreak success on OpenAI o1-preview and 100% on DeepSeek in a multi-shot transfer scenario, using the target model’s own loss signal as a step-wise verifier to guide test-time search. The same loss landscape the entropy detector would sit next to is already a working attack surface.

What compute cost does an entropy-trajectory monitor add to inference?

The logit lens projects each intermediate layer’s hidden state through the unembedding matrix to produce a per-layer token distribution, which is an extra matrix multiply per layer per token on top of the forward pass. On a 70B-class model with dozens of layers, sampling a subset of mid-layers rather than the full stack is the practical move, since the jailbreak signal lives away from the output head anyway.
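A back-of-envelope estimate shows why subsampling layers matters. The dimensions below are illustrative assumptions, not the specification of any particular model, and the forward-pass figure uses the common rule of thumb of about 2 FLOPs per parameter per token.

```python
def lens_flops_per_token(d_model, vocab_size, layers_probed):
    # One (vocab_size x d_model) matrix-vector product per probed layer,
    # counted as 2 FLOPs per multiply-add.
    return 2 * d_model * vocab_size * layers_probed

# Hypothetical dimensions for a large open-weight model.
D_MODEL = 8192
VOCAB_SIZE = 128_000
N_LAYERS = 80
N_PARAMS = 70_000_000_000

FULL_STACK = lens_flops_per_token(D_MODEL, VOCAB_SIZE, N_LAYERS)
MID_SUBSET = lens_flops_per_token(D_MODEL, VOCAB_SIZE, 8)

# Rule-of-thumb forward-pass cost per token (an approximation).
FORWARD_PASS = 2 * N_PARAMS

full_overhead = FULL_STACK / FORWARD_PASS
subset_overhead = MID_SUBSET / FORWARD_PASS
```

Under these assumed numbers, probing every layer costs about as much as the forward pass itself, while probing eight mid-layers adds roughly a tenth of that.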
