Which Layer Detects LLM Hallucinations Best? The Case Against Fixed-Layer Probes

Most hidden-state hallucination detectors read from a single layer of a transformer, typically the last or middle one, and treat that choice as a constant. A paper accepted at ICML 2026 argues this is wrong: the layer that carries the strongest truthfulness signal shifts depending on the model, the dataset, and the task, and hard-coding a fixed layer leaves detectable errors on the table. The authors propose an automated, training-free method for picking the right layer per checkpoint.

Why your probe is reading the wrong layer

The idea behind hidden-state hallucination detection is straightforward: train a lightweight classifier on a model’s internal activations and let it flag outputs likely to be fabricated. The approach dates to work by Azaria & Mitchell (2023) and was extended by Orgad et al. (2025), but both lines of work either selected a predetermined layer (middle or final) or swept a sparse grid of layers without a principled selection criterion.

The paper’s authors, Wang et al., report that the optimal layer for detection varies substantially across architectures and datasets. A layer that carries high signal for question-answering hallucinations in LLaMA-3.1-8B-Instruct may be nearly useless for summarization hallucinations in Mistral-7B-Instruct-v0.3. The practical consequence: a detector calibrated on one model family and ported to another with the same layer index is likely operating below its ceiling. The size of that gap is the authors’ reported figure; the underlying per-layer tables were not independently verified.

What FEPoID does

The paper’s proposed solution is called FEPoID: First Effective Peak of Intrinsic Dimension. The method is training-free and works as follows:

1. Compute the intrinsic dimension of the hidden-state representations at each layer of the model, using a standard estimator applied to a representative sample of inputs.
2. Scan layers from input to output and identify the first peak in the intrinsic-dimension curve.
3. Use that layer’s hidden states as the input to the hallucination classifier.
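The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors’ released code: the article does not say which intrinsic-dimension estimator the paper uses or how it defines an “effective” peak, so the sketch swaps in the TwoNN maximum-likelihood estimator and a simple prominence threshold. The function names, the `min_prominence` parameter, and the synthetic data are all hypothetical.

```python
import math
import random


def twonn_intrinsic_dimension(points):
    """Estimate intrinsic dimension with the TwoNN maximum-likelihood formula.

    For each point, mu = r2 / r1 is the ratio of its second- to first-nearest-
    neighbour distance; the estimate is N / sum(log(mu)).
    ASSUMPTION: TwoNN stands in for the paper's unspecified estimator.
    """
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0 and math.isfinite(r2):  # skip exact duplicates
            log_ratios.append(math.log(r2 / r1))
    total = sum(log_ratios)
    if total <= 0:
        raise ValueError("not enough distinct points to estimate dimension")
    return len(log_ratios) / total


def first_effective_peak(id_curve, min_prominence=0.05):
    """Return the index of the first local maximum that stands out.

    ASSUMPTION: "effective" is read here as a local maximum that rises above
    the lowest earlier value by at least min_prominence * (curve range).
    """
    span = max(id_curve) - min(id_curve)
    for i in range(1, len(id_curve) - 1):
        is_peak = id_curve[i] > id_curve[i - 1] and id_curve[i] >= id_curve[i + 1]
        rise = id_curve[i] - min(id_curve[:i])
        if is_peak and rise >= min_prominence * span:
            return i
    # Fallback when no interior peak qualifies: use the global maximum.
    return max(range(len(id_curve)), key=id_curve.__getitem__)


def select_layer(states_by_layer, min_prominence=0.05):
    """Return (layer_index, id_curve) for a list of per-layer point clouds."""
    id_curve = [twonn_intrinsic_dimension(pts) for pts in states_by_layer]
    return first_effective_peak(id_curve, min_prominence), id_curve


def synthetic_layer(rng, intrinsic_dim, ambient_dim=8, n=200):
    """Toy stand-in for one layer's hidden states: Gaussian data confined to
    the first `intrinsic_dim` coordinates of an `ambient_dim`-dimensional space."""
    return [
        [rng.gauss(0, 1) if k < intrinsic_dim else 0.0 for k in range(ambient_dim)]
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [synthetic_layer(rng, d) for d in (1, 3, 6, 3, 1)]
    layer, curve = select_layer(layers)
    print("selected layer:", layer)
    print("ID curve:", [round(v, 2) for v in curve])
```

In a real sweep, each entry of `states_by_layer` would hold one layer’s hidden-state vectors for the same sample of inputs. The brute-force neighbour search is quadratic in the sample size, which is fine for a few hundred points but would need a proper nearest-neighbour index for larger samples.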

The hypothesis is structural: earlier layers encode abstract semantic content (including whether a claim is grounded or confabulated), while later layers are dominated by surface-level token prediction. The first intrinsic-dimension peak, the authors argue, marks the point where the model’s internal representation is richest in task-relevant semantics, before it is compressed into output-formatted form.

This is plausible but unproven as a causal mechanism. The correlation between ID peaks and detection accuracy is an empirical observation across the tested models and benchmarks, not a derived result from information theory.

The paper also evaluated existing layer-selection criteria, including information-theoretic measures, gradient-based approaches (relative gradient norm, signal-to-noise ratio), and geometric measures (curvature). None of these consistently identified high-performing layers across the diverse architectures and tasks tested, according to the authors’ reported results (tables not independently verified).

First-Sentence Truncation: a cheap improvement

Separate from layer selection, the paper introduces a second technique that requires no architectural changes: First-Sentence Truncation (FST).

Conventional hidden-state probes extract representations at the last generated token. FST instead extracts at the last token of the first generated sentence. The reasoning is that as generation continues past the first sentence, the hidden states accumulate noise from degenerate repetition, inconsistent continuations, and semantic drift. By reading the model’s internal state at the point where it has just committed to its first substantive claim, you capture a cleaner signal.
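A minimal sketch of the extraction-point change, assuming the generated tokens are available as decoded text pieces alongside their hidden states. The boundary rule (sentence-final punctuation followed by whitespace or end of output) is an assumption of this sketch rather than the paper’s documented rule, and the function names are hypothetical.

```python
import re

# Sentence-final punctuation, optionally followed by closing quotes/brackets.
_SENTENCE_END = re.compile(r'[.!?]["\')\]]*$')


def first_sentence_end_index(token_texts):
    """Return the index of the last token of the first generated sentence.

    Falls back to the final token when no boundary is found, which
    reproduces the conventional last-token extraction point.
    ASSUMPTION: a boundary is punctuation followed by whitespace or the end
    of the output; abbreviations such as "Dr." will trigger it too early.
    """
    if not token_texts:
        raise ValueError("token_texts must be non-empty")
    text = ""
    for i, piece in enumerate(token_texts):
        text += piece
        is_last = i == len(token_texts) - 1
        next_starts_break = not is_last and token_texts[i + 1][:1].isspace()
        if _SENTENCE_END.search(text.rstrip()) and (is_last or next_starts_break):
            return i
    return len(token_texts) - 1


def fst_feature(layer_states, token_texts):
    """Pick the hidden-state vector at the first-sentence boundary.

    `layer_states` is one layer's per-token vectors for the generated tokens,
    aligned one-to-one with `token_texts`.
    """
    if len(layer_states) != len(token_texts):
        raise ValueError("states and tokens must be aligned")
    return layer_states[first_sentence_end_index(token_texts)]


if __name__ == "__main__":
    tokens = ["The", " capital", " is", " Paris", ".", " It", " has", " 2",
              ".", "1", " million", " people", "."]
    print(first_sentence_end_index(tokens))
```

The whitespace check keeps a decimal point such as the one in “2.1” from being mistaken for a sentence end. A production version would likely use a proper sentence segmenter instead of a regular expression.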

The authors report that FST consistently outperforms the last-token heuristic across all baselines tested. The specific AUROC deltas are the authors’ own figures, and the per-benchmark tables were not independently verified. The improvement holds regardless of whether the detector uses FEPoID-selected layers or a fixed layer.
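For readers who want to check an AUROC delta on their own data, the metric can be computed directly from probe scores and labels. The sketch below uses the pairwise-ranking definition of AUROC. The function name is hypothetical, and the scores in the usage example are toy values invented for illustration, not results from the paper.

```python
def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen positive example
    (label 1, e.g. hallucinated) scores higher than a randomly chosen
    negative one (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy values only: the same four labeled outputs scored by two
    # hypothetical probes that differ in where they read the hidden state.
    labels = [1, 1, 0, 0]
    probe_a_scores = [0.9, 0.4, 0.5, 0.1]
    probe_b_scores = [0.9, 0.6, 0.5, 0.1]
    delta = auroc(probe_b_scores, labels) - auroc(probe_a_scores, labels)
    print("AUROC delta:", delta)
```

Comparing two extraction points this way only makes sense when both probes are scored on the same held-out examples with the same labels.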

What this costs: per-checkpoint calibration

The uncomfortable implication of FEPoID is that hallucination detector portability is worse than the field assumed.

If the optimal detection layer shifts across model families and even across fine-tuned variants of the same base model, then shipping a detector means running an intrinsic-dimension sweep each time you change the underlying checkpoint. That sweep is not expensive in absolute terms; computing intrinsic dimension on a sample of hidden states is a batch operation that takes minutes on a single GPU, not hours. But it is a step the field previously ignored. Prior work treated layer selection as a one-time design choice. FEPoID makes it a per-deployment calibration task.

For teams operating a fleet of models (different sizes, different fine-tunes, different vendors), the operational cost scales with the number of checkpoints: each new checkpoint needs its own sweep and its own probe calibration. The code for FEPoID is publicly available, so the implementation burden is low, but the process discipline is new.

Evaluation caveats and the broader landscape

Any claim about hallucination detection accuracy needs to be read against a separate body of evidence that the benchmarks themselves may be misleading. “The Illusion of Progress,” a study published at EMNLP 2025, found that ROUGE-based evaluation systematically overestimates hallucination detection performance, with some established methods suffering AUROC drops of up to 45.9% when re-evaluated with human-aligned metrics (figure as cited in the EdinburghNLP repository; not independently re-verified).

Applied to the FEPoID results: the reported improvements are measured against standard benchmarks. If those benchmarks overstate detection quality in general, the absolute numbers are best read as upper bounds. The relative claim (that FEPoID-selected layers outperform fixed layers) may still hold, but the gap in real-world deployment could be smaller or larger than benchmark deltas suggest.

The broader field of hallucination detection is active and fragmented. Recent work has examined reasoning models producing more hallucinations under certain conditions (NeurIPS 2025), multi-benchmark detection frameworks like Cognometry, and the persistent gap between automated metrics and human judgment. None of these lines directly address the layer-selection problem FEPoID targets, which is part of why the paper’s contribution is specific and bounded: it does not claim to solve hallucination detection, only to fix a suboptimal preprocessing step that most existing detectors share.

The FST finding may prove more durable than FEPoID itself. Even if the intrinsic-dimension hypothesis turns out to be a correlation without a causal anchor, extracting hidden states earlier in the generation sequence is a simple, low-risk intervention that costs little more than changing one line of probing code. FEPoID is the more ambitious claim, and the one that will need replication across model families and scales beyond the two tested (LLaMA-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3) before it becomes a default practice.

Frequently Asked Questions

Does FEPoID work on closed-weight models like GPT-4 or Claude?

No. FEPoID requires access to internal hidden states at every layer, which proprietary API providers do not expose. The technique is restricted to open-weight models where you control the inference stack and can extract activations. Prompt-based and output-probability methods remain the only detection options for closed-weight models, though they trade away the internal signal that hidden-state probing provides.

When does First-Sentence Truncation provide no benefit?

When the model answers in a single sentence, the first-sentence boundary and the last-generated token occupy the same position, making FST and the standard heuristic identical. The technique only adds signal for multi-sentence outputs where tokens produced after the first claim introduce noise. For short-form QA benchmarks that elicit one-sentence responses, the two extraction points converge.

Why did gradient-based and geometric layer-selection criteria fail?

The paper reports the failure empirically; it does not prove why. One plausible reading is that gradient-based methods such as relative gradient norm reflect how the model learns rather than how it represents truthfulness at inference time, while geometric measures such as curvature reflect local manifold structure without distinguishing semantic content from surface token prediction. FEPoID instead targets the representational richness of each layer’s activation space directly via intrinsic dimension, which empirically correlates with detection accuracy, though the paper demonstrates correlation without proving causation.

Has FEPoID been validated on models larger than 8 billion parameters?

The published experiments cover only LLaMA-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3. Models at 70B and above have substantially more layers, and their intrinsic-dimension curves may exhibit multiple peaks that complicate the first-peak heuristic. Whether the first peak in an 80-layer model still corresponds to the richest semantic representation is unresolved. Teams running probes on larger checkpoints should verify the selected layer independently before relying on FEPoID.