Do Concept Bottleneck Model Benchmarks Measure Interpretability or Dataset Bias?

Concept bottleneck models promise interpretability by forcing predictions through human-readable concepts. But a new preprint from Skirzynski et al., posted June 3, argues that the benchmarks used to evaluate those models confound genuine concept learning with spurious dataset correlations. If the claim holds, reported interpretability gains across the CBM literature may be artifacts of the test sets rather than evidence that models learned anything causal.

What Concept Bottleneck Models Promise, and Where Evaluation Breaks Down

Concept Bottleneck Models, introduced by Koh et al. at ICML 2020, replace the standard end-to-end classifier with a two-stage architecture: a backbone predicts human-interpretable concepts (e.g., “bone spurs present,” “wing coloration”), and a second layer maps those concepts to the final label. The payoff is supposed to be twofold. The model is auditable because a human can inspect the concept activations. And the model is intervenable because a domain expert can correct a wrong concept at inference time and propagate the fix downstream.
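
The two-stage structure and the intervention step can be sketched in a few lines. This is a minimal illustration with hand-set weights, not the architecture or training procedure from Koh et al.; the function names (`predict_concepts`, `predict_label`, `cbm_forward`) and all weight values are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_concepts(x: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Stage 1: map input features to one probability per concept."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bk)
            for row, bk in zip(W, b)]


def predict_label(c: List[float], v: List[float], v0: float) -> float:
    """Stage 2: map concept probabilities to a label probability."""
    return sigmoid(sum(vk * ck for vk, ck in zip(v, c)) + v0)


def cbm_forward(x: List[float], W: List[List[float]], b: List[float],
                v: List[float], v0: float,
                interventions: Optional[Dict[int, float]] = None
                ) -> Tuple[List[float], float]:
    """Run both stages; optionally overwrite concepts before stage 2."""
    c = predict_concepts(x, W, b)
    for k, value in (interventions or {}).items():
        c[k] = value  # an expert replaces the predicted value of concept k
    return c, predict_label(c, v, v0)


# Hypothetical weights: two input features, two concepts, one binary label.
W = [[2.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v, v0 = [3.0, 3.0], -4.0
x = [1.0, -2.0]

c_raw, y_raw = cbm_forward(x, W, b, v, v0)
# The expert decides concept 1 is actually present and corrects it.
c_fixed, y_fixed = cbm_forward(x, W, b, v, v0, interventions={1: 1.0})
```

In this toy setup the uncorrected label probability falls below 0.5 and the corrected one rises above it, which is the "propagate the fix downstream" behavior described above. The label never sees `x` directly, only the concept vector.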

According to author-reported results, the original paper's CBMs reached accuracy competitive with black-box baselines on knee x-ray grading (OAI) and bird species identification (CUB-200). The claim was compelling enough that CBMs became a standard reference point in the interpretable-ML literature: if you want a model you can trust, force it to reason through concepts.

The evaluation problem went mostly unexamined. Natural-image datasets like CUB-200 and MIMIC-CXR contain rich statistical regularities that have nothing to do with the labeled concepts. Background texture correlates with bird species. Image intensity distributions correlate with pathology grades. A model that exploits these shortcuts can achieve high concept-accuracy scores without ever learning the concept the label describes. The benchmark reports a single number, and that number confounds two things: genuine concept learning and shortcut exploitation.

The Confound: Spurious Correlations in Natural-Image CBM Benchmarks

The core claim in arXiv:2606.04326 is structural, not empirical. On any natural-image dataset, four evaluation properties cannot be independently varied: data modality, concept choice, annotation quality, and concept completeness. If you want to test how a CBM degrades when annotations are noisy, you cannot hold everything else constant because the noise is confounded with the image distribution. If you want to test how concept completeness affects downstream accuracy, you cannot disentangle missing concepts from the visual shortcuts the model uses to compensate.

This is not a hypothetical concern. Kim et al. (2024) found what they termed “undesirable biases in CBMs built on pre-trained models” and proposed a foundation-model pipeline intended to construct CBMs with minimal human effort while remaining immune to those biases. The phrasing is careful. The biases exist because pre-trained representations carry spurious correlations from their training data into the concept layer. The pipeline works around them. It does not eliminate the underlying confound.

M-CBM, accepted at ICLR 2026, made the point more directly: prior CBMs “often significantly trail their black-box counterpart when controlling for information leakage.” The M-CBM approach extracts concepts from Sparse Autoencoders fitted to a black-box model’s activations, rather than specifying concepts a priori. The information-leakage control is what matters. When you actually prevent the model from accessing shortcut signals, the concept bottleneck costs accuracy, sometimes substantially. That result is author-reported and specific to the M-CBM experimental setup, but it aligns with the structural argument from the synthetic-benchmark paper: uncontrolled evaluations inflate CBM performance.

Inside the Synthetic Benchmarks

The Skirzynski et al. contribution is an evaluation framework, not a new model. The benchmarks generate labeled synthetic datasets where modality, concept set, annotation noise, and concept completeness are independently controllable parameters. This means you can run the same CBM architecture under conditions that differ in exactly one variable and observe how the evaluation metric responds.
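
A toy generator makes the "one variable at a time" idea concrete. The sketch below is not the paper's benchmark code: `make_synthetic`, its parameters, and the majority-vote labeling rule are all assumptions chosen for illustration, and it covers only annotation noise and concept completeness (modality and concept choice are left out). The one design point it demonstrates is using separate random streams so that changing a knob never changes the underlying data.

```python
import random
from typing import List, NamedTuple


class Example(NamedTuple):
    true_concepts: List[int]  # ground-truth concept values
    annotated: List[int]      # what the annotator recorded (possibly noisy, possibly partial)
    label: int


def make_synthetic(n: int, n_concepts: int = 4, noise_rate: float = 0.0,
                   completeness: float = 1.0, seed: int = 0) -> List[Example]:
    """Generate examples where noise and completeness vary independently."""
    data_rng = random.Random(seed)       # drives concepts and labels only
    noise_rng = random.Random(seed + 1)  # drives annotation flips only
    n_visible = max(1, round(completeness * n_concepts))
    examples = []
    for _ in range(n):
        true_c = [data_rng.randint(0, 1) for _ in range(n_concepts)]
        # Hypothetical labeling rule: positive when most concepts are present.
        label = int(sum(true_c) * 2 > n_concepts)
        # Only the first n_visible concepts are annotated; each may be flipped.
        annotated = [1 - c if noise_rng.random() < noise_rate else c
                     for c in true_c[:n_visible]]
        examples.append(Example(true_c, annotated, label))
    return examples


clean = make_synthetic(200)
noisy = make_synthetic(200, noise_rate=0.3)      # differs only in annotation noise
partial = make_synthetic(200, completeness=0.5)  # differs only in completeness
```

Because the concept and label draws come from their own stream, `clean`, `noisy`, and `partial` share identical ground truth, so any change in a downstream metric can be attributed to the single knob that was turned.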

The paper frames the evaluation around two deployment scenarios. In decision-support mode, the model’s concept predictions are presented to a human who makes the final call. In automation mode, the model acts without human oversight, and the concept layer is supposed to provide post-hoc interpretability. The failure modes differ between the two. In decision support, noisy or incomplete annotations degrade the human’s ability to trust or correct the model’s reasoning. In automation, a model that achieves high concept accuracy through spurious correlations gives a false sense of interpretability because the “concepts” the model activates are not causally connected to the prediction.

The benchmarks are publicly available, which matters. If the central claim is that prior evaluations are confounded, the remedy has to be reproducible. A closed benchmark that asserts confounds in open benchmarks would be a hard sell.

What the Benchmarks Reveal About Failure Modes

According to author-reported results, the synthetic benchmarks can diagnose specific failure modes in both deployment scenarios. When annotation quality degrades, decision-support accuracy drops in a predictable way that depends on concept completeness, not just noise magnitude. When concept completeness is reduced, automation-mode models compensate by exploiting residual statistical signals in the data, maintaining task accuracy while concept fidelity declines.

This is the second-order finding that matters for practitioners. A CBM evaluated on a natural dataset might report high concept accuracy alongside high task accuracy. On a synthetic benchmark with controlled completeness, the same architecture can reveal that only a fraction of the specified concepts actually contribute to the prediction, with the residual “accuracy” coming from shortcut signals that the natural benchmark cannot detect because it never varies concept completeness in isolation.
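
The compensation effect can be reproduced in miniature. In the sketch below, which is an illustration under stated assumptions rather than the paper's experiment, the label depends on four concepts, a hypothetical `shortcut` feature agrees with the label 95% of the time, and a majority-vote lookup table (scored in-sample for simplicity) stands in for the label predictor.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def sample(n: int, shortcut_agreement: float = 0.95,
           seed: int = 0) -> List[Tuple[Tuple[int, ...], int, int]]:
    """Draw (concepts, shortcut, label) triples from a toy distribution."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        concepts = tuple(rng.randint(0, 1) for _ in range(4))
        label = int(sum(concepts) >= 3)  # hypothetical rule over all 4 concepts
        # The shortcut is not a concept; it merely co-occurs with the label.
        shortcut = label if rng.random() < shortcut_agreement else 1 - label
        data.append((concepts, shortcut, label))
    return data


def lookup_accuracy(keys: Sequence[Hashable], labels: Sequence[int]) -> float:
    """In-sample accuracy of predicting the majority label for each key."""
    votes = defaultdict(Counter)
    for k, y in zip(keys, labels):
        votes[k][y] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in votes.values())
    return correct / len(labels)


data = sample(4000)
labels = [y for _, _, y in data]

# Complete concept set: the label is fully determined by the concepts.
full = lookup_accuracy([c for c, _, _ in data], labels)
# Incomplete concept set: only 2 of the 4 concepts are available.
partial = lookup_accuracy([c[:2] for c, _, _ in data], labels)
# Incomplete concept set plus the leaked shortcut signal.
leaky = lookup_accuracy([c[:2] + (s,) for c, s, _ in data], labels)
```

With half the concepts removed, accuracy from concepts alone drops to roughly 0.81 in expectation, yet adding the shortcut restores it to about 0.95. A benchmark that reports only task accuracy would not distinguish the `leaky` predictor from one that had access to the complete concept set.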

The paper’s caveat is worth repeating: synthetic benchmarks expose confounds; they do not replace natural benchmarks. The argument is for controlled supplementary evaluation, not wholesale abandonment of existing test sets.

The Information-Leakage Thread

The synthetic-benchmark paper does not emerge in isolation. The CBM evaluation problem has been accumulating evidence for roughly two years.

A 2026 IEEE publication on measuring and addressing information leakage in CBMs indicates that leakage is now a recognized research thread with its own proposed metrics. The paper introduces formal ways to quantify how much information passes through the concept bottleneck that is not captured by the labeled concepts. This is the same structural concern that motivates the synthetic benchmarks: if the bottleneck leaks, the concept accuracy numbers are not measuring what they claim to measure.

M-CBM’s contribution, according to author-reported findings, is to show that Sparse Autoencoder-derived concepts recover more of the black-box model’s decision boundary than a priori-specified concepts, precisely because the SAE fits to what the model actually uses rather than what the designer assumes it should use. The implicit criticism of standard CBM construction is that hand-specified concept sets are incomplete, and the model fills the gaps with leaked information.

The synthetic benchmarks from Skirzynski et al. extend this line by giving researchers a controlled environment to measure the gap between specified concepts and actual model behavior. The three threads together form a coherent critique: standard CBM evaluations overestimate concept fidelity because natural datasets contain shortcut signals, and the leakage metrics needed to detect this are still under development and not yet standardized.

When to Demand Synthetic Validation

The practical question is straightforward. If you are deploying a CBM in a domain where interpretability is the justification for using it rather than a black-box classifier, what evidence should you require before trusting the concept layer?

Three conditions from the accumulated literature:

The concept set must be complete enough that the model cannot compensate for missing concepts by exploiting dataset shortcuts. The synthetic benchmarks demonstrate that incomplete concept sets produce misleading accuracy numbers. Testing under controlled completeness levels is the only way to measure the gap.

Annotation quality must be varied independently of the data distribution. On natural datasets, annotation noise correlates with image characteristics (harder images get noisier labels). The synthetic benchmarks break this correlation, revealing how the model responds to noise alone.

Information-leakage metrics should be reported alongside concept accuracy. If the leakage metrics are not yet standardized, the minimum bar is to report them using the best available method and flag the measurement as preliminary.

For teams deploying CBMs in medical imaging, credit decisioning, or any domain where the interpretability guarantee is load-bearing, the synthetic benchmarks from Skirzynski et al. provide a concrete starting point. The cost of skipping controlled evaluation is not a degraded model. It is a model that performs well on the benchmark while the interpretability property you deployed it for is absent. That is a worse failure than a model that is visibly inaccurate, because the numbers look correct and the audit trail looks clean while the underlying mechanism is doing something else entirely.

Frequently Asked Questions

Do these confound findings extend to tabular or time-series CBMs, or only image domains?

The synthetic framework controls for data modality as an independent variable, suggesting modality-agnostic design, but all reported results use image-like synthetic data. On tabular domains such as credit scoring or sensor analytics, the shortcut problem may manifest differently. Structured features tend to have more direct, linear correlations with target labels than pixel patterns do, which can make information leakage detectable through standard feature-importance audits without requiring synthetic controls.

How does the synthetic-benchmark approach differ from counterfactual intervention to test concept fidelity?

Counterfactual intervention flips a concept value and checks whether the prediction changes, testing whether a concept is causally linked to the output on a specific input. The synthetic benchmarks test whether the evaluation setup itself can separate genuine concept learning from shortcut exploitation. A CBM could pass a counterfactual test on natural data (flipping “bone spurs” flips the diagnosis) while still relying on intensity-distribution shortcuts that correlate with both the concept and the label. Synthetic benchmarks catch this by varying the correlation structure directly.
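
The difference between the two tests shows up even in a two-input toy. The sketch below assumes a hypothetical `leaky_predict` function in which the leak is modeled as an explicit second input with hand-picked weights; in a real CBM the leak would travel through the bottleneck or a side channel, so treat this as an illustration only.

```python
import math


def leaky_predict(concept: int, shortcut: int) -> int:
    """Toy label predictor that relies on a concept AND a leaked shortcut.

    The weights are hypothetical and chosen so that neither signal alone
    is enough to produce a positive prediction.
    """
    logit = 4.0 * concept + 4.0 * shortcut - 6.0
    return int(1.0 / (1.0 + math.exp(-logit)) > 0.5)


# Natural data: the shortcut co-occurs with the concept, so both are 1.
baseline = leaky_predict(concept=1, shortcut=1)

# Counterfactual intervention: flip the concept, keep the input (and
# therefore the shortcut) fixed. The prediction flips, so the test passes.
intervened = leaky_predict(concept=0, shortcut=1)
passes_counterfactual = baseline != intervened

# Synthetic control: break the correlation so the concept is present but
# the shortcut is absent. The prediction changes even though the concept
# did not, which exposes the reliance on the shortcut.
decorrelated = leaky_predict(concept=1, shortcut=0)
relies_on_shortcut = decorrelated != baseline
```

Both `passes_counterfactual` and `relies_on_shortcut` come out true here: the model looks causally faithful under intervention on natural inputs and is only caught once the concept and the shortcut are varied independently.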

What should a team do when synthetic-benchmark results conflict with strong natural-dataset performance?

The paper positions synthetic benchmarks as diagnostic supplements, not replacements. A conflict signals that the natural dataset contains shortcut signals the CBM is exploiting. For decision-support deployments, identify which concepts show the largest accuracy gap between synthetic and natural evaluation and have domain experts probe those concepts manually. For automation, the conflict itself is evidence that the interpretability guarantee is unreliable: the model should not ship without expanding the concept set or adding a leakage metric such as NCC or NEC as a continuous monitor.

Could mature leakage metrics make synthetic benchmarks unnecessary?

If NCC and NEC were standardized and independently validated, they could quantify confounds directly on natural data without synthetic controls. The blocker is circularity: these metrics are currently evaluated on the same natural datasets whose confound status is uncertain. Standardization requires an independent leakage ground truth, which is precisely what synthetic benchmarks supply. The two approaches are complementary rather than redundant: synthetic benchmarks calibrate the metrics, and the calibrated metrics enable ongoing monitoring on real deployments.