Do Word-Subset Explanations Satisfy the EU AI Act's Transparency Rule?

A KDD 2026 paper attributes LLM outputs to input words without model access, but shows which tokens mattered, not how the model reasoned, creating an EU AI Act compliance gap.

A post-hoc explanation method that works without model access

A June 2026 preprint from Minyoung Hwang, accepted to the KDD 2026 Research Track, proposes explaining black-box language model predictions by selecting a small subset of input words that are most responsible for the output. The method requires no access to model weights or gradients. It works on API-only deployments. And it claims to beat both conventional black-box explanation methods and gradient-based approaches that do have oracle access to model internals.

The question the title asks is whether explanations like these satisfy transparency mandates under the EU AI Act. The short answer: they satisfy a narrow, technical reading of “which input features mattered,” but they do not explain why the model reached its conclusion. That distinction matters for anyone deploying a high-risk AI system in Europe this year.

How word-subset selection works

Most post-hoc explanation methods for language models fall into one of two camps. Gradient-based methods (Integrated Gradients, gradient-based SHAP variants) need access to the model’s internal differentiable computation graph. Perturbation-based methods (like LIME for text) work on black boxes but tend to produce explanations that are unstable or linguistically incoherent, because they score each token independently.
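To make the perturbation-based camp concrete, here is a minimal sketch of leave-one-out occlusion, a simpler relative of LIME rather than LIME itself. The `toy_black_box` function, its scores, and the example sentence are all hypothetical stand-ins for an API-only model; nothing here comes from the paper.

```python
from typing import Callable, List


def toy_black_box(words: List[str]) -> float:
    """Hypothetical stand-in for an API-only model: returns a 'reject' score."""
    score = 0.1
    if "bankruptcy" in words:
        score += 0.6
    # This effect only fires when BOTH words are present.
    if "late" in words and "payment" in words:
        score += 0.2
    return score


def occlusion_scores(words: List[str],
                     model: Callable[[List[str]], float]) -> List[float]:
    """Score each word by how far the output drops when that word alone is removed."""
    base = model(words)
    return [base - model(words[:i] + words[i + 1:]) for i in range(len(words))]


words = "applicant filed bankruptcy after late payment".split()
scores = occlusion_scores(words, toy_black_box)

for word, score in zip(words, scores):
    print(f"{word:>10}  {score:+.2f}")
```

In this toy setup, “late” and “payment” each receive the same credit on their own, even though the model only reacts to the pair. Per-token scoring has no way to say that the two words matter as a unit, which is the incoherence the paragraph above describes.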

Hwang’s method occupies a middle position. It trains a selection policy using REINFORCE-style policy gradients to choose a discrete subset of input words. Because the selection is discrete, standard backpropagation does not apply; the policy-gradient formulation sidesteps that by estimating the gradient of the expected reward from sampled selections. And because the policy only observes input-output pairs from the target model, no parameter or gradient access is needed.
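The general idea of REINFORCE-style subset selection against a black box can be sketched in a few lines. This is an illustration under stated assumptions, not the paper’s algorithm: the policy here is one independent Bernoulli logit per word, fitted for a single input rather than trained across many, and the reward (fidelity to the full-input output minus a sparsity penalty), the `toy_black_box` model, and every hyperparameter are hypothetical.

```python
import math
import random
from typing import Callable, List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def toy_black_box(words: List[str]) -> float:
    """Hypothetical API-only model: we can call it, but never see inside it."""
    return 0.9 if "bankruptcy" in words else 0.1


def reinforce_select(words: List[str],
                     model: Callable[[List[str]], float],
                     steps: int = 3000,
                     lr: float = 0.5,
                     sparsity: float = 0.1,
                     seed: int = 0) -> List[float]:
    """Learn per-word keep probabilities using only input-output queries."""
    rng = random.Random(seed)
    logits = [0.0] * len(words)   # assumed policy: one Bernoulli logit per word
    target = model(words)         # output on the full input
    baseline = 0.0                # running mean reward, reduces variance
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        mask = [1 if rng.random() < p else 0 for p in probs]
        kept = [w for w, m in zip(words, mask) if m]
        # Assumed reward: stay close to the full-input output, keep few words.
        reward = -abs(target - model(kept)) - sparsity * sum(mask)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # For a Bernoulli with logit z, d/dz log p(mask) = mask - sigmoid(z).
        for i, (m, p) in enumerate(zip(mask, probs)):
            logits[i] += lr * advantage * (m - p)
    return [sigmoid(z) for z in logits]


words = "applicant declared bankruptcy in 2023".split()
probs = reinforce_select(words, toy_black_box)

for word, p in zip(words, probs):
    print(f"{word:>10}  keep probability {p:.2f}")
```

No gradient ever flows through `toy_black_box`; the update uses only the sampled mask, its log-probability gradient, and the scalar reward. That is what lets this family of methods run against an API-only deployment.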

The linguistic structure component is what separates this from a naive token-selection heuristic. The selection policy incorporates graph-structured linguistic knowledge, so the chosen word subsets correspond to syntactically and semantically coherent spans rather than scattered individual tokens. The paper’s stated goal is explanations that end-users can actually interpret and act on, per its abstract on arXiv.
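The abstract does not say how the linguistic graph enters the policy, so the following is only one way to picture the difference between a coherent subset and scattered tokens. It assumes a hand-written, dependency-like graph (the `edges` list is hypothetical, not parser output) and checks whether a selected set of word positions forms a single connected piece of that graph.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def is_connected(selected: Set[int], edges: List[Tuple[int, int]]) -> bool:
    """True if the selected word indices form one connected piece of the graph."""
    if not selected:
        return True
    # Keep only edges whose two endpoints are both selected.
    neighbors: Dict[int, Set[int]] = {i: set() for i in selected}
    for a, b in edges:
        if a in selected and b in selected:
            neighbors[a].add(b)
            neighbors[b].add(a)
    # Breadth-first search from an arbitrary selected word.
    start = next(iter(selected))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == selected


words = "the applicant declared bankruptcy in 2023".split()
# Hypothetical dependency-like edges between word positions.
edges = [(1, 0), (2, 1), (2, 3), (2, 5), (5, 4)]

coherent = {2, 3, 5}   # "declared", "bankruptcy", "2023"
scattered = {0, 3}     # "the", "bankruptcy"

print(is_connected(coherent, edges), is_connected(scattered, edges))
```

A constraint or reward term of this kind would favor subsets that read as a phrase over subsets that read as isolated keywords. Whether the paper uses connectivity, spans, or some other structure is not stated in the abstract.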

Reported results, with caveats

The authors report that their method outperforms both conventional black-box explanation baselines and gradient-based methods given oracle access to model gradients, across “diverse DLM architectures and real-world datasets” (arXiv:2606.08497).

That is a strong claim. The paper’s abstract does not disclose specific benchmark numbers, dataset names, or the magnitude of improvement. The full PDF may contain those details; this article works from the abstract and metadata. The claim is plausible given the KDD Research Track acceptance (which involves peer review), but independent replication has not been reported as of June 2026.

The transparency gap: attribution is not reasoning

Here is where the regulatory relevance sharpens.

Word-subset attribution answers a specific question: “Which tokens in the input most influenced this output?” It does not answer: “What reasoning path did the model follow?” or “What internal representations activated to produce this response?”

These are different questions, and the difference is not academic. A credit-scoring model might attribute its “reject” decision to the words “bankruptcy” and “2023” in an applicant’s text summary. That tells you which words the output was most sensitive to. It does not tell you whether the model learned a proxy for a protected class, whether it applied the same logic to similar cases, or whether the decision boundary is consistent across demographic groups.

Feature attribution and reasoning explanation are complementary, not interchangeable. The paper is forthright about delivering the former. The compliance risk arises when deployers treat the former as satisfying legal obligations that were written with the latter in mind.
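A toy case shows why the two are not interchangeable. The sketch below defines two hypothetical classifiers with different decision rules and explains each with a brute-force search for the smallest word subset that reproduces the full-input decision. That search is a stand-in for any word-subset method, not Hwang’s; the models and sentences are invented for illustration.

```python
from itertools import combinations
from typing import Callable, List


def model_a(words: List[str]) -> str:
    """Hypothetical rule: reject only a bankruptcy dated 2023."""
    return "reject" if "bankruptcy" in words and "2023" in words else "accept"


def model_b(words: List[str]) -> str:
    """Hypothetical rule: reject a bankruptcy alongside ANY four-digit year."""
    has_year = any(len(w) == 4 and w.isdigit() for w in words)
    return "reject" if "bankruptcy" in words and has_year else "accept"


def smallest_sufficient_subset(words: List[str],
                               model: Callable[[List[str]], str]) -> List[str]:
    """Smallest subset of words that alone reproduces the full-input decision."""
    target = model(words)
    for size in range(len(words) + 1):
        for idx in combinations(range(len(words)), size):
            subset = [words[i] for i in idx]
            if model(subset) == target:
                return subset
    return list(words)


recent = "applicant declared bankruptcy in 2023".split()
older = "applicant declared bankruptcy in 2009".split()

explanation_a = smallest_sufficient_subset(recent, model_a)
explanation_b = smallest_sufficient_subset(recent, model_b)

print(explanation_a, explanation_b)            # identical explanations
print(model_a(older), model_b(older))          # different decisions
```

Both models yield the same explanation for the 2023 applicant, yet they disagree on the 2009 applicant. The word subset is accurate in each case; it simply carries no information about which rule produced it.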

EU AI Act implications

The EU AI Act establishes a transparency framework for high-risk AI systems deployed in the European market. The regulation does not prescribe a specific technical method for explaining AI outputs. Its requirements are deliberately technology-neutral, which avoids mandating specific techniques but leaves deployers uncertain about what counts as sufficient in practice.

A word-subset explanation from Hwang’s method is a real, if limited, form of transparency. By the paper’s account, it is technically sound, linguistically coherent, and deployable without model access. Whether it satisfies regulatory transparency expectations depends on the use case. For a content-moderation flag where knowing which words triggered the classification is sufficient, it may pass. For a hiring recommendation where the deployer must demonstrate the model did not rely on protected characteristics, it almost certainly does not.

The gap is not in the method. The gap is in what a black-box explanation can deliver versus what a regulator will demand. No amount of clever word selection will surface a model’s decision boundary, training data biases, or internal failure modes. Those require model access, documentation, and testing regimes that exist upstream of any post-hoc explanation.

What deployers should actually do

Relying on any single post-hoc explanation method to satisfy transparency obligations is a compliance bet, not a compliance strategy. The practical path forward layers multiple signals:

Model access where possible. If you control the model, gradient-based methods and internal probing can inspect internal states that black-box attribution never sees. Use them.

Post-hoc attribution as a complement, not a substitute. Methods like Hwang’s are useful for API-only deployments where model access is unavailable. They can surface which inputs influenced an output. They cannot certify that the model’s reasoning is sound.

Documentation as the primary compliance artifact. The EU AI Act’s transparency obligations are best satisfied by technical documentation of the model’s training data, evaluation procedures, known limitations, and failure modes. Post-hoc explanations are supporting evidence, not the main exhibit.

Legal review before technical choices. The adequacy of an explanation method is a legal judgment, not a technical one. Engineering teams should not be making unilateral calls about what satisfies transparency obligations without input from counsel familiar with EU digital regulation.

Hwang’s paper is a genuine technical contribution: a gradient-free, linguistically structured attribution method that the authors report outperforms alternatives on their benchmarks. That contribution should be evaluated on its own terms. Conflating it with what transparency regulations require does a disservice to both the research and the regulation.

Frequently Asked Questions

How does the amortized policy differ from LIME’s per-instance approach?

LIME fits a local linear surrogate from scratch for every prediction, which makes it slow on long inputs and sensitive to sampling noise. Hwang’s REINFORCE policy trains once on input-output pairs and generalizes to new inputs without re-optimizing. That trades an upfront training cost for faster, more stable explanations at inference time, but the policy can smooth over input-specific quirks that per-instance surrogate fitting would catch.

Does word-subset selection work for generative outputs, or only classification?

The paper benchmarks on classification-style tasks across diverse language model architectures. For generative outputs like summarization or open-ended dialogue, the explanation target is ambiguous: you would need to map which input tokens drove which specific output span, and the current single-prediction formulation does not define that mapping. Extending the method to generative settings would require per-output-span attribution, which is an unsolved problem for black-box methods generally.

Can a token-attribution method detect when a model relies on relational patterns between sentences?

No. Neural language models distribute information across hidden-state dimensions, not discrete tokens. Word-subset selection identifies which tokens correlated with the output but cannot surface relational reasoning, such as a model inferring sentiment from the contrast between two adjacent sentences rather than from any single word. This is a structural limitation of all input-attribution methods, regardless of how linguistically coherent the selected subset is.

What does the EU’s recent GDPR enforcement activity signal for AI Act transparency guidance?

The EU marked the GDPR’s tenth anniversary on May 22, 2026, a reminder that European regulators have years of institutional experience enforcing technical compliance on digital systems: the regulation was adopted in 2016 and has applied since May 2018. If AI Act transparency obligations follow the same trajectory, expect sector-specific implementing guidance that narrows what counts as an adequate explanation. Methods that pass a general plausibility test today may fall short once national authorities issue domain-specific interpretability standards for hiring, credit, or medical AI.

sources · 4 cited

  1. Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets (primary; accessed 2026-06-09)
  2. Your gateway to the EU (vendor; accessed 2026-06-09)
  3. ArXiv (analysis; accessed 2026-06-09)
  4. European Union (analysis; accessed 2026-06-09)