The most capable AI models released in 2025 are also the least transparent ones. According to the 2025 Foundation Model Transparency Index (FMTI), incorporated into Stanford HAI’s 2026 AI Index released on April 13, 2026, average transparency scores across major frontier model developers collapsed from 58/100 in 2024 to 40.69/100, erasing two years of progress and returning the industry to roughly where it stood in 2023.[1]

What the FMTI Actually Measures

The FMTI is not a capability benchmark. It is an accountability audit covering 100 binary indicators organized into three domains: upstream (how training data was acquired, compute used, and methods applied), model (architecture documentation, risk evaluations, mitigations, and release conditions), and downstream (usage data, incident reporting, and acceptable use policies).[1]
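
To make that structure concrete, here is a minimal sketch of how a binary-indicator audit scores out: each indicator is a yes/no disclosure check, and the headline number is simply the fraction satisfied, scaled to 100. The indicator names below are illustrative stand-ins of our own invention, not the FMTI’s official wording.

```python
from dataclasses import dataclass

@dataclass
class Indicator:
    domain: str      # "upstream", "model", or "downstream"
    name: str        # illustrative label, not official FMTI wording
    disclosed: bool  # binary: did the developer disclose this?

# Hypothetical partial assessment of one developer
# (the real index scores 100 indicators)
assessment = [
    Indicator("upstream",   "training data sources documented", False),
    Indicator("upstream",   "compute used disclosed",           False),
    Indicator("model",      "risk evaluations published",       True),
    Indicator("model",      "release conditions stated",        True),
    Indicator("downstream", "incident reporting mechanism",     True),
]

# Score = fraction of indicators satisfied, scaled to 100
score = 100 * sum(i.disclosed for i in assessment) / len(assessment)
print(f"Transparency score: {score:.1f}/100")  # 60.0/100 for this toy example
```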

The biggest opacity gaps are in training data properties, where companies average just 15% disclosure, and in environmental impact. These are not edge cases: they are core questions any regulated deployment team would need answered before putting a model in production.

The Score Collapse in Numbers

The 2024 average of 58/100 represented genuine progress from the 2023 baseline of approximately 37/100. The 2025 average of 40.69 effectively reset the clock.[1]

The steepest individual drops:

Company      2024 Score        2025 Score   Change
Mistral      55                18           −37
Meta         60                31           −29
OpenAI       (not disclosed)   (lower)      −14
Google       (not disclosed)   (lower)      −6
Anthropic    (not disclosed)   (lower)      −5

At the extremes, IBM scored 95/100, the highest in FMTI history, and was the only evaluated company enabling external replication of training data. xAI and Midjourney both scored 14/100, among the lowest ever recorded.[1]

One important caveat on IBM: its most capable models are not frontier-scale systems competing with GPT-4 class or Gemini Ultra class models. The 95/100 score is meaningful for what IBM disclosed, but the comparison to companies deploying significantly larger and more capable systems is not perfectly symmetrical.

The Three Silent Withdrawals

The most consequential finding in the 2026 AI Index is not a single company’s collapse but a simultaneous rollback by the industry’s three leading labs. As of the 2026 AI Index release date, Google, Anthropic, and OpenAI have all stopped disclosing dataset sizes and training duration for their latest flagship models.[2]

These are not obscure details. Dataset size informs bias and coverage questions. Training duration, combined with compute disclosures, determines whether independent parties can estimate resource costs and environmental impact. When all three labs withdraw these disclosures at roughly the same time, there is no longer a competitive dynamic pushing any of them to differentiate on openness.
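
To see why those two numbers matter, consider the standard back-of-the-envelope method for estimating a training run’s footprint: accelerator count times training duration times average power draw, scaled by a datacenter overhead factor (PUE) and converted to emissions via grid carbon intensity. Every input in the sketch below is hypothetical; the point is that without disclosed training duration and hardware figures, an independent party cannot fill in any of them.

```python
# Back-of-the-envelope training footprint estimate. All inputs are
# hypothetical: with training duration and hardware undisclosed, an
# independent auditor cannot source these numbers for a frontier model.
gpu_count      = 10_000  # hypothetical accelerator count
training_days  = 90      # hypothetical duration (the withdrawn disclosure)
avg_power_kw   = 0.7     # assumed average draw per accelerator, kW
pue            = 1.2     # assumed datacenter power usage effectiveness
grid_kgco2_kwh = 0.35    # assumed grid carbon intensity, kgCO2e per kWh

energy_kwh = gpu_count * training_days * 24 * avg_power_kw * pue
emissions_tco2 = energy_kwh * grid_kgco2_kwh / 1_000  # kg -> tonnes

print(f"Estimated energy: {energy_kwh / 1_000:,.0f} MWh")      # ~18,144 MWh
print(f"Estimated emissions: {emissions_tco2:,.0f} tCO2e")     # ~6,350 tCO2e
```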

Meta’s 29-point drop followed a specific trigger: the release of Llama 4 without an accompanying technical report, a document that had been standard practice for prior Llama generations.[1] Mistral’s 37-point decline is the sharpest single-company drop in the dataset.

Open Weights ≠ Open Methodology

One of the most persistent misconceptions in this space is that releasing model weights is equivalent to transparency. The FMTI data directly contradicts this.

Among open-weight developers, the divergence is dramatic: IBM and AI21 Labs average 82.5/100 on the FMTI, while Alibaba, DeepSeek, and Meta average just 25.3/100.[1] All of these companies release weights. None of the latter group provides the training data provenance, compute details, or incident reporting that the FMTI tracks.

Downloading a model’s weights lets you run it and observe what it does. It tells you almost nothing about what data it was trained on, how long it ran, what evaluations were performed, or what failure modes were identified internally. For a practitioner doing due diligence before a regulated deployment, those omissions are the ones that matter.

The Frontier Model Forum Cluster

All five Frontier Model Forum members (Amazon, Anthropic, Google, Meta, and OpenAI) score within the middle band of the FMTI.[1] The FMTI researchers describe this as consistent with a “common incentive” pattern: avoid the lowest ranks, which would attract scrutiny, without differentiating through strong transparency, which would create competitive pressure to disclose more.

This is worth being precise about. The FMTI paper does not claim coordination. It observes that all five companies have landed in the same range and notes the incentive structure that would produce that outcome without any direct communication. Whether the clustering reflects deliberate strategy, parallel decision-making, or coincidence is not something the data can resolve.

What the data does show is that the Forum’s voluntary framework has not produced differentiation on transparency. The floor and the ceiling are effectively the same for all five members.

What This Means for Auditors and Responsible Deployment

For teams deploying frontier models in regulated sectors — finance, healthcare, legal, government — the FMTI’s 100-indicator structure is more useful than its summary score. The indicators are public and specific enough to convert a vague “black box” concern into a concrete gap analysis.

The three domains map directly to due-diligence questions:

Upstream: Can you identify what data was used to train this model? Can you assess potential biases from data provenance? Do you know the compute footprint?

Model: Is there documentation of internal risk evaluations? Are known failure modes disclosed? Under what conditions was the model released?

Downstream: Is usage telemetry available? Are incidents reported publicly? Is there a clear acceptable-use policy with enforcement mechanisms?

For the three leading labs, as of April 2026, the answers to most upstream questions are now “no”, a change from prior model generations where at least some of this information was available in technical reports.
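
One way to operationalize this is to turn the questions above into an explicit gap-analysis checklist keyed to the three FMTI domains. The sketch below is a minimal illustration; the question wording and the vendor answers are hypothetical, not the FMTI’s official indicator text.

```python
# Minimal due-diligence gap analysis keyed to the FMTI's three domains.
# Question wording is illustrative, not the FMTI's official indicators.
checklist = {
    "upstream": [
        "Training data sources identified?",
        "Data provenance and bias assessable?",
        "Compute footprint disclosed?",
    ],
    "model": [
        "Internal risk evaluations documented?",
        "Known failure modes disclosed?",
        "Release conditions stated?",
    ],
    "downstream": [
        "Usage telemetry available?",
        "Incidents reported publicly?",
        "Enforceable acceptable-use policy?",
    ],
}

# Hypothetical vendor answers gathered during a review
answers = {q: False for qs in checklist.values() for q in qs}
answers["Incidents reported publicly?"] = True

for domain, questions in checklist.items():
    gaps = [q for q in questions if not answers[q]]
    print(f"{domain}: {len(questions) - len(gaps)}/{len(questions)} disclosed")
    for q in gaps:
        print(f"  GAP: {q}")
```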

The Regulatory Forcing Function

The EU AI Act entered full enforcement in January 2026, creating the first hard legal floor for transparency obligations on high-risk AI systems.[4] EU AI Act Code of Practice signatories score higher on the FMTI than non-signatories, according to the 2026 AI Index, though the difference is characterized as marginal rather than substantial.

The enforcement timeline matters because it changes the calculus for both deployers and developers. Under voluntary frameworks, the cost of opacity is reputational. Under the EU AI Act, it is legal. Developers selling into EU markets for high-risk applications will need to produce documentation that, in many cases, they have stopped generating.

Ten of the 13 evaluated companies disclose zero information about energy usage, carbon emissions, or water consumption.[1] Those ten face a specific downstream problem: several EU member states and the Act itself include environmental reporting provisions. A company that stopped tracking those figures in 2025 will face reconstruction costs in 2026 that preventive disclosure would have avoided.

FAQ

Why did transparency scores fall so sharply in a single year?

The FMTI researchers point to two convergent dynamics: intensified competitive pressure between frontier labs (which creates incentives to withhold training methodology as proprietary), and the transition from research-lab culture to commercial-product culture at companies like Meta and Mistral. Earlier Llama generations were released with academic-style technical reports. Llama 4, released as a commercial product, was not.[1]

Is the IBM 95/100 score a meaningful comparator?

With caveats. IBM’s Granite models, the ones evaluated, are not frontier-scale systems in the sense of competing with the largest closed models from OpenAI, Google, or Anthropic. IBM’s transparency practices are genuinely strong on the FMTI’s indicators, but the comparison has an asymmetry: IBM is not disclosing the training details of a model trained at frontier compute scales. The disclosure costs and competitive sensitivity are different.[1]

What is the practical difference between training data disclosure and weight release?

Weight release allows researchers to run the model, probe its behavior, and potentially extract some information about training through analysis. Training data disclosure tells you directly what sources were used, what was excluded, and what consent or licensing governed data acquisition. These are complementary, not substitutable — and the FMTI measures the latter, not the former.


Footnotes

  1. The 2025 Foundation Model Transparency Index — arXiv

  2. Stanford HAI 2026 AI Index: China and US Now Neck and Neck — SiliconAngle

  3. Stanford AI Index 2026: Capabilities Are Historic, Transparency Has Collapsed — Analytics Drift

  4. Inside the AI Index: 12 Takeaways from the 2026 Report — Stanford HAI
