Six months after opening Claude Code access to thousands of engineers, Microsoft is pulling most of those licenses. Fortune reported on May 22[1] that usage scaled far faster than budget models predicted and that the company is migrating engineers to GitHub Copilot CLI instead. The cancellations are not a strategic retreat from Anthropic; the broader Microsoft-Anthropic Foundry partnership, including the $5B investment and $30B Azure commitment, remains in place.[1] But at the tooling level, the unit economics broke.
The behavioral evidence, not the memo
There is no leaked Microsoft internal study documenting that per-token agent costs exceed human labor. What exists is more useful for procurement teams: a vendor with essentially unlimited bargaining power over AI infrastructure chose to cancel a popular tool because the bill grew faster than the value. Nvidia VP Bryan Catanzaro put the dynamic plainly: “For my team, the cost of compute is far beyond the costs of the employees.”[1] When the VP of the company selling the compute says compute costs more than the people using it, that is the cost study.
Uber’s data point is sharper. CTO Praveen Neppalli Naga told The Information[1] that Uber burnt through its entire 2026 AI coding tools budget in four months, after the company actively incentivized adoption through internal leaderboards. That is not a planning failure. That is a structural mispricing: the vendor priced the tool as if usage would plateau, and usage did what usage does when you make it free at the point of consumption.
Token-maxxation and the volume trap
Amazon is pushing employees to “tokenmaxx”[1], explicitly incentivizing higher AI usage. A Meta employee built an internal “Claudeonomics” leaderboard to track who consumed the most AI resources.[1] The pattern across these companies is the same: drive adoption first, measure unit economics later. This works when marginal cost is low and declining. It stops working when the marginal cost of an agentic task involves multiple model calls, tool-use loops, and context windows that inflate token counts by orders of magnitude beyond a single chat completion.
The adoption incentives also obscure a measurement problem. Nobody at these companies was tracking cost-per-completed-task before incentivizing consumption. They were tracking seats activated and tokens consumed, which measures vendor revenue, not buyer value.
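To make the difference between the two metrics concrete, here is a minimal Python sketch of cost-per-completed-task computed from a task log. The `TaskRun` record, the sample runs, and the $10 per million tokens price are all hypothetical illustrations, not figures from any of the companies above.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One agent attempt at a task (hypothetical log record)."""
    tokens: int       # total tokens billed for this attempt
    completed: bool   # True if the task finished successfully


def cost_per_completed_task(runs, price_per_million_tokens):
    """Total spend across ALL attempts divided by the number of completed tasks."""
    total_tokens = sum(run.tokens for run in runs)
    completed = sum(1 for run in runs if run.completed)
    total_cost = total_tokens / 1_000_000 * price_per_million_tokens
    if completed == 0:
        # Money was spent but nothing was delivered.
        return float("inf")
    return total_cost / completed


# Hypothetical log: two successes, two failures that still consumed tokens.
runs = [
    TaskRun(tokens=400_000, completed=True),
    TaskRun(tokens=900_000, completed=False),
    TaskRun(tokens=700_000, completed=False),
    TaskRun(tokens=500_000, completed=True),
]

print(cost_per_completed_task(runs, price_per_million_tokens=10.0))
```

In this made-up log the failed runs account for most of the tokens, so a dashboard that reports only tokens consumed would show healthy usage while the cost of each delivered task is far higher than the successful runs alone suggest.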
The Gartner paradox: cheaper tokens, higher bills
Gartner projects that by 2030, running inference on a one-trillion-parameter model will cost roughly 90% less than it did in 2025.[1] Goldman Sachs forecasts agentic AI driving a 24-fold increase in token consumption by 2030, reaching 120 quadrillion tokens per month.[1]
Do the arithmetic. A 90% cost reduction leaves 10% of the original unit cost, and 0.10 times a 24x volume increase yields a 2.4x net cost increase, before you account for the part Gartner flagged explicitly: providers will not fully pass through the savings.[1] The unit cost of a token drops. The number of tokens per task rises by a larger factor. The bill goes up. This is the structural argument that makes the Microsoft and Uber data points durable rather than anecdotal.
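The same arithmetic can be written as a small Python sketch. The 90% and 24x inputs are the Gartner and Goldman Sachs figures cited above; the `pass_through` parameter and its 50% example value are assumptions added purely for illustration, since the article only says the savings will not be fully passed through.

```python
def net_bill_multiplier(unit_cost_reduction, volume_multiplier, pass_through=1.0):
    """Ratio of the future bill to today's bill.

    unit_cost_reduction: fractional drop in the provider's cost per token (0.90 = 90%)
    volume_multiplier:   growth in tokens consumed (24.0 = 24x)
    pass_through:        share of the cost reduction that reaches the buyer's
                         price (1.0 = fully passed through; hypothetical knob)
    """
    if not 0.0 <= unit_cost_reduction <= 1.0:
        raise ValueError("unit_cost_reduction must be between 0 and 1")
    if not 0.0 <= pass_through <= 1.0:
        raise ValueError("pass_through must be between 0 and 1")
    price_ratio = 1.0 - unit_cost_reduction * pass_through
    return price_ratio * volume_multiplier


# Full pass-through: the 2.4x figure from the text.
print(round(net_bill_multiplier(0.90, 24.0), 2))

# Assumed 50% pass-through (illustrative only): the bill grows far more.
print(round(net_bill_multiplier(0.90, 24.0, pass_through=0.5), 2))
```

Under full pass-through the multiplier is 2.4; any lower pass-through only pushes it higher, which is why the 2.4x figure is best read as a floor under these projections rather than a forecast.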
Where agents actually break
The cost problem is compounded by the failure profile. SaaSBench (arXiv .17526)[2] tested coding agents on realistic enterprise SaaS engineering tasks and found that over 95% of failures occurred before agents reached business logic.[2] The bottleneck is multi-component system configuration and integration, not the code generation that benchmarks optimize for. Agents burn tokens navigating build systems, resolving dependency conflicts, and wiring up boilerplate, then fail before reaching the part where domain expertise matters. The expensive part of the task is the part they are worst at.
EnterpriseArena (arXiv .23638)[3] tested 23 LLMs on a 132-month CFO simulation. Only 15.4% of trials survived the full horizon.[3] Larger models did not reliably outperform smaller ones, which undermines the assumption that throwing more parameters at agentic tasks will fix the failure profile. If model size does not predict task completion, then the inference-cost-to-success-rate ratio is worse than simple token accounting suggests.
AgentStop (arXiv .15206)[4] measured the physical cost: agentic execution significantly increases GPU power draw, temperature, and battery drain compared to single-inference workloads.[4] Their system can reduce wasted energy by 15-20% via predictive early termination, which is a useful optimization but also an admission that a material fraction of agentic compute is spent on tasks that will fail.[4]
What procurement teams now have
Before this week, a CFO pushing back on per-token pricing in an enterprise AI contract had to argue from theory: “We think your costs are lower than your price.” That is a weak negotiating position because the vendor controls the cost data.
After this week, the same CFO can point to Microsoft canceling its own licenses for cost reasons, Uber exhausting a year’s budget in four months, and Nvidia’s VP stating publicly that compute costs exceed people costs at his own company.[1] These are not the CFO’s estimates. They are the vendors’ revealed preferences. The negotiating asymmetry just shifted.
The strategic question for buyers is not whether to adopt agentic AI. It is whether to accept per-token pricing for agentic workloads before the failure-rate-to-cost ratio is characterized. The data from SaaSBench and EnterpriseArena suggests that ratio is unfavorable today.[2][3] The behavioral evidence from Microsoft and Uber suggests the vendors already know it.[1]
Frequently Asked Questions
Does the cost problem apply to locally-run agents, or only cloud API usage?
The AgentStop study specifically tested agentic workloads on consumer devices, not data-center GPUs. For companies deploying local agents on employee laptops, the cost extends beyond API fees to accelerated device replacement cycles and increased IT support burden from sustained GPU thermal and battery stress — a cost category that cloud-focused procurement models ignore entirely.
If larger models don’t outperform smaller ones on agentic tasks, what should buyers actually optimize?
EnterpriseArena’s 132-month simulation found no reliable correlation between parameter count and task completion across 23 models, making cost-per-completed-task the metric that matters rather than cost-per-token. Practically, this means investing in deterministic scaffolding tooling that handles the configuration and integration layer (where SaaSBench shows 95%+ of failures occur) before invoking any LLM. Doing so cuts the failed attempts that currently inflate the cost of each completed task without bound.
How does the agentic cost curve differ from traditional cloud compute scaling?
Cloud compute costs scale roughly linearly with usage — twice the VM-hours costs roughly twice as much. Agentic costs scale superlinearly because failed attempts trigger retry loops that each consume multiple inference calls, and the most token-intensive phase (multi-step agent execution) is also the phase with the highest failure rate. This creates a curve where marginal spend produces near-zero marginal success until the underlying integration failure mode is addressed, unlike cloud workloads where each additional dollar buys a proportional increase in capacity.
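A rough way to see the shape of that curve is a toy retry model in Python. It assumes every attempt costs the same and succeeds independently with a fixed probability, which is a simplification (real retries are correlated and vary in length); the $2.00 per attempt and the success rates below are hypothetical.

```python
def expected_cost_per_completed_task(cost_per_attempt, success_probability):
    """Expected spend per completed task when failed attempts are retried.

    With independent attempts and a fixed success probability p, the number
    of attempts until the first success is geometric with mean 1 / p, so the
    expected spend per completed task is cost_per_attempt / p.
    """
    if not 0.0 < success_probability <= 1.0:
        raise ValueError("success_probability must be in (0, 1]")
    return cost_per_attempt / success_probability


# Hypothetical $2.00 per attempt at falling success rates.
for p in (0.9, 0.5, 0.2, 0.05):
    cost = expected_cost_per_completed_task(2.0, p)
    print(f"success rate {p:.0%}: ${cost:.2f} per completed task")
```

Under these assumptions the cost of a completed task is inversely proportional to the success rate, so spend climbs steeply as the success rate approaches zero. That is the regime the integration failures described above would put a buyer in.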
What happens to AI budgets if token consumption actually reaches Goldman’s projected 24x by 2030?
At 120 quadrillion tokens per month, enterprise AI spend would shift from an engineering-tool line item to a top-five operating expense category for most technology companies — comparable to current cloud infrastructure spend. Since Gartner projects providers won’t fully pass through the projected 90% inference cost reduction, buyers will absorb nearly all of the 24x volume increase while capturing a fraction of the unit-cost savings. Companies that negotiate volume-committed rates or per-outcome pricing now, before agentic adoption broadens internally, will hold a structural cost advantage over those renegotiating after consumption patterns are locked in.
Footnotes
1. Microsoft reports are exposing AI’s real cost problem
2. SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
3. Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation
4. AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices