Microsoft's 2026 Cost Math Forces CrewAI and LangGraph Users to Audit Token Spend Per Agent

Microsoft is canceling most of its internal Claude Code licenses six months after encouraging thousands of engineers to use the tool, because per-agent token bills now exceed what those salaried engineers cost. Fortune reported on 2026-05-22 that usage scaled past projections so fast that the economics inverted: the tool became more expensive than the human it was supposed to augment. For teams running multi-agent frameworks, this is not a Microsoft problem. It is a preview of their own budget line.

The Signal That Broke the Token Math

Microsoft’s pullback is not a judgment on Claude Code’s capability. It is an accounting correction. The company redirected engineers to GitHub Copilot CLI after internal cost tracking revealed that per-agent token consumption, across hundreds of parallel workflows, had overtaken the fully loaded cost of the engineers running those workflows. Nvidia VP Bryan Catanzaro confirmed the same pattern holds inside his organization: “For my team, the cost of compute is far beyond the costs of the employees.” That is the hardware vendor saying the thing it sells costs more than the people who use it.

Uber hit the same wall from a different angle. CTO Praveen Neppalli Naga told The Information (via Fortune) that Uber burned through its entire 2026 AI coding tools budget in four months. The company had actively driven adoption through internal leaderboards ranking teams by AI tool usage. More usage won. The bill arrived faster than anyone modeled.

These are not outliers. They are the first large organizations with enough internal transparency to publish the numbers most companies are still treating as sensitive.

The New Cost Paradox

The forecasts say inference gets cheaper. Gartner predicted that running a one-trillion-parameter model will cost roughly 90% less by 2030. That is a forecast, not a current data point. Meanwhile, Goldman Sachs projected that agentic AI could drive a 24-fold increase in token consumption by 2030, reaching 120 quadrillion tokens per month. Both can be true simultaneously: tokens get cheaper, and bills go up because agentic systems consume orders of magnitude more of them per task than a single chat completion.
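The arithmetic behind "both can be true" fits in a few lines. This is a minimal sketch that assumes the two forecasts can simply be multiplied together. That is an illustrative simplification, not a forecast: the Gartner figure is for a one-trillion-parameter model and the Goldman Sachs figure is for aggregate consumption. The function name `relative_bill` is hypothetical.

```python
def relative_bill(price_drop: float, consumption_multiple: float) -> float:
    """Return the future bill as a multiple of today's bill.

    price_drop: fractional fall in per-token price (0.90 means 90% cheaper).
    consumption_multiple: how many times more tokens are consumed.
    """
    if not 0.0 <= price_drop <= 1.0:
        raise ValueError("price_drop must be between 0 and 1")
    return (1.0 - price_drop) * consumption_multiple


if __name__ == "__main__":
    # Figures from the forecasts cited above: 90% cheaper, 24 times the tokens.
    print(f"{relative_bill(0.90, 24):.1f}x today's bill")
```

Under those assumptions the bill lands at about 2.4 times today's level, even though each token costs a tenth as much.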

Microsoft’s own Agent Factory uses consumption-based metering, and IBRS warns that parallel plans, retries, and recursive calls can spike costs well beyond forecast. Its assessment: Agent 365 could be “out of control on day one.” Security Copilot SCUs are priced at $6 each beyond the free allocation of 400 per 1,000 users. These are not edge-case scenarios. They are the default behavior of autonomous agents doing what they were designed to do: retry, branch, and call tools until the task completes.
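To see why retrying, branching, and recursing outrun a forecast, it helps to write down how they multiply. This is a rough sketch that assumes every call costs the same number of tokens, every branch runs to completion, and every call is attempted a fixed number of times. Real agents vary on all three, and the function and parameter names are hypothetical.

```python
def worst_case_tokens(tokens_per_call: int, branches: int,
                      retries_per_call: int, recursion_depth: int) -> int:
    """Upper bound on tokens for one task under simple fan-out assumptions.

    Each call spawns `branches` sub-calls at the next level of recursion,
    and every call may be attempted 1 + retries_per_call times.
    """
    # Calls per level: 1 at the root, then branches, branches**2, and so on.
    total_calls = sum(branches ** level for level in range(recursion_depth + 1))
    attempts_per_call = 1 + retries_per_call
    return tokens_per_call * total_calls * attempts_per_call
```

With illustrative inputs of 2,000 tokens per call, three branches, two retries, and two levels of recursion, the bound is 78,000 tokens. That is 39 times the single-call figure a naive forecast would start from.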

The Observability Gap in Open-Source Agent Frameworks

This is where the cost problem collides with the tooling most practitioners are actually using. The three dominant open-source agent frameworks, CrewAI, LangGraph, and AutoGen, were built to solve orchestration: how do you coordinate multiple LLM calls, maintain shared state, and handle tool use across agents. Cost attribution was not a design constraint. It is becoming one.

CrewAI. The open-source framework’s task callback shape does not natively emit per-step token counts. CrewAI’s AMP platform markets “Tracing & Observability” and a “Unified Control Plane,” but that is the paid tier. The open-source layer gives you task completion status and agent-level logging. It does not give you a breakdown of how many tokens Agent 3 burned on Tool Call 7 of Task 12. Teams that need that granularity are bolting on Langfuse exporters or writing OpenTelemetry shims around the Task and Agent classes.
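The shape of such a shim is simple even when wiring it into a particular framework is not. The sketch below is framework-agnostic: it does not use the actual CrewAI, Langfuse, or OpenTelemetry APIs. It assumes the wrapped step returns a dict with a `usage` entry holding `input_tokens` and `output_tokens`, which is a hypothetical convention. `metered`, `EVENTS`, and `summarize` are illustrative names.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In a real deployment these events would go to an exporter, not a list.
EVENTS: List[Dict[str, Any]] = []


def metered(agent: str, step: str) -> Callable:
    """Decorator that records token usage for one agent step as structured data."""
    def decorator(fn: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
            started = time.perf_counter()
            result = fn(*args, **kwargs)
            usage = result.get("usage", {})  # assumed result shape (hypothetical)
            EVENTS.append({
                "agent": agent,
                "step": step,
                "input_tokens": usage.get("input_tokens", 0),
                "output_tokens": usage.get("output_tokens", 0),
                "seconds": time.perf_counter() - started,
            })
            return result
        return wrapper
    return decorator


@metered(agent="researcher", step="summarize")
def summarize(text: str) -> Dict[str, Any]:
    # Stand-in for an LLM call; the usage numbers are made up for the example.
    return {"output": text[:20], "usage": {"input_tokens": 120, "output_tokens": 30}}
```

Each call to `summarize` appends one record to `EVENTS` tagged with the agent and step. That per-step breakdown is the granularity the open-source layer leaves to the user.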

LangGraph. LangGraph’s checkpointer persists graph state between steps, which is useful for resumption and debugging, but the checkpointer is state, not telemetry. It records what happened. It does not record what it cost. Token counts per node execution, per tool invocation, per conditional branch, are not surfaced as first-class outputs. Teams instrumenting LangGraph for cost are wrapping the graph execution in callbacks or using LangSmith’s tracing layer, which works but requires the LangSmith subscription and introduces a vendor dependency the framework itself avoids.

AutoGen. AutoGen’s GroupChat maintains a shared message log across agents. The log captures conversational turn-taking and tool responses. It does not natively decompose token consumption by agent or by tool call within a multi-turn exchange. For a two-agent workflow that is manageable. For an eight-agent pipeline with conditional routing and tool retries, the message log tells you what happened, not what it cost, and not which agent ran over budget.
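Without native per-agent counts, the fallback is post-hoc attribution from the log itself. The sketch below is deliberately crude. It assumes the log is a list of dicts with `name` and `content` keys, a simplified stand-in rather than AutoGen's actual message schema. It estimates tokens at roughly four characters each, which is a rule of thumb, not a tokenizer. It also counts only what each agent wrote, not the growing shared context each agent re-reads on every turn, so it understates real spend.

```python
from collections import defaultdict
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def output_tokens_by_agent(log: List[Dict[str, str]]) -> Dict[str, int]:
    """Sum estimated output tokens per speaker in a shared message log."""
    totals: Dict[str, int] = defaultdict(int)
    for message in log:
        speaker = message.get("name", "unknown")
        totals[speaker] += estimate_tokens(message.get("content", ""))
    return dict(totals)


if __name__ == "__main__":
    # Made-up log entries, for illustration only.
    log = [
        {"name": "planner", "content": "Break the task into three steps."},
        {"name": "executor", "content": "Step one done."},
        {"name": "planner", "content": "Proceed to step two."},
    ]
    print(output_tokens_by_agent(log))
```

The need for a heuristic like this is the point: the log was built to replay a conversation, not to bill for it.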

What First-Class Cost Telemetry Would Require

The gap is not theoretical. EY’s Sameer Gupta told CNBC: “The biggest challenge is not measuring usage, it is proving attribution. Leaders can see where AI is being used and where productivity appears to improve, but isolating AI as the primary driver is hard.”

Solving this at the framework level would require three primitives that none of the major frameworks ship today as default, first-class outputs:

Per-step token emission. Every node, every tool call, every LLM invocation in the graph emits input tokens, output tokens, and model identifier as structured data, not as a log line to be parsed.
Per-agent accumulation. The framework maintains a running cost counter per agent identity, so you can query “how much did the planner agent cost vs. the executor agent” without post-hoc log correlation.
Per-task budget envelopes. The ability to set a token ceiling at the task or workflow level, with a defined behavior (halt, degrade to a cheaper model, alert) when the ceiling is hit.
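Taken together, the three primitives amount to a small ledger sitting between the orchestrator and the model client. This is a minimal sketch under stated assumptions. `TokenEvent`, `CostLedger`, and `BudgetExceeded` are hypothetical names, not APIs from CrewAI, LangGraph, or AutoGen. Halting is the only ceiling behavior implemented; degrading to a cheaper model or alerting would hook in at the same point.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TokenEvent:
    """Per-step emission: one structured record per LLM or tool invocation."""
    task: str
    agent: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens


class BudgetExceeded(RuntimeError):
    """Raised once a task's recorded usage has gone over its token ceiling."""


class CostLedger:
    def __init__(self, task_budgets: Dict[str, int]) -> None:
        self.task_budgets = task_budgets                   # per-task budget envelopes
        self.events: List[TokenEvent] = []
        self.by_agent: Dict[str, int] = defaultdict(int)   # per-agent accumulation
        self.by_task: Dict[str, int] = defaultdict(int)

    def record(self, event: TokenEvent) -> None:
        """Store the event, update running totals, then enforce the task ceiling."""
        self.events.append(event)
        self.by_agent[event.agent] += event.total
        self.by_task[event.task] += event.total
        budget = self.task_budgets.get(event.task)
        if budget is not None and self.by_task[event.task] > budget:
            raise BudgetExceeded(
                f"task {event.task!r} used {self.by_task[event.task]} tokens, "
                f"ceiling is {budget}"
            )
```

Because the check runs after a call has been recorded, this version stops the workflow after the call that crossed the ceiling, not before it. A stricter design would estimate cost before dispatch.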

CrewAI’s AMP tier is closest to offering some of this as a managed service. LangSmith provides tracing for LangGraph but ties it to the LangChain ecosystem. AutoGen’s tracing story remains thin. None of these are framework primitives. They are add-ons, and that distinction matters when the CFO asks for per-agent cost reports and the answer is “we need to rewrite our callback layer first.”

The Second-Order Effect

The first-order consequence is already visible: Microsoft and Uber have internal cost data, and it is changing their behavior. Microsoft is pulling back on a tool it promoted six months ago, and Uber exhausted a full year's budget in four months. The second-order consequence is what happens when this pattern reaches the procurement stage at enterprises running custom multi-agent deployments on open-source frameworks.

Those enterprises will not have Microsoft’s internal telemetry. They will have a CrewAI crew that runs 14 agents across eight tasks, a monthly API bill that doubled last quarter, and no way to tell which agent caused the spike. The framework that ships first-class cost telemetry in its next minor version wins those procurement cycles. The ones that don’t will be audited out of pilots before the next budget review, not because they cannot orchestrate agents, but because nobody can tell the CFO what the orchestration cost.

Frequently Asked Questions

Does the token-cost problem only affect multi-agent setups, or have single-agent tools already broken budgets?

Single-agent tools already invert the cost equation. Microsoft’s Claude Code pullback and Uber’s four-month budget exhaustion involved individual coding assistants, not multi-agent pipelines. Multi-agent frameworks compound the problem because each parallel branch, retry, and tool invocation generates its own token bill, but the threshold where compute exceeds human cost has already been crossed at the single-tool level.

How did Uber’s internal leaderboards turn AI adoption into a budget problem?

Uber ranked teams by AI tool usage volume on internal leaderboards, incentivizing adoption without tying the metric to productivity outcomes. Fortune and CNBC flag this as part of a broader “tokenmaxxing” culture at companies including Amazon and Meta, where usage volume is conflated with effectiveness. That pattern becomes especially dangerous when the underlying agent framework cannot attribute which specific actions drove the spend.

If inference costs drop 90% by 2030 as Gartner projects, does the cost-observability gap still matter?

Yes. Goldman Sachs projects a simultaneous 24-fold increase in token consumption from agentic workloads, meaning even at one-tenth the current per-token price, total bills could still reach roughly 2.4x today’s level. Cheaper tokens remove the urgency to reduce consumption per call but increase the volume of transactions needing attribution, making per-agent cost telemetry more necessary, not less.

What does Microsoft’s Security Copilot SCU pricing reveal about consumption-based agent costs?

Security Copilot bills in SCUs at $6 each beyond a free allocation of 400 per 1,000 users, so a single agent workflow with multiple retries and tool calls can burn through several SCUs in one session. Per-SCU billing exposes itemized costs that per-token API pricing hides, but it also means parallel agent plans or recursive calls face compounding per-action charges with no natural ceiling. That is the exact pattern IBRS warned could be “out of control on day one.”