Sonnet 5 vs GPT-5.5: Pricing, Benchmarks, and the Switching Math

Claude Sonnet 5 launched on June 30, 2026 at an introductory rate of $2 per million input tokens and $10 per million output tokens, undercutting GPT-5.5’s $5/$30 by 60% on input and 67% on output.¹⁴ After August 31 the rate rises to $3/$15, matching Sonnet 4.6’s existing price point.³ On the five benchmark results both vendors published using comparable harnesses, Sonnet 5 leads on all five. The pricing gap is real. The benchmark lead is real. Neither tells the full story.

Where the benchmarks land

Anthropic’s system card for Sonnet 5 and OpenAI’s GPT-5.5 announcement share four evaluation harnesses: SWE-bench Pro, Terminal-Bench 2.1, Humanity’s Last Exam (with and without tools), and OSWorld-Verified.¹² A fifth comparable point comes from Anthropic’s Opus 4.8 release, which published GPT-5.5 figures Anthropic ran on the same harnesses.¹⁰

Benchmark	Sonnet 5	GPT-5.5	Delta
SWE-bench Pro	63.2%	58.6%	Sonnet 5 +4.6
Terminal-Bench 2.1	80.4%	78.2%	Sonnet 5 +2.2
HLE, no tools	43.2%	41.4%	Sonnet 5 +1.8
HLE, with tools	57.4%	52.2%	Sonnet 5 +5.2
OSWorld-Verified	81.2%	78.7%	Sonnet 5 +2.5

All five Sonnet 5 figures are vendor-reported. All five GPT-5.5 figures are vendor-reported (OpenAI’s for the first four, Anthropic’s replication for the Opus 4.8 release comparison).¹²¹⁰ No independent third party has reproduced either model’s scores on these harnesses with a neutral scaffold as of July 3, 2026.

Two caveats apply. First, Terminal-Bench 2.1 is scaffold-sensitive: GPT-5.5 scores 78.2% via the Terminus harness but 83.4% via OpenAI’s Codex CLI harness, a 5.2-point swing from scaffolding alone.² Anthropic used the Terminus harness for both models, so the 80.4 vs 78.2 comparison above is like-for-like. Teams running Codex CLI should expect GPT-5.5 to close or flip that gap.

Second, Anthropic restated Sonnet 4.6’s OSWorld-Verified score downward (from its original figure to 78.5%) when updating the evaluation methodology, and revised the HLE grader model, which makes the Sonnet 4.6-to-Sonnet 5 deltas harder to compare against the original launch numbers.⁹

Where GPT-5.5 still leads

Sonnet 5 wins the shared agentic benchmarks. GPT-5.5 holds clear leads in areas Anthropic did not publish, and those gaps matter for workload-specific routing decisions.

OpenAI reports GPT-5.5 at 93.6% on GPQA Diamond and 100% on AIME 2025.² Anthropic published neither score for Sonnet 5 (AIME is not in Sonnet 5’s evaluation table), which leaves the comparison one-sided. A model that does not report a score has not necessarily scored poorly, but it has not demonstrated parity either.

GPT-5.5 also scores 85.0% on ARC-AGI-2; third-party aggregators place Sonnet 5 at roughly 84.7%, a gap within measurement noise.²⁹ On long-context retrieval, GPT-5.5 posts 74.0% on MRCR v2 at 512K to 1M tokens. Sonnet 5 has no published figure on that benchmark.²

On speed, GPT-5.5 outputs roughly twice as fast as Sonnet 5 in third-party testing and offers a Fast mode in Codex that delivers 1.5x throughput at 2.5x cost.² For high-volume workloads where wall-clock time matters, this is a structural advantage.

The tokenizer problem

Sonnet 5 ships a new tokenizer. The API surface is unchanged; no code modifications required. But the tokenizer produces roughly 30% more tokens for the same text compared to Sonnet 4.6.⁵ Simon Willison measured the overhead independently: 1.42x for English prose, 1.28x for Python code, 1.33x for Spanish, and approximately 1.01x for Simplified Mandarin (effectively unchanged for CJK text).⁶

This matters because both vendors bill by the token. At Sonnet 5’s introductory $2/$10 rate, a team migrating from Sonnet 4.6 at $3/$15 sees a lower per-token price but a higher token count for identical input. The introductory rate appears designed to be roughly cost-neutral for Sonnet 4.6 migrations, which means the headline “60% cheaper than GPT-5.5” figure needs adjustment.⁵⁶

At standard pricing ($3/$15 starting September 1), Sonnet 5’s per-token rate matches Sonnet 4.6 exactly, but the tokenizer overhead makes it roughly 30% more expensive per character of actual text. Expressed in Sonnet 4.6-equivalent tokens, the effective rate lands near $3.90/$19.50 at the 30% average, or about $4.26/$21.30 at the 1.42x measured for English prose, narrowing but not closing the gap with GPT-5.5’s $5.00/$30.00.⁶⁸
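The adjustment is a single multiplication, which a short sketch makes easy to rerun for other content types. It assumes the overhead applies equally to input and output tokens and that GPT-5.5's token counts are comparable to Sonnet 4.6's; neither assumption is confirmed by the sources. The function and dictionary names are illustrative only, and the multipliers are the figures cited above.

```python
# Sonnet 5 token-count multipliers relative to Sonnet 4.6, as cited above.
OVERHEAD = {
    "average": 1.30,
    "english": 1.42,
    "python": 1.28,
    "spanish": 1.33,
    "mandarin": 1.01,
}


def effective_rate(input_rate, output_rate, multiplier):
    """Scale USD-per-million-token rates by tokenizer overhead.

    Assumption: the same multiplier applies to input and output tokens.
    The result is a rate per million Sonnet 4.6-equivalent tokens.
    """
    return input_rate * multiplier, output_rate * multiplier


# Standard Sonnet 5 pricing ($3/$15) adjusted for each content type.
for kind, multiplier in OVERHEAD.items():
    inp, out = effective_rate(3.00, 15.00, multiplier)
    print(f"{kind:>8}: ${inp:.2f} / ${out:.2f}")
```

Swapping in 2.00 and 10.00 gives the introductory-rate equivalents.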

Artificial Analysis measured the downstream effect on per-task cost. Sonnet 5 costs $2.29 per Intelligence Index task at standard pricing, about twice as much as Sonnet 4.6 ($1.14) and roughly 15% more than Opus 4.8, driven entirely by higher token consumption per task, not the per-token rate.⁷ The per-token price is no higher than Sonnet 4.6’s. The per-task price is.

Cost math for real workloads

For a typical API call of 100,000 input tokens and 10,000 output tokens, the raw per-token calculation at introductory pricing favors Sonnet 5:

Model	Input cost	Output cost	Total
Sonnet 5 (intro)	$0.20	$0.10	$0.30
Sonnet 5 (standard)	$0.30	$0.15	$0.45
GPT-5.5 (short ctx)	$0.50	$0.30	$0.80

Sonnet 5 is 63% cheaper at the introductory rate and 44% cheaper at the standard rate, in raw token terms. Adjusting upward by the roughly 30% average tokenizer overhead moves the effective cost to approximately $0.39 (intro) or $0.59 (standard), bringing the savings to roughly 51% and 27% respectively.⁶ At the 1.42x overhead measured for English prose, the savings shrink to roughly 47% and 20%.
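The per-call arithmetic can be reproduced with a few lines of Python. This is a minimal sketch, not a billing tool: `call_cost` and `savings_pct` are hypothetical helper names, and the `overhead` parameter assumes the tokenizer inflates input and output token counts by the same factor.

```python
def call_cost(input_tokens, output_tokens, input_rate, output_rate, overhead=1.0):
    """USD cost of one call; rates are USD per million tokens.

    `overhead` scales both token counts to model a tokenizer that emits
    more tokens for the same text (assumed equal for input and output).
    """
    token_cost = input_tokens * input_rate + output_tokens * output_rate
    return token_cost * overhead / 1_000_000


def savings_pct(cost, baseline):
    """Percentage by which `cost` undercuts `baseline`."""
    return (1 - cost / baseline) * 100


# GPT-5.5 short-context rates for the 100K-in / 10K-out call.
gpt = call_cost(100_000, 10_000, 5.00, 30.00)

for label, rates in {"intro": (2.00, 10.00), "standard": (3.00, 15.00)}.items():
    raw = call_cost(100_000, 10_000, *rates)
    adjusted = call_cost(100_000, 10_000, *rates, overhead=1.30)
    print(
        f"{label}: raw ${raw:.3f} ({savings_pct(raw, gpt):.1f}% cheaper), "
        f"adjusted ${adjusted:.3f} ({savings_pct(adjusted, gpt):.1f}% cheaper)"
    )
```

Passing `overhead=1.42` instead models the English-prose multiplier rather than the average.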

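For a longer agentic run totaling 500K input and 50K output tokens, the same pattern holds but the dollar gap widens (the GPT-5.5 short-context rate applies only if no single request crosses the 272K-token input threshold discussed below):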

Model	Input cost	Output cost	Total
Sonnet 5 (intro)	$1.00	$0.50	$1.50
Sonnet 5 (standard)	$1.50	$0.75	$2.25
GPT-5.5 (short ctx)	$2.50	$1.50	$4.00

GPT-5.5 also charges a premium for long context: input above 272K tokens triggers higher pricing, with input rising to $10 per million and output to $45.⁴ Sonnet 5 includes the full 1M context window at the standard rate with no surcharge.³ For workloads pushing past 272K input tokens, the cost gap compounds in Sonnet 5’s favor, which matters for multi-cloud routing decisions.
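To see how the threshold changes a single request's cost, here is a rough sketch. It assumes that once input exceeds 272K tokens the entire request is billed at the long-context rates; the source does not say whether the surcharge applies to the whole request or only to the tokens above the threshold, so treat that as a labeled guess. Function names are hypothetical, and tokenizer overhead is ignored.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, per the pricing cited above


def gpt55_request_cost(input_tokens, output_tokens):
    """USD cost of one GPT-5.5 request.

    Assumption: crossing the threshold reprices the whole request at
    $10 in / $45 out per million tokens, not just the excess tokens.
    """
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_rate, output_rate = 10.00, 45.00
    else:
        input_rate, output_rate = 5.00, 30.00
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def sonnet5_request_cost(input_tokens, output_tokens, input_rate=3.00, output_rate=15.00):
    """USD cost of one Sonnet 5 request at a flat rate (standard by default)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Compare costs on either side of the threshold, with 10K output tokens.
for n in (200_000, 272_000, 300_000, 500_000):
    gpt = gpt55_request_cost(n, 10_000)
    sonnet = sonnet5_request_cost(n, 10_000)
    print(f"{n:>7} input tokens: GPT-5.5 ${gpt:.2f}, Sonnet 5 ${sonnet:.2f}")
```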

Both vendors offer batch pricing at 50% of standard rates. Anthropic charges $0.20/MTok for cache reads at the introductory rate ($0.30 standard); OpenAI charges $0.50/MTok for cached short-context input.³⁴
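For cache-heavy workloads, the useful number is the blended input rate. The sketch below is a simplification: it models only the read discount, ignores cache-write charges and cache lifetime, and uses a hypothetical 80% hit ratio. `blended_input_rate` is an illustrative name, not part of either vendor's SDK.

```python
def blended_input_rate(base_rate, cache_read_rate, cache_hit_ratio):
    """Average USD per million input tokens when some are cache reads.

    Assumption: cache writes and cache expiry are not modeled.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return cache_hit_ratio * cache_read_rate + (1 - cache_hit_ratio) * base_rate


hit = 0.8  # hypothetical share of input tokens served from cache

print(f"Sonnet 5 intro:    ${blended_input_rate(2.00, 0.20, hit):.2f}")
print(f"Sonnet 5 standard: ${blended_input_rate(3.00, 0.30, hit):.2f}")
print(f"GPT-5.5 short ctx: ${blended_input_rate(5.00, 0.50, hit):.2f}")
```

Because each cached rate quoted above is one tenth of its base input rate, caching lowers both vendors' bills proportionally and leaves the relative gap between them unchanged under this model.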

How to choose

The decision breaks down by workload type.

Route to Sonnet 5 when the workload is agentic coding (SWE-bench Pro is the most relevant benchmark), tool-augmented reasoning, or general software engineering. Sonnet 5’s 63.2% on SWE-bench Pro beats GPT-5.5’s 58.6%, and the per-token savings are substantial, especially during the introductory period through August 31.¹ Teams already using Claude Code or the Anthropic API get the model with no integration changes.⁵

Route to GPT-5.5 when the workload demands GPQA-level science reasoning (93.6% vs. no published Sonnet 5 score), math competition performance, or high throughput. GPT-5.5 is roughly twice as fast as Sonnet 5, which matters for latency-sensitive workloads.² Teams using Codex CLI or the OpenAI Responses API get the model natively.

Route to Opus 4.8 when the budget allows and the task is hard enough to warrant the top Anthropic tier. Opus 4.8 still leads Sonnet 5 by 6 points on SWE-bench Pro (69.2% vs. 63.2%) and remains the highest-capability generally available Anthropic model, with Fable 5 access currently suspended.¹⁰

For teams already routing between providers via a gateway like OpenRouter¹³, the pragmatic move is Sonnet 5 as the default workhorse with GPT-5.5 as a fallback for math-heavy or latency-sensitive tasks, and Opus 4.8 for the hardest coding problems. The same workload-based routing logic applies when adding open-weight models to the mix.
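The routing rules in this section can be sketched as a small function. Everything here is illustrative: the workload categories, the model identifier strings, and the `pick_model` name are assumptions made for the example, not identifiers from any gateway's API, and real routing would also weigh cost ceilings and fallbacks.

```python
from enum import Enum


class Workload(Enum):
    AGENTIC_CODING = "agentic_coding"
    TOOL_REASONING = "tool_reasoning"
    SCIENCE_REASONING = "science_reasoning"
    COMPETITION_MATH = "competition_math"
    HARD_CODING = "hard_coding"


# Hypothetical model identifiers; substitute whatever your gateway expects.
SONNET_5 = "sonnet-5"
GPT_5_5 = "gpt-5.5"
OPUS_4_8 = "opus-4.8"


def pick_model(workload, latency_sensitive=False, premium_budget=False):
    """Choose a model following the workload-based rules described above."""
    # The hardest coding problems go to Opus 4.8, but only when budget allows.
    if workload is Workload.HARD_CODING and premium_budget:
        return OPUS_4_8
    # Science reasoning, competition math, and latency-sensitive work go to GPT-5.5.
    if latency_sensitive or workload in (
        Workload.SCIENCE_REASONING,
        Workload.COMPETITION_MATH,
    ):
        return GPT_5_5
    # Everything else defaults to Sonnet 5 as the workhorse.
    return SONNET_5


print(pick_model(Workload.AGENTIC_CODING))
print(pick_model(Workload.COMPETITION_MATH))
print(pick_model(Workload.HARD_CODING, premium_budget=True))
```

Checking the Opus rule first is a design choice: it means a hard coding task with premium budget stays on Opus 4.8 even when flagged latency-sensitive. Teams that prioritize speed would reorder the checks.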

What we do not know

Three days of production data leaves most questions unanswered.

Anthropic did not publish GPQA Diamond, MMLU-Pro, MATH, or AIME scores for Sonnet 5. The absence of a GPQA score, where GPT-5.5 posts 93.6%, is conspicuous: it is the single benchmark where Anthropic most likely trails, and omitting it is consistent with a pattern of publishing the evaluations that favor your model.¹² We have documented this pattern before in self-reported benchmark comparisons.

Neither model has accumulated enough votes on the LMSYS Chatbot Arena for a reliable Elo ranking as of July 3, 2026, three days after Sonnet 5’s launch.⁹

GPT-5.5 carries an 86% hallucination rate on Artificial Analysis’s AA-Omniscience benchmark, which penalizes confident wrong answers specifically.¹¹ OpenAI’s own system card, as reported by Karozieminski, cites a 23% improvement in claim-level accuracy over GPT-5.4. The two figures reflect different methodologies, and OpenAI’s framing is defensible. The AA-Omniscience number is independently measured and is the only hallucination metric that could put both models in a comparable framework, but Sonnet 5’s rate on it has not been published.⁷

Early developer reports from CodeRabbit describe Sonnet 5 as “clearer but chattier,” prone to rewriting plans mid-task, and consuming more tokens per session than expected.¹² These are three-day-old impressions from a small sample, not settled findings.

Frequently Asked Questions

Does Sonnet 5 replace Sonnet 4.6?

No. Anthropic states Sonnet 4.6 remains available at unchanged pricing. Sonnet 5 is the new default for Claude consumer plans but Sonnet 4.6 stays accessible via the API. Teams with existing Sonnet 4.6 integrations can migrate on their own schedule.¹ Teams weighing the upgrade should read our guide to staying on Sonnet 4.6.

Is Sonnet 5’s introductory pricing worth migrating for?

At $2/$10 per million tokens, Sonnet 5 is cheaper per token than Sonnet 4.6 ($3/$15). But the new tokenizer produces roughly 30% more tokens for the same text, making the effective cost roughly similar to what Sonnet 4.6 users pay today. The introductory period runs through August 31, 2026, after which Sonnet 5 and Sonnet 4.6 share the same $3/$15 rate, with Sonnet 5 costing more per character due to the tokenizer.³⁵⁶

How does Sonnet 5 compare to Opus 4.8?

Opus 4.8 retains a clear lead on SWE-bench Pro (69.2% vs. 63.2%) and is Anthropic’s highest-capability generally available model. Sonnet 5 beats Opus 4.8 on Terminal-Bench 2.1 (80.4% vs. 74.6%) and posts a slightly higher GDPval-AA v2 Elo (1618 vs. 1615). For most coding workloads Opus 4.8 is still the stronger Anthropic model; Sonnet 5 offers a lower-cost alternative that approaches but does not match Opus 4.8 on the hardest tasks. See our Opus 4.8 analysis for the full benchmark picture.¹¹⁰

Is GPT-5.5 worth 2.5x the per-token cost of Sonnet 5?

Yes, for workloads where GPT-5.5’s strengths matter: GPQA-level reasoning, math competition performance, throughput speed, or Codex CLI integration. For general coding and agentic workloads where Sonnet 5 leads on the shared benchmarks, the extra cost is harder to justify. The 2.5x multiple applies to input tokens at the introductory rate; output is 3x, and at standard pricing the multiples fall to roughly 1.7x and 2x. See our analysis of LLM token economics for why per-token rate alone is not a reliable budget unit.⁴⁷

Can I use both models through a single API?

Yes. Both are available through OpenRouter¹³, and Sonnet 5 is also available on AWS Bedrock, Google Cloud, and Microsoft Foundry. GPT-5.5 runs on Azure, AWS Bedrock, and Google Cloud. For teams already using multi-provider routing, adding Sonnet 5 as a routing target requires no structural changes.