Anthropic released claude-opus-4-8 on May 28, 2026.2 The pricing is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens.1 That pricing parity is the central fact in the upgrade decision. You are not trading cost for quality. You are trading a quality floor for a higher one, with no new line item on the bill.
The question for a coding team is whether the quality improvement is large enough to matter in practice, and whether there are cases where Opus 4.8 still falls short of the competition.
What Opus 4.8 Actually Improved on Coding Tasks
The headline number is SWE-Bench Pro, Anthropic’s primary agentic coding benchmark for this release. Opus 4.8 scores 69.2% versus 64.3% for Opus 4.71, a 4.9-point gain. For context, GPT-5.5 scores 58.6% and Gemini 3.1 Pro scores 54.2% on the same benchmark.1 On this specific measure, Opus 4.8 leads the field by a wide margin.
Beyond the aggregate score, Anthropic reports that Opus 4.8 is four times less likely than its predecessor to allow flaws in code.1 That claim is not a benchmark percentage. It is a reliability characterization. If your team’s current pain point is subtle errors that slip through agent-generated patches, this is the number most directly relevant to your workflow.
Terminal-Bench 2.1, which tests agentic terminal coding, tells a different story. Opus 4.8 scores 74.6%1, up from Opus 4.7’s 66.1%.1 That is a substantial 8.5-point improvement over the prior generation. GPT-5.5, however, scores 78.2%1 on Terminal-Bench, placing it ahead of Opus 4.8 on this benchmark. That gap is real and should factor into any evaluation for teams whose work centers on terminal-heavy or shell-heavy workflows.
On Humanity’s Last Exam with tools, Opus 4.8 scores 57.9% versus 54.7% for Opus 4.7.1 On OSWorld-Verified (agentic computer use), the scores are 83.4% versus 82.8%.1 The GDPval-AA knowledge work score is 1890 versus 1753.1 Across these measures, Opus 4.8 consistently outperforms its predecessor, though the margins vary.
The Fast Mode Option
Opus 4.8 ships with a fast mode priced at $10 per million input tokens and $50 per million output tokens.1 That is double the standard rate in exchange for approximately 2.5x speed.1 Anthropic notes that fast mode is three times cheaper than it was for previous fast-mode models.1
For a team running CI-integrated code review or quick iteration loops where latency is the primary constraint, fast mode represents a genuine option that did not exist at this price point before. At double the standard token cost, the economics work out if a 2.5x speed improvement lets you collapse two review cycles into one.
Standard mode remains at the same $5/$25 pricing as Opus 4.7.3
When to Upgrade
The case for upgrading now is strong when your team’s dominant use case aligns with what Opus 4.8 leads on. SWE-Bench Pro, the agentic coding benchmark where Opus 4.8 scores 69.2%1 against GPT-5.5’s 58.6%1, covers the class of tasks that look like: find a bug in a real repository, write a fix, verify it does not break the test suite. If that describes how your team uses Claude Code day-to-day, the combination of the benchmark lead and the four-times-lower code flaw rate1 points toward upgrading.
Anthropic also characterizes Opus 4.8 as more likely to flag uncertainties and less likely to make unsupported claims, and as able to work independently for longer before requiring human check-ins.1 For teams running multi-step agentic workflows where an agent might work through a large refactor or a long debugging session, the extended autonomy claim is relevant. The practical test is whether your current workflows hit reliability walls at complex task boundaries.
The same pricing as 4.7 removes the standard reason to hold off.3 Unless your team has a specific reason to distrust the new model or has validated a workflow that depends on 4.7’s particular behavior, upgrading carries no cost penalty.
When to Hold or Mix Models
GPT-5.5’s 78.2% Terminal-Bench score1 versus Opus 4.8’s 74.6%1 is the case for pausing before a full switch. A 3.6-point gap on a single benchmark is not large in absolute terms, but Terminal-Bench covers agentic tasks that many developer workflows depend on: shell scripting, command-line debugging, and multi-step terminal operations. Teams whose workloads are heavily terminal-oriented should run their own evaluation on representative tasks before committing to Opus 4.8 as their sole model.
The cleanest approach for teams with mixed workflows is a routing layer. Use Opus 4.8 for repository-level code tasks, multi-file refactors, and code review. Evaluate whether GPT-5.5 produces better results for terminal-heavy scripts and shell automation. Neither model leads on every task type, and the pricing difference between providers may favor one path or the other depending on usage volume.
Opus 4.8’s knowledge cutoff is January 2026.2 For teams working with APIs, libraries, or frameworks that have had significant changes in the first half of 2026, that cutoff matters regardless of benchmark scores.
What the Dynamic Workflows Preview Adds
Alongside Opus 4.8, Anthropic shipped a Claude Code research preview called dynamic workflows, which allows a single session to run many parallel subagents.1 This is separate from the model itself and is in research preview status. For teams using Claude Code for large-scale tasks like analyzing an entire codebase or running parallel test generation across many files, the parallel subagent capability extends what is achievable in a session. Whether this preview feature reaches stable release and at what pricing is not yet established.
Frequently Asked Questions
Does Opus 4.8 cost more than Opus 4.7?
Standard pricing is identical at $5 per million input tokens and $25 per million output tokens.1 There is no cost increase for upgrading.
What is the model ID to use in the API?
The API model ID is claude-opus-4-8.2
Is the SWE-Bench Pro gain large enough to matter in production?
A 4.9-point gain (64.3% to 69.2%1) is a concrete improvement on a benchmark that measures real repository tasks. Whether that translates to your specific codebase depends on task complexity, repository structure, and how representative the benchmark tasks are of your actual work. The four-times-lower code flaw rate1 is a stronger indicator for teams prioritizing reliability in automated code changes.
Should a team use Opus 4.8 for all AI coding tasks?
Not necessarily. Terminal-Bench 2.1 shows GPT-5.5 at 78.2%1 versus Opus 4.8 at 74.6%1. Teams with significant terminal-focused workloads have reason to evaluate both models rather than committing exclusively to one.
What context window does Opus 4.8 support?
One million input tokens with up to 128,000 output tokens.2 The context window is unchanged from Opus 4.7.