Anthropic launched Claude Fable 5 on June 9, 2026, describing it as state-of-the-art on nearly all tested benchmarks of AI capability [Anthropic]. The announcement’s per-benchmark results table was not fully retrievable from cached sources. What is independently verifiable comes from the FrontierCode leaderboard, Cognition’s benchmark methodology, and the Opus 4.8 announcement data. Those sources carry a set of qualifiers, private methodologies, and missing context that most launch-day coverage will skip.
What Does FrontierCode Actually Measure?
FrontierCode, launched by Cognition on June 8, is a code-generation benchmark built around a specific question: would a human reviewer merge the model’s pull request? It uses 150 tasks drawn from 36 open-source repositories, evaluated on correctness, test quality, scope discipline, style, and adherence to codebase standards [Cognition].
The benchmark is split into three tiers. Diamond contains the 50 hardest tasks, Main covers 100, and Extended includes all 150 [Cognition]. Cognition reports that more than 20 open-source maintainers spent over 40 hours per task building the evaluation criteria [Cognition].
FrontierCode is a deliberate attempt to fix a known problem with SWE-Bench Verified, the dominant coding benchmark. As top models push past 70% on SWE-Bench Verified, Latent Space’s AINews reports that more than half of SWE-Bench outputs are unmergeable: they pass automated tests but fail human code review. Cognition claims an 81% lower false-positive rate than SWE-Bench Pro [Cognition].
Where Do Models Actually Land on FrontierCode Diamond?
As of June 8, the BenchLM FrontierCode leaderboard shows Claude Opus 4.8 leading Diamond at 13.4%, followed by GPT-5.5 at 6.3%, Claude Opus 4.7 at 5.2%, and Gemini 3.1 Pro at 4.7% [BenchLM]. No model solves more than 7 of the 50 Diamond tasks.
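To put those percentages in terms of tasks, the conversion below multiplies each reported Diamond score by the 50-task tier size. This is a minimal sketch: the scores are the BenchLM figures quoted above, the function name is illustrative, and the results are left unrounded because the source does not say how scores are aggregated across runs.

```python
# Diamond scores (percent) as quoted from the BenchLM leaderboard.
DIAMOND_TASKS = 50
diamond_scores = {
    "Claude Opus 4.8": 13.4,
    "GPT-5.5": 6.3,
    "Claude Opus 4.7": 5.2,
    "Gemini 3.1 Pro": 4.7,
}

def implied_tasks(score_pct: float, n_tasks: int = DIAMOND_TASKS) -> float:
    """Convert a percentage score into the implied number of solved tasks."""
    return score_pct / 100 * n_tasks

for model, pct in diamond_scores.items():
    print(f"{model}: {implied_tasks(pct):.2f} of {DIAMOND_TASKS} tasks")
```

The leading 13.4% works out to 6.7 tasks, which is where the "no more than 7" ceiling comes from.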
GPT-5.5 achieves its 6.3% using up to 4× fewer tokens than Opus 4.8, which is worth noting if you care about cost per correct answer rather than raw accuracy [BenchLM].
One structural detail: the BenchLM leaderboard conflates model choice with agent tooling (Claude Code, Codex, Gemini CLI, mini-swe-agent, Devin). The Diamond scores are not pure model-only rankings. BenchLM treats FrontierCode as “display only,” excluding it from the overall scoring formula because tasks are private and rows mix model and tooling variables [BenchLM].
What Does “Medium Effort” Actually Mean?
Anthropic introduced effort control alongside Opus 4.8, letting users select how much reasoning the model applies to a task [Anthropic]. Higher effort means more tokens spent on chain-of-thought reasoning, and higher cost.
For practitioners, the effort axis matters more than raw leaderboard position. A model scoring moderately lower at lower effort but using several times fewer tokens than a competitor at higher effort may still be the better production choice. The leaderboard does not normalize for this.
CursorBench: What the Opus 4.8 Data Shows
CursorBench is an agentic coding benchmark. In the Opus 4.8 announcement, Anthropic reported that Opus 4.8 exceeds prior Opus models across every effort level on CursorBench. CursorBench’s methodology, task composition, and public leaderboard were not available in the verified sources. This is a partner benchmark, not an independent third-party evaluation.
What Anthropic Did Not Publish
The launch announcement does not include per-suite score tables broken down by benchmark, effort level, or task tier. For FrontierCode specifically, there is no published breakdown showing Fable 5’s performance across Diamond, Main, and Extended tiers. There is no public comparison at matched effort levels across models.
This is standard launch-day practice. Vendor announcements highlight the strongest available number and omit the context that would make it comparable. In this case, the omissions are particularly material because:
- Effort control is new. The ability to select reasoning intensity is a confound that did not exist in earlier benchmark comparisons. Reporting scores without showing the curve across effort levels hides the cost-accuracy tradeoff.
- Agent tooling matters. FrontierCode scores on BenchLM combine model and tooling. The leaderboard itself acknowledges this by excluding FrontierCode from its aggregate scoring.
- Partner benchmarks resist replication. Benchmarks run by commercial partners without public methodology cannot be independently verified, regardless of whether the scores are accurate.
How to Read Vendor Launch Benchmarks
The pattern is consistent across model launches: highlight the benchmarks where the model leads, report at the effort level that produces the best headline-to-cost ratio, and cite partner benchmarks that lack public methodology. Anthropic is not unusual here. But the practical implications are worth stating flatly.
Check the denominator. A “state of the art” score on a benchmark where the best model is at 13% [BenchLM] tells you the problem is hard, not that the model is good. FrontierCode Diamond is useful precisely because it exposes the gap between benchmark performance and production readiness.
Normalize for effort and cost. Two models can have the same score at very different token costs. GPT-5.5 scores 6.3% on FrontierCode Diamond using up to 4× fewer tokens than Opus 4.8 needs for its 13.4%, so GPT-5.5 may be the better production choice depending on your throughput and budget [BenchLM].
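One way to make that comparison concrete is to divide relative token usage by score. The sketch below is rough and rests on two assumptions that are not in the source: that the "up to 4×" token figure applies as a flat 4× ratio across the whole Diamond run, and that cost scales linearly with tokens. `relative_tokens` is a made-up unit with GPT-5.5 set to 1.0.

```python
from dataclasses import dataclass

@dataclass
class Row:
    model: str
    score_pct: float        # Diamond score quoted from BenchLM
    relative_tokens: float  # ASSUMPTION: GPT-5.5 = 1.0, Opus 4.8 = 4.0 (upper bound)

    def tokens_per_point(self) -> float:
        """Relative tokens spent per percentage point of score (lower is cheaper)."""
        return self.relative_tokens / self.score_pct

rows = [
    Row("Claude Opus 4.8", 13.4, 4.0),
    Row("GPT-5.5", 6.3, 1.0),
]

# Cheapest per point first.
for row in sorted(rows, key=Row.tokens_per_point):
    print(f"{row.model}: {row.tokens_per_point():.3f} relative tokens per point")

ratio = rows[0].tokens_per_point() / rows[1].tokens_per_point()
print(f"Opus 4.8 spends about {ratio:.2f}x the tokens per point under this assumption")
```

Under these assumptions GPT-5.5 is cheaper per point but solves fewer than half as many tasks. Which matters more depends on what an unsolved task costs you.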
Distinguish partner benchmarks from independent ones. Hebbia, Cursor, and Databricks all have commercial relationships with Anthropic. Their benchmarks may be rigorous, but they are not independent third-party evaluations.
Wait for the leaderboard. Public benchmarks with open submissions (FrontierCode on BenchLM, SWE-Bench Verified) will populate with new model scores within days. Those numbers, produced under a shared evaluation setup, are more reliable than launch-day press releases, with the caveat that BenchLM’s FrontierCode rows still mix model and agent tooling.
FrontierCode itself is the most interesting artifact here. A benchmark where the best model scores 13% [BenchLM] on the hard tier, with an 81% lower false-positive rate than SWE-Bench Pro [Cognition], addresses a real gap in how the field evaluates code generation.
Frequently Asked Questions
What grading methods make FrontierCode harder to game than SWE-Bench?
FrontierCode grades on six axes (behavioral correctness, regression safety, build/lint/style cleanliness, test correctness, scope discipline, and code quality) rather than just test pass rate. Its reverse-classical testing requires agent-written tests to fail on intentionally broken code, which catches models that generate vacuous passing tests. Cognition also uses an LLM tool called mutagent for adaptive classical grading. These methods target the specific failure mode where SWE-Bench models pass all tests but produce unmergeable code.
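The reverse-classical idea can be illustrated with a toy check: a test counts only if it passes on a reference implementation and fails on a deliberately broken one. This is a sketch of the general principle, not Cognition's harness; every name here (`accepts_test`, `reference_add`, `broken_add`) is hypothetical.

```python
from typing import Callable

Impl = Callable[[int, int], int]
Test = Callable[[Impl], None]  # a test raises AssertionError on failure

def reference_add(a: int, b: int) -> int:
    return a + b

def broken_add(a: int, b: int) -> int:
    # Intentionally wrong implementation used to probe the test.
    return a - b

def passes(test: Test, impl: Impl) -> bool:
    """Run a test against an implementation and report whether it passed."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False

def accepts_test(test: Test, good: Impl, bad: Impl) -> bool:
    """A test is meaningful only if it passes on good code and fails on broken code."""
    return passes(test, good) and not passes(test, bad)

def vacuous_test(add: Impl) -> None:
    # Passes for both implementations, because 0 + 0 == 0 - 0.
    assert add(0, 0) == 0

def real_test(add: Impl) -> None:
    assert add(2, 3) == 5

print(accepts_test(vacuous_test, reference_add, broken_add))  # False
print(accepts_test(real_test, reference_add, broken_add))     # True
```

The vacuous test is rejected even though it passes on correct code, which is the behavior a plain pass-rate metric cannot detect.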
What did the Hebbia and Databricks partner benchmarks measure for Opus 4.8?
On Hebbia’s financial-document benchmark, Opus 4.8 matched Opus 4.7’s output quality while improving citation precision and token efficiency. On Databricks’ Genie benchmark, Opus 4.8 ran at 61% lower token cost than Opus 4.7 for equivalent tasks. Both benchmarks are run by Anthropic commercial partners, so the scores cannot be treated as independent validation, but they do suggest the cost per quality unit is dropping within the Claude family.
Why did Anthropic report Fable 5’s FrontierCode score at medium effort?
Reporting at medium effort is a cost-positioning signal. If Fable 5 matches or beats Opus 4.8’s Diamond score at medium effort, Anthropic can claim comparable results at lower token cost than competitors running at higher effort. This framing sidesteps the question of what Opus 4.8 or GPT-5.5 would score at the same medium effort level, since those rows are not on the leaderboard. The medium qualifier also means Fable 5’s actual ceiling on FrontierCode Diamond is unknown: it could be higher, or the curve could be flat above medium.
What happens when a benchmark saturates the way SWE-Bench Verified has?
When a benchmark saturates, it stops differentiating between top models: every frontier system looks roughly equivalent, and the benchmark can no longer detect regressions or improvements in the capabilities it was built to measure. FrontierCode was designed in direct response to SWE-Bench’s saturation, and its Diamond tier, where no model exceeds 14%, has substantial headroom. The structural risk is that FrontierCode itself will saturate as models improve, requiring the same kind of successor benchmark within a few model generations.