FormulaCode's 957-Task Benchmark Catches Frontier Agents Failing at Real-Codebase Performance Optimization

A new benchmark from the FormulaCode project finds that even the best LLM-agent configurations still trail human experts at repository-scale performance optimization, a gap that SWE-Bench leaderboards were never designed to detect. The FormulaCode paper¹, accepted at ICML 2026 after a revised v2 was posted on May 14, mines 957¹ bottlenecks from 70 scientific Python repositories and scores agents on multi-objective speedup across an average of 264.6¹ community workloads per task. Unlike binary pass/fail scoring, FormulaCode asks whether a patch actually makes code faster, and by how much. The answer so far: somewhat faster, but not fast enough. No agent configuration on the public leaderboard² exceeds the human expert baseline.

Why SWE-Bench Verified Was Never Enough

SWE-Bench Verified has become the default credential for agent frameworks. OpenHands, Aider, Devin, and CrewAI all cite their percentages in marketing materials. But Verified is a binary correctness test: the patch either passes the hidden test suite or it fails. It does not measure whether the fix improved runtime, reduced memory pressure, or avoided regressing performance on adjacent workloads.

The blind spot is widening. Recent SWE-Bench Pro results put Claude Opus 4.8 at 69.2%⁶ (up from 64.3%³ for Opus 4.7), while GPT-5.5 trails at 58.6%⁶. The gap between a model’s Verified score and its Pro score persists even as absolute numbers improve: Pro adds harder tasks, but it still asks only “is it correct?” and never “is it faster?” An agent could replace an O(n) algorithm with an O(n²) implementation and still earn a green checkmark, provided the test suite’s timeout is generous enough.
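To make that failure mode concrete, here is a minimal sketch in Python. The function names, the test cases, and the input size are all hypothetical; nothing here comes from SWE-Bench or FormulaCode. Both implementations satisfy the same binary oracle, and only a timing measurement tells them apart.

```python
import time

def has_duplicates_linear(items):
    """O(n): one pass with a set of values already seen."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

def has_duplicates_quadratic(items):
    """O(n^2): compares every pair; same answers, far more work."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def passes_tests(fn):
    """Stand-in for a binary correctness oracle (hypothetical test cases)."""
    return (
        fn([1, 2, 3, 2]) is True
        and fn([1, 2, 3]) is False
        and fn([]) is False
    )

def time_call(fn, data):
    """Wall-clock seconds for a single call."""
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start

if __name__ == "__main__":
    data = list(range(2000))  # no duplicates, so neither version exits early
    for fn in (has_duplicates_linear, has_duplicates_quadratic):
        print(fn.__name__, "passes:", passes_tests(fn), "seconds:", time_call(fn, data))
```

A pass/fail harness sees two identical results here. The difference only appears in the seconds column, which is the column a correctness benchmark does not record.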

Anthropic’s release of Opus 4.8 on May 28, 2026 sharpened the code-reliability claim specifically: the model is four times less likely than Opus 4.7 to allow flaws in code⁶, and its SWE-Bench Pro score of 69.2%⁶ leads the Opus tier. Anthropic’s June 9, 2026 launch of Claude Fable 5⁷ placed a new Mythos-class model above Opus 4.8 in the overall lineup, though Anthropic has not published numeric SWE-Bench scores for Fable 5. That matters for the FormulaCode lens because “fewer flaws” and “faster code” are different objectives. A model that halves defect rates may still produce correct-but-slow patches, exactly the failure mode FormulaCode was designed to surface. See also the broader benchmark landscape in AI Code Generation Benchmarks 2026: Which Model Actually Writes Better Code? and the structural limits of the leaderboard in SWE-bench Verified Explained: What the Coding Agent Leaderboard Actually Measures (and What It Misses).

What FormulaCode Measures (and How)

FormulaCode starts from a different premise. The benchmark mines 957¹ performance bottlenecks from 70 scientific Python repositories on GitHub, each paired with an expert-authored patch and an average of 264.6¹ community-maintained performance workloads, according to the project’s arXiv abstract¹. Scoring is multi-objective: an agent must speed up the target workload without regressing others, and the “advantage” metric penalizes lopsided trade-offs.
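The idea of scoring one patch against many workloads can be sketched as follows. This is an illustration under stated assumptions, not FormulaCode's actual scoring code: the geometric-mean aggregate, the 5% regression tolerance, the workload names, and the timings are all hypothetical, and the benchmark's own “advantage” formula is not reproduced here.

```python
from math import exp, log

def speedups(baseline_s, patched_s):
    """Per-workload speedup: baseline time / patched time (>1 means faster)."""
    return {name: baseline_s[name] / patched_s[name] for name in baseline_s}

def geometric_mean(values):
    """Geometric mean, the usual way to average ratios."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))

def summarize(baseline_s, patched_s, tolerance=0.05):
    """Aggregate speedup plus the workloads slowed by more than `tolerance`."""
    ratios = speedups(baseline_s, patched_s)
    regressed = sorted(name for name, r in ratios.items() if r < 1.0 - tolerance)
    return geometric_mean(ratios.values()), regressed

# Hypothetical timings in seconds, not FormulaCode data.
baseline = {"target": 2.0, "neighbor_a": 1.0, "neighbor_b": 0.5}
patched = {"target": 1.0, "neighbor_a": 1.0, "neighbor_b": 0.625}

score, regressed = summarize(baseline, patched)
print(f"geometric-mean speedup: {score:.3f}, regressed: {regressed}")
```

In this toy case the target workload runs twice as fast, yet one neighbor slows down by 20%, so a single-workload speedup figure would overstate the patch. Reporting the aggregate together with the list of regressed workloads keeps that trade-off visible.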

The patches are also larger in scope than typical SWE-Bench fixes. The average FormulaCode patch spans 5.2x more lines of code across 1.29x more files and 1.01x more functions than the average SWE-Bench patch, according to the full v2 paper⁴. This is not a matter of swapping a list comprehension for a generator expression. It is repo-level optimization, the kind that requires understanding call graphs, data layout, and how a change in one module propagates through downstream benchmarks.

The Leaderboard: No Agent Beats Human Experts

The FormulaCode leaderboard² is unambiguous. The best-performing configuration, OpenHands with Claude 4.0 Sonnet, achieves a 1.0553x speedup and an advantage score of -0.0096. The human expert baseline sits at 1.1193x speedup and 0.0000 advantage. No agent beats the humans.
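A quick back-of-the-envelope reading of those two speedup figures, as a sketch. It treats the aggregate speedups as plain ratios against unpatched code at 1.0x, which is an assumption about how they compose; neither derived number is a metric the benchmark itself reports.

```python
# Figures as quoted above from the FormulaCode leaderboard.
agent_speedup = 1.0553   # OpenHands + Claude 4.0 Sonnet
expert_speedup = 1.1193  # human expert baseline

# How much faster the expert patches are relative to the agent's patches.
expert_over_agent = expert_speedup / agent_speedup

# Share of the expert's gain over unpatched code (1.0x) that the agent recovers.
share_of_expert_gain = (agent_speedup - 1.0) / (expert_speedup - 1.0)

print(f"expert vs. agent: {expert_over_agent:.4f}x")
print(f"share of expert gain recovered: {share_of_expert_gain:.1%}")
```

Read this way, the top agent configuration recovers a bit less than half of the improvement the human patches deliver.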

Where Agents Excel, and Where They Fold

The v2 paper⁴ breaks down where agents succeed and where they stall. Agents perform well on parallelization and batching strategies, tasks where the structural change is obvious and the risk of cross-workload regression is low. They struggle with lower-level library implementations and vectorized primitives, the exact optimizations that often yield the largest speedups in scientific Python.

Agents are also more conservative than human experts when negotiating trade-offs across multiple workloads. A human maintainer might accept a slight regression in an edge-case benchmark to win a 2x speedup on the hot path. Agents, trained on correctness-first objectives, tend to avoid changes that could trigger any red flag. The result is safer patches that leave performance on the table.

What This Means for Framework Marketing

SWE-Bench Verified scores have become a shorthand for “this agent can code.” Vendors like OpenHands and Cognition (Devin) routinely cite them to imply general engineering competence. FormulaCode shows that implication is false. Binary correctness and performance optimization are different disciplines. A framework that aces SWE-Bench Verified may still be unable to profile a codebase, identify a bottleneck, and verify that its fix improved real-world throughput.

What Vendors Must Build Next

If framework vendors want to claim performance-engineering competence, they need to wire in profiling tooling and workload-level scoring. That means integrating CPU and memory profilers into the agent loop, not just test runners. It means evaluating patches against multiple workloads, not just a hidden test suite. And it means reporting multi-objective metrics like FormulaCode’s advantage score, not just binary pass rates or single-workload speedups.
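As a rough sketch of what wiring a profiler into the agent loop could look like, the helper below uses only Python's standard library (`cProfile`, `pstats`, `tracemalloc`). The `profile_workload` and `example_workload` names are hypothetical and do not belong to any agent framework's API; a real integration would feed the report back to the model before it proposes a patch.

```python
import cProfile
import io
import pstats
import tracemalloc

def profile_workload(workload, *args, top=5):
    """Run `workload` once; return its result, a CPU report, and peak bytes."""
    profiler = cProfile.Profile()
    tracemalloc.start()
    profiler.enable()
    try:
        result = workload(*args)
    finally:
        profiler.disable()
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()

    # Render the `top` entries by cumulative time into a string report.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue(), peak_bytes

def example_workload(n):
    """Hypothetical workload standing in for a repository benchmark."""
    return sum(i * i for i in range(n))

result, cpu_report, peak_bytes = profile_workload(example_workload, 100_000)
print(cpu_report)
print("peak traced memory (bytes):", peak_bytes)
```

Running both tracers at once keeps the sketch short but inflates the CPU timings, so a production harness would likely collect CPU and memory profiles in separate runs.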

The benchmark is not a rejection of agentic coding. It is a specification for what the next generation of benchmarks must measure. ICML 2026 will host the paper in Seoul this July, after first author Atharva Sehgal announced the acceptance on LinkedIn⁵ in mid-May. By then, the framework vendors who adjusted their evaluation pipelines will have a credible answer. The ones who kept quoting SWE-Bench Verified will still be selling bug fixes and calling it optimization.

Frequently Asked Questions

Would FormulaCode’s agent failure patterns show up in non-Python codebases?

Not necessarily with the same profile. The 957 tasks are all scientific Python, where the largest speedups come from NumPy-vectorized primitives and low-level library rewrites, the exact category where agents fold. In I/O-bound web services or compiled-language systems, dominant bottlenecks like query planning, lock contention, or cache-line alignment are structurally different, so agents might fail for unrelated reasons or perform comparatively better.

Which configuration has the highest raw speedup, and why doesn’t it top the leaderboard?

OpenHands + GPT-5 leads on raw speedup at 1.0825x, higher than the Claude 4.0 Sonnet configuration’s 1.0553x, yet its advantage score is worse, meaning it degrades more neighboring workloads while optimizing the target. The leaderboard ranks by advantage, not raw speedup, which is why the Claude entry sits above GPT-5 despite a lower top-line figure.

Was ICML 2026 the first venue to review the FormulaCode paper?

No. The paper’s OpenReview record shows a prior submission to an ICLR 2026 workshop before its ICML acceptance. That progression suggests the multi-objective, workload-level evaluation framing went through more than one review cycle before a top-tier conference signed off.

What would a concrete SWE-Bench upgrade look like in response to this critique?

The most direct fix would be augmenting SWE-Bench’s test-runner oracle with a performance regression check, running each submitted patch against a latency profile of the original code and flagging fixes that pass tests but introduce measurable slowdowns. Without this, SWE-Bench will continue awarding passing grades to patches that are functionally correct but slower than the code they replace.
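A minimal sketch of such a gate, assuming a 10% slowdown tolerance and using the original function's outputs as a stand-in for the hidden test suite. The `gate` helper, the threshold, and both example functions are hypothetical and are not part of SWE-Bench.

```python
import statistics
import timeit

def median_seconds(fn, args, repeats=5, number=20):
    """Median per-call runtime over `repeats` batches of `number` calls."""
    batches = timeit.repeat(lambda: fn(*args), repeat=repeats, number=number)
    return statistics.median(batches) / number

def gate(original, patched, test_inputs, workload_args, max_slowdown=1.10):
    """Return (tests_pass, slowdown, accepted) for a candidate patch."""
    # Differential check: the original's outputs stand in for a hidden test suite.
    tests_pass = all(patched(*a) == original(*a) for a in test_inputs)
    slowdown = median_seconds(patched, workload_args) / median_seconds(original, workload_args)
    return tests_pass, slowdown, tests_pass and slowdown <= max_slowdown

def total_original(values):
    return sum(values)

def total_patched(values):
    """Hypothetical 'fix': same result, but sorts first and loops in Python."""
    total = 0
    for v in sorted(values):
        total += v
    return total

tests_pass, slowdown, accepted = gate(
    total_original,
    total_patched,
    test_inputs=[([1, 2, 3],), ([],), ([-5, 5],)],
    workload_args=(list(range(5000, 0, -1)),),
)
print(f"tests pass: {tests_pass}, slowdown: {slowdown:.1f}x, accepted: {accepted}")
```

The patched function passes every correctness check and is still rejected, because the gate treats a measurable slowdown as a failure in its own right. Real harnesses would need more careful timing (warm-up, isolated hardware, multiple workloads) than this single-machine comparison.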