A new benchmark from the FormulaCode project finds that even the best LLM-agent configurations still trail human experts at repository-scale performance optimization, a gap that SWE-Bench leaderboards were never designed to detect. The FormulaCode paper [1], accepted at ICML 2026 after a revised v2 posted on May 14, mines 957 [1] bottlenecks from 70 scientific Python repositories and scores agents on multi-objective speedup across an average of 264.6 [1] community workloads per task. Unlike binary pass/fail scoring, FormulaCode asks how much faster a patch actually makes the code. The answer so far: not as much as a human expert would. No agent configuration on the public leaderboard [2] exceeds the human expert baseline.

Why SWE-Bench Verified Was Never Enough

SWE-Bench Verified has become the default credential for agent frameworks. OpenHands, Aider, Devin, and CrewAI all cite their percentages in marketing materials. But Verified is a binary correctness test: the patch either passes the hidden test suite or it fails. It does not measure whether the fix improved runtime, reduced memory pressure, or avoided regressing performance on adjacent workloads.

The blind spot is widening. Recent SWE-Bench Pro results show Claude Opus 4.7 falling from 87.6% [3] on Verified to 64.3% [3] on Pro, while GPT-5.4 drops from roughly 85% [3] to 59.1% [3], according to TokenMix’s benchmark roundup [3]. The Pro variant adds harder tasks, but it still asks only “is it correct?” and not “is it faster?” A passing agent could replace an O(n) algorithm with an O(n²) implementation and still score a green checkmark if the test suite’s timeout is generous enough.
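To make that failure mode concrete, here is a minimal Python sketch. The function names and the tiny `passes_binary_test` oracle are hypothetical stand-ins, not part of SWE-Bench. The point is only that an output-only check cannot tell the two implementations apart.

```python
import timeit

def unique_linear(items):
    """Order-preserving dedup in O(n): set membership is constant time on average."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def unique_quadratic(items):
    """Same output in O(n^2): each list membership check scans the output so far."""
    out = []
    for x in items:
        if x not in out:
            out.append(x)
    return out

def passes_binary_test(fn):
    """Hypothetical pass/fail oracle: it inspects the output and never the runtime."""
    return fn([3, 1, 3, 2, 1]) == [3, 1, 2]

# Both implementations pass, so a binary benchmark scores them identically.
print(passes_binary_test(unique_linear), passes_binary_test(unique_quadratic))

# Timing them on a larger input exposes the difference the oracle cannot see.
data = list(range(2000)) * 2
t_linear = timeit.timeit(lambda: unique_linear(data), number=3)
t_quadratic = timeit.timeit(lambda: unique_quadratic(data), number=3)
print(f"linear: {t_linear:.4f}s  quadratic: {t_quadratic:.4f}s")
```

Exact timings depend on the machine, but the quadratic version should be markedly slower on the 4,000-item input while earning the same pass.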

What FormulaCode Measures (and How)

FormulaCode starts from a different premise. The benchmark mines 957 [1] performance bottlenecks from 70 scientific Python repositories on GitHub, each paired with an expert-authored patch and an average of 264.6 [1] community-maintained performance workloads, according to the project’s arXiv abstract [1]. Scoring is multi-objective: an agent must speed up the target workload without regressing others, and the “advantage” metric penalizes lopsided trade-offs.
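The general idea of scoring a patch across many workloads can be sketched in a few lines of Python. This is not FormulaCode's advantage formula, which the article does not spell out. It is an illustrative stand-in that reports a geometric-mean speedup alongside a list of regressed workloads. The function names, the 2% tolerance, and the timings are all assumptions made for the example.

```python
from math import prod

def per_workload_speedups(baseline_s, patched_s):
    """Speedup ratio per workload: values above 1.0 mean the patch is faster."""
    return {name: baseline_s[name] / patched_s[name] for name in baseline_s}

def geometric_mean(values):
    """Geometric mean, the usual way to average ratios such as speedups."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))

def summarize_patch(baseline_s, patched_s, tolerance=0.02):
    """Report the aggregate speedup and every workload that got slower.

    `tolerance` is an assumed noise floor: a workload counts as regressed
    only if its speedup ratio falls below 1.0 - tolerance.
    """
    ratios = per_workload_speedups(baseline_s, patched_s)
    regressed = sorted(name for name, r in ratios.items() if r < 1.0 - tolerance)
    return {"geomean_speedup": geometric_mean(ratios.values()), "regressed": regressed}

# Hypothetical timings in seconds, not taken from the benchmark.
baseline = {"hot_path": 2.0, "edge_case": 1.0, "io_heavy": 4.0}
patched = {"hot_path": 1.0, "edge_case": 1.25, "io_heavy": 4.0}

summary = summarize_patch(baseline, patched)
print(summary)
```

In this made-up case the hot path doubles in speed while one neighboring workload slows down, which is exactly the kind of lopsided trade-off a single-workload speedup number would hide.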

The patches are also larger in scope than typical SWE-Bench fixes. The average FormulaCode patch spans 5.2 more lines of code across 1.29x more files and 1.01x more functions than the average SWE-Bench patch, according to the full v2 paper [4]. This is not a matter of swapping a list comprehension for a generator expression. It is repo-level optimization, the kind that requires understanding call graphs, data layout, and how a change in one module propagates through downstream benchmarks.

The Leaderboard: No Agent Beats Human Experts

The FormulaCode leaderboard [2] is unambiguous. The best-performing configuration by advantage score, OpenHands with Claude 4.0 Sonnet, achieves a 1.0553x speedup and an advantage score of -0.0096. The human expert baseline sits at 1.1193x speedup and 0.0000 advantage. No agent beats the humans.
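A quick calculation puts those two speedup figures in more familiar terms. The sketch below treats each leaderboard number as a plain old-runtime over new-runtime ratio, which is an assumption about how the aggregate is computed, and converts it to the share of runtime removed.

```python
def runtime_reduction(speedup):
    """Fraction of runtime removed by a speedup factor (1.25x removes 0.20)."""
    return 1.0 - 1.0 / speedup

agent_speedup = 1.0553  # OpenHands with Claude 4.0 Sonnet, per the leaderboard
human_speedup = 1.1193  # human expert baseline, per the leaderboard

agent_cut = runtime_reduction(agent_speedup)
human_cut = runtime_reduction(human_speedup)

print(f"agent removes {agent_cut:.1%} of runtime")
print(f"human removes {human_cut:.1%} of runtime")
```

Under that assumption the top agent trims roughly 5% of runtime where the human expert trims roughly 11%, so the expert patch removes about twice as much.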

Where Agents Excel, and Where They Fold

The v2 paper [4] breaks down where agents succeed and where they stall. Agents perform well on parallelization and batching strategies, tasks where the structural change is obvious and the risk of cross-workload regression is low. They struggle with lower-level library implementations and vectorized primitives, the exact optimizations that often yield the largest speedups in scientific Python.

Agents are also more conservative than human experts when negotiating trade-offs across multiple workloads. A human maintainer might accept a slight regression in an edge-case benchmark to win a 2x speedup on the hot path. Agents, trained on correctness-first objectives, tend to avoid changes that could trigger any red flag. The result is safer patches that leave performance on the table.

What This Means for Framework Marketing

SWE-Bench Verified scores have become a shorthand for “this agent can code.” Vendors like OpenHands and Cognition (Devin) routinely cite them to imply general engineering competence. FormulaCode shows that implication is false. Binary correctness and performance optimization are different disciplines. A framework that aces SWE-Bench Verified may still be unable to profile a codebase, identify a bottleneck, and verify that its fix improved real-world throughput.

What Vendors Must Build Next

If framework vendors want to claim performance-engineering competence, they need to wire in profiling tooling and workload-level scoring. That means integrating CPU and memory profilers into the agent loop, not just test runners. It means evaluating patches against multiple workloads, not just a hidden test suite. And it means reporting multi-objective metrics like FormulaCode’s advantage score, not just binary pass rates or single-workload speedups.
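As a rough illustration of what wiring profilers into the loop could mean, the sketch below wraps a single workload in Python's standard `cProfile` and `tracemalloc` modules and returns a small report an agent could read before proposing a patch. The `profile_workload` helper and the sample workload are hypothetical, and `get_stats_profile` requires Python 3.9 or later. A real framework would need to handle subprocesses, native code, and measurement noise.

```python
import cProfile
import pstats
import tracemalloc

def profile_workload(workload, top_n=5):
    """Run one workload under a CPU profiler and a memory tracer.

    Returns the peak traced memory in bytes and the `top_n` functions
    ranked by cumulative time, as (function name, seconds) pairs.
    """
    profiler = cProfile.Profile()
    tracemalloc.start()
    profiler.enable()
    try:
        workload()
    finally:
        profiler.disable()
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()

    # get_stats_profile() exposes per-function timings keyed by function name.
    profile = pstats.Stats(profiler).get_stats_profile()
    ranked = sorted(
        profile.func_profiles.items(),
        key=lambda item: item[1].cumtime,
        reverse=True,
    )
    hottest = [(name, fp.cumtime) for name, fp in ranked[:top_n]]
    return {"peak_bytes": peak_bytes, "hottest": hottest}

def sample_workload():
    """Hypothetical workload: allocate a list, then reduce it."""
    squares = [i * i for i in range(200_000)]
    return sum(squares)

report = profile_workload(sample_workload)
print(report["peak_bytes"], report["hottest"])
```

Running the same helper over every workload attached to a task, before and after a patch, would give the agent the per-workload evidence that a test runner alone cannot.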

The benchmark is not a rejection of agentic coding. It is a specification for what the next generation of benchmarks must measure. ICML 2026 will host the paper in Seoul this July, after first author Atharva Sehgal announced the acceptance on LinkedIn [5] in mid-May. By then, the framework vendors who adjusted their evaluation pipelines will have a credible answer. The ones who kept quoting SWE-Bench Verified will still be selling bug fixes and calling it optimization.

Frequently Asked Questions

Would FormulaCode’s agent failure patterns show up in non-Python codebases?

Not necessarily with the same profile. The 957 tasks are all scientific Python, where the largest speedups come from NumPy-vectorized primitives and low-level library rewrites — the exact category where agents fold. In I/O-bound web services or compiled-language systems, dominant bottlenecks like query planning, lock contention, or cache-line alignment are structurally different, so agents might fail for unrelated reasons or perform comparatively better.

Which configuration has the highest raw speedup, and why doesn’t it top the leaderboard?

OpenHands + GPT-5 leads on raw speedup at 1.0825x — higher than the Claude 4.0 Sonnet configuration’s 1.0553x — yet its advantage score is worse, meaning it degrades more neighboring workloads while optimizing the target. The leaderboard ranks by advantage, not raw speedup, which is why the Claude entry sits above GPT-5 despite a lower top-line figure.

Was ICML 2026 the first venue to review the FormulaCode paper?

No. The paper’s OpenReview record shows a prior submission to an ICLR 2026 workshop before its ICML acceptance. The venue progression indicates the multi-objective, workload-level evaluation framing needed more than one review cycle before a top-tier conference signed off.

What would a concrete SWE-Bench upgrade look like in response to this critique?

The most direct fix would be augmenting SWE-Bench’s test-runner oracle with a performance regression check — running each submitted patch against a latency profile of the original code and flagging fixes that pass tests but introduce measurable slowdowns. Without this, SWE-Bench will continue awarding passing grades to patches that are functionally correct but slower than the code they replace.
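One way such a check could look is sketched below. Nothing here is an existing SWE-Bench feature: `best_time`, `check_patch`, and the 10% slowdown budget are assumptions chosen for illustration, and a production harness would need repeated runs on controlled hardware to keep noise from flipping verdicts.

```python
import timeit

def best_time(fn, repeat=5, number=10):
    """Best-of-N per-call runtime in seconds; taking the minimum dampens noise."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number

def check_patch(tests_pass, baseline_s, patched_s, max_slowdown=1.10):
    """Combine the usual correctness verdict with a latency budget.

    `max_slowdown` is an assumed threshold: the patched code may take at
    most 10% longer than the original before the patch is flagged.
    """
    if not tests_pass:
        return "fail: tests"
    if patched_s > baseline_s * max_slowdown:
        return "fail: performance regression"
    return "pass"

def original():
    return sum(range(1_000))

def patched():
    # Functionally identical result, reached by doing the work three times.
    return [sum(range(1_000)) for _ in range(3)][-1]

verdict = check_patch(
    tests_pass=(original() == patched()),
    baseline_s=best_time(original),
    patched_s=best_time(patched),
)
print(verdict)
```

The patched function here returns the right answer, so a test-only oracle would accept it. The added latency budget is what gives the harness grounds to reject it.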

Footnotes

  1. FormulaCode paper

  2. FormulaCode leaderboard

  3. TokenMix benchmark roundup

  4. FormulaCode v2 paper

  5. Atharva Sehgal on LinkedIn

Sources

  1. FormulaCode paper (primary, accessed 2026-05-18)
  2. FormulaCode leaderboard (primary, accessed 2026-05-18)
  3. TokenMix benchmark roundup (analysis, accessed 2026-05-18)
  4. FormulaCode v2 paper (primary, accessed 2026-05-18)
  5. Atharva Sehgal on LinkedIn (community, accessed 2026-05-18)
