When an AI agent “passes” SWE-bench, it means the automated test suite approved the patch. It does not mean a real engineer would accept the code. New research from METR (Model Evaluation and Threat Research) quantifies that gap for the first time: roughly half of SWE-bench-passing PRs would be rejected by the actual maintainers of those repositories—a finding that reframes two years of AI coding benchmarks.

What SWE-bench Actually Measures

SWE-bench, introduced by Princeton researchers in October 2023, set the standard for evaluating AI software engineering.1 The benchmark presents models with 2,294 real GitHub issues from twelve popular Python repositories—scikit-learn, Django, Sphinx, pytest, and others—then grades their patches using a “fail-to-pass” signal: did the tests that failed before the patch now pass after?
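
In spirit, the fail-to-pass signal is just a comparison of test outcomes before and after a patch is applied. A minimal sketch (function and field names here are illustrative, not the actual SWE-bench harness):

```python
def fail_to_pass(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """Grade a patch the way SWE-bench's signal works in spirit:
    every test that failed before must pass after (fail-to-pass),
    and no previously passing test may regress (pass-to-pass)."""
    failing_tests = [t for t, ok in before.items() if not ok]
    # All previously failing tests must now pass.
    if not all(after.get(t, False) for t in failing_tests):
        return False
    # No previously passing test may break.
    return all(after.get(t, False) for t, ok in before.items() if ok)

before = {"test_bug": False, "test_existing": True}
after = {"test_bug": True, "test_existing": True}
print(fail_to_pass(before, after))  # → True
```

Note what this signal never inspects: the diff itself. Any patch that flips the target tests green is graded identically, regardless of how it reads.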

The framing was sound. Real issues, real test suites, real repositories. When Claude 2 resolved only 1.96% of problems at launch, the benchmark correctly reflected that LLMs had significant ground to cover. When Devin hit 13.86% in March 2024, it demonstrated measurable progress.2 By 2025, frontier models were clearing 50–70% on SWE-bench Verified, OpenAI’s human-curated 500-task subset.

But “passing tests” and “shippable code” are not the same thing. SWE-bench graders don’t care about variable naming, code style, architectural consistency, or whether the patch accidentally breaks an edge case the test suite doesn’t cover. Human maintainers do.

The Maintainer Study: 24 Percentage Points of Missing Reality

METR’s March 2026 paper, “Many SWE-bench-Passing PRs Would Not Be Merged into Main,” is the most direct attack on benchmark face validity to date.3 Four active maintainers from scikit-learn, Sphinx, and pytest reviewed 296 AI-generated PRs across 95 issues—all from patches that passed SWE-bench’s automated grader—plus 47 human-written merged PRs used as a baseline.

The result: automated grader scores averaged 24.2 percentage points higher than actual maintainer merge rates (standard error: 2.7 pp). Put differently, if a model scores 50% on SWE-bench, the realistic expectation for real-world merge acceptance is closer to 26%.
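
As a back-of-envelope adjustment (using the paper's point estimate; the gap is an average across tasks, not a per-model constant):

```python
def expected_merge_rate(benchmark_score: float, gap_pp: float = 24.2) -> float:
    """Subtract METR's measured grader-vs-maintainer gap (in
    percentage points) from a SWE-bench score, flooring at zero."""
    return max(benchmark_score - gap_pp, 0.0)

print(expected_merge_rate(50.0))  # → 25.8, i.e. "closer to 26%"
```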

Even the human-written “gold” patches weren’t immune. Maintainers deemed only 68% of the human reference patches mergeable—despite those same patches scoring 100% from the automated grader. That 32-point gap for human-written code reveals something important: the benchmark’s grading signal was never designed to represent maintainer judgment. It was designed to measure whether a specific behavior changed. Those are different questions.

The trajectory compounds the problem. The automated grader’s improvement rate is outpacing maintainer-assessed improvement by 9.6 percentage points per year (standard error: 5.5 pp). Benchmark scores are accelerating faster than actual utility. The gap isn’t closing—it’s widening.

Maintainer rejections clustered into four categories:

  • Code quality issues: style violations, inconsistent naming, non-idiomatic implementations
  • Unintended side effects: patches that broke behavior the tests didn’t cover
  • Core functionality failure: solving the symptom, not the underlying problem
  • Other undocumented failures: edge cases invisible to automated grading

A Benchmark Already Unraveling Before This Study

The METR paper didn’t arrive in a vacuum. Benchmark critics had been accumulating evidence for over a year.

OpenAI abandoned SWE-bench Verified in 2025 after auditing 138 problems that its o3 model consistently failed across 64 independent runs. Six experienced software engineers reviewed each case. Their finding: 59.4% of those 138 tasks contained “material issues in test design and/or problem description, rendering them extremely difficult or impossible even for the most capable model or human to solve.”4 The breakdown: 35.5% had overly strict tests enforcing specific implementation details; 18.8% had tests checking functionality never described in the issue.

SWE-bench+ from York University found that 32.67% of resolved instances involved direct solution leakage—where the fix was effectively outlined in the issue report or its comments. When filtered for leakage and weak test cases, SWE-Agent + GPT-4’s performance dropped from 12.47% to 3.97%—a 68% reduction in apparent capability.5

The Zhejiang/Stuttgart “Are Solved Issues Really Solved Correctly?” paper applied differential patch testing to SWE-bench Verified and found 7.8% of plausible patches were definitively incorrect, with 29.6% exhibiting behavioral differences from ground truth that existing tests missed. The aggregate effect: a ~6.4 percentage point inflation in reported resolution rates.6
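
Differential patch testing, in the spirit of that paper, runs the model's patch and the ground-truth fix on the same generated inputs and flags any divergence. A toy sketch (real differential testing operates on repository diffs and test harnesses; here two plain Python callables stand in for the patched code paths):

```python
import random

def differential_test(model_fn, gold_fn, gen_input, trials: int = 1000):
    """Run both implementations on generated inputs; return the first
    input where their observable behavior (result or exception) diverges,
    or None if no difference was found."""
    for _ in range(trials):
        x = gen_input()
        try:
            a = ("ok", model_fn(x))
        except Exception as e:
            a = ("raise", type(e).__name__)
        try:
            b = ("ok", gold_fn(x))
        except Exception as e:
            b = ("raise", type(e).__name__)
        if a != b:
            return x  # a behavioral difference the test suite missed
    return None

# A "plausible" patch that matches the gold fix on non-negative inputs
# (the only ones the test suite exercises) but diverges on negatives.
model_patch = lambda x: abs(x)
gold_patch = lambda x: x
print(differential_test(model_patch, gold_patch,
                        lambda: random.randint(-10, 10)))
```

A patch can be "plausible" (all existing tests pass) while a randomized differential run like this still surfaces an input where it disagrees with the reference fix.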

| Study | Finding | Impact |
| --- | --- | --- |
| METR Maintainer Review (2026) | 24.2 pp gap between grader score and merge rate | Systematic overestimate of real-world utility |
| OpenAI Audit (2025) | 59.4% of hard tasks had material test flaws | Benchmark validity questioned at frontier |
| SWE-bench+ (York, 2024) | 32.67% of resolved tasks had solution leakage | GPT-4+SWE-Agent drops from 12.47% to 3.97% |
| Zhejiang/Stuttgart (2025) | 6.4 pp score inflation from undetected incorrect patches | Reported rates overstate correctness |
| SWE-bench Pro (Scale AI) | Top models drop from 70%+ to 17–23% on harder tasks | Scores collapse without benchmark-specific optimization |

The Contamination Layer

Performance inflation isn’t only about bad tests. Multiple studies point to a more troubling possibility: models may be partly remembering rather than reasoning through SWE-bench tasks.

A Purdue/Microsoft team found that models achieve up to 76% accuracy identifying buggy file paths using only issue descriptions—no repository structure, no code.7 That’s suspicious. If a model can locate the right file from a prose description alone, it likely encountered that issue during training. Their contamination analysis found SWE-bench Verified exhibiting “up to 35% consecutive 5-gram overlap ratio” versus 18% for other benchmarks—a strong signal of training-time exposure.
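
The consecutive 5-gram overlap metric can be approximated as the fraction of a candidate text's 5-grams that appear verbatim in a reference corpus. A simplified sketch (the paper's actual measure is more involved; tokenization here is naive whitespace splitting):

```python
def ngram_overlap(candidate: str, corpus: str, n: int = 5) -> float:
    """Fraction of the candidate's consecutive n-grams found verbatim
    in the corpus -- high values suggest training-time exposure."""
    def ngrams(text: str) -> set:
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    cand = ngrams(candidate)
    if not cand:
        return 0.0
    return len(cand & ngrams(corpus)) / len(cand)

issue = "the parser crashes when the input file contains a BOM marker"
training_text = ("bug report: the parser crashes when the input file "
                 "contains a BOM marker at offset zero")
print(ngram_overlap(issue, training_text))  # → 1.0 (full verbatim overlap)
```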

The University of Waterloo study compared SWE-bench Verified against BeetleBox, a benchmark using equivalent repositories but different issues. Models performed 3x better on SWE-bench Verified; with minimal context (issue text only), the gap grew to 6x.8 Microsoft Research’s SWE-bench Live reinforced this: the best-performing agent scored 22.96% on known SWE-bench repositories but only 18.89% on repositories never in the benchmark.9

Scale AI’s SWE-bench Pro measured the collapse directly. Top models (GPT-5, Claude Opus 4.1) scored over 70% on SWE-bench Verified. On SWE-bench Pro’s public dataset—harder tasks from different repositories—they dropped to 23.3% and 23.1%. On the private, less contaminated dataset, they fell to 14.9% and 17.8% respectively.10

What Maintainers Care About That Tests Don’t

The METR study’s taxonomy of rejection reasons maps directly to the gap between automated grading and human judgment. Consider what “passing tests” cannot capture:

Idiomatic consistency. A patch that fixes a bug using a pattern inconsistent with the rest of the codebase may be technically correct but creates maintenance debt. Test suites don’t encode idiom.

Minimal diff principle. Good patches change only what they must. AI-generated code frequently includes “supplementary semantic changes”—unnecessary modifications that inflate diff size and introduce review risk. The Zhejiang study found this pattern in 27.3% of suspicious patches.6
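
A crude automated gate for the minimal-diff principle is to count changed lines and flag patches far larger than the reference fix. A heuristic sketch using the stdlib `difflib`; the 3x threshold is illustrative, not from the paper:

```python
import difflib

def diff_size(original: str, patched: str) -> int:
    """Count added and removed lines between two file versions."""
    delta = difflib.unified_diff(original.splitlines(),
                                 patched.splitlines(), lineterm="")
    return sum(1 for line in delta
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---")))

def flag_oversized(model_patch_size: int, gold_patch_size: int,
                   factor: float = 3.0) -> bool:
    """Flag patches much larger than the reference fix -- a proxy for
    'supplementary semantic changes' that inflate review risk."""
    return model_patch_size > factor * max(gold_patch_size, 1)

before = "def f(x):\n    return x + 1\n"
after = "def f(x):\n    return x + 2\n"
print(diff_size(before, after))  # → 2 (one line removed, one added)
print(flag_oversized(40, 4))    # → True
```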

Architectural alignment. The “right” fix for an issue often requires understanding why existing code is structured a certain way. A model that solves the symptom without understanding the design philosophy may produce a patch that undermines future extensibility.

Regression in untested paths. Test suites are incomplete by definition. A patch that passes all tests while silently breaking an untested interaction is invisible to automated grading but visible to an experienced reviewer scanning the diff.

GitClear’s analysis of 211 million changed lines through 2025 found 4x growth in code cloning attributed to AI-assisted development, alongside code churn rates projected to double from their 2021 baseline.11 Code that passes review but gets reverted within weeks is the real-world manifestation of what SWE-bench graders can’t see.

What This Means for Practitioners

The METR team is explicit that their findings don't prove agents have hit a capability ceiling. The study didn't let agents iterate on maintainer feedback, a constraint of the controlled experiment rather than of production use. Real agentic workflows can incorporate feedback loops, style guides, and linter output that the benchmark doesn't model.

But the study does expose a specific danger: using SWE-bench scores to make procurement or deployment decisions. A model that scores 50% on SWE-bench may be generating code that real engineers would reject at twice the rate the benchmark implies.

For teams evaluating AI coding tools, several practices follow from this research:

Evaluation checklist for AI coding assistants:
- [ ] Run candidates on your actual repositories, not benchmark repos
- [ ] Have engineers review AI-generated PRs before merging, even when tests pass
- [ ] Track real metrics: PR merge rate, post-merge revert rate, code churn
- [ ] Apply style linters and architecture conformance checks as automated gates
- [ ] Weight benchmark performance on SWE-bench Pro or SWE-bench Live over saturated subsets
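
The "track real metrics" item can be as simple as aggregating exported PR records. A minimal sketch (the record fields are hypothetical; substitute whatever your PR tracker exposes):

```python
from dataclasses import dataclass

@dataclass
class PRRecord:
    merged: bool
    reverted_within_30d: bool = False  # only meaningful if merged

def merge_rate(prs: list) -> float:
    """Fraction of AI-generated PRs that maintainers actually merged."""
    return sum(p.merged for p in prs) / len(prs)

def revert_rate(prs: list) -> float:
    """Fraction of merged PRs reverted shortly after -- churn the
    benchmark grader never sees."""
    merged = [p for p in prs if p.merged]
    return sum(p.reverted_within_30d for p in merged) / len(merged)

prs = [PRRecord(True), PRRecord(True, True), PRRecord(False), PRRecord(True)]
print(merge_rate(prs))   # → 0.75
print(revert_rate(prs))  # → 0.3333333333333333
```

Tracked over time, these two numbers are a closer analogue of METR's maintainer-judgment signal than any benchmark score.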

The METR team’s 9.6 pp/year benchmark-drift finding suggests the problem will compound. As models optimize further for test-passing signals, the gap between “what scores well” and “what ships” will widen unless the benchmark community develops evaluation signals that reflect what maintainers actually care about.

OpenAI’s move away from SWE-bench Verified toward SWE-bench Pro is the canary in the coal mine. When a benchmark’s creator abandons it because frontier models have made it unrepresentative, the field’s evaluation infrastructure is in transition. Until next-generation benchmarks incorporate real merge judgment at scale, SWE-bench scores are best treated as a floor, not a ceiling, and a floor with known structural problems.


Frequently Asked Questions

Q: What is SWE-bench and why does it matter?
A: SWE-bench is a standard benchmark from Princeton that tests AI models on real GitHub issues from popular Python repositories, grading patches by whether their tests pass. It became the primary reference point for comparing AI coding agents—which makes its accuracy critical for anyone evaluating tools.

Q: How large is the gap between SWE-bench scores and real-world PR acceptance?
A: METR’s March 2026 study measured a 24.2 percentage point gap: if a model scores 50% on SWE-bench Verified, expect approximately 26% of those patches to be judged mergeable by actual project maintainers.

Q: Are AI coding tools still useful despite these benchmark problems?
A: Yes, but context matters. Controlled studies show AI tools accelerate greenfield tasks and simpler bug fixes significantly. The problems emerge at scale in existing codebases with established conventions, where test-passing doesn’t capture style, architecture, or maintainability concerns.

Q: What alternatives to SWE-bench Verified should practitioners watch?
A: SWE-bench Pro (Scale AI) applies harder, longer-horizon tasks and shows frontier models dropping to 17–23%. SWE-bench Live (Microsoft Research) provides continuously updated tasks from new issues, reducing contamination. Both are more conservative estimates of real capability.

Q: Should teams stop tracking SWE-bench scores entirely?
A: Not entirely—they remain useful for directional comparisons between models on identical task sets. But they should be supplemented with real-world metrics: PR merge rates, post-merge revert frequency, and direct maintainer or senior engineer review of AI-generated diffs.


Footnotes

  1. Carlos E. Jimenez et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” Princeton Language and Intelligence, October 2023. https://arxiv.org/abs/2310.06770

  2. Cognition AI. “SWE-bench Technical Report.” Cognition AI, March 2024. https://cognition.ai/blog/swe-bench-technical-report

  3. Parker Whitfill, Cheryl Wu, Joel Becker, Nate Rush. “Many SWE-bench-Passing PRs Would Not Be Merged into Main.” METR, March 10, 2026. https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/

  4. Mia Glaese and Olivia Watkins. “Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities.” OpenAI, 2025. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

  5. Reem Aleithan et al. “SWE-Bench+: Enhanced Coding Benchmark for LLMs.” York University, 2024. https://arxiv.org/abs/2410.06992

  6. You Wang, Michael Pradel, Zhongxin Liu. “Are ‘Solved Issues’ in SWE-bench Really Solved Correctly? An Empirical Study.” Zhejiang University / University of Stuttgart, 2025. https://arxiv.org/abs/2503.15223

  7. Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam. “The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason.” Purdue University / Microsoft, 2025. https://arxiv.org/html/2506.12286v3

  8. Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan. “Does SWE-Bench-Verified Test Agent Ability or Model Memory?” University of Waterloo, 2024. https://arxiv.org/html/2512.10218v1

  9. Microsoft Research. “SWE-bench Goes Live!” NeurIPS 2025 Datasets and Benchmarks. https://arxiv.org/abs/2505.23419

  10. Scale AI. “SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?” 2025. https://arxiv.org/abs/2509.16941

  11. GitClear. “AI Copilot Code Quality 2025 Research.” GitClear, 2025. https://www.gitclear.com/ai_assistant_code_quality_2025_research
