SWE-bench Verified is a 500-task benchmark drawn from real GitHub issues across 12 Python repositories. An AI agent receives a codebase and issue description, then generates a patch. Success is binary: either all designated unit tests pass, or they don’t. A score of “49% resolved” means the agent correctly patched 245 of those 500 issues; it guarantees nothing more, and it says far less about your production codebase than the headline suggests.
Every week brings a new coding model announcement citing SWE-bench scores. Devin scored 13.86% and made headlines. Claude 3.5 Sonnet hit 49% and set a new record. As of early 2026, leading models on the bash-only leaderboard top 76%.1 These numbers compress enormously complex engineering capability into a single percentage—and that compression hides more than it reveals.
Understanding what SWE-bench Verified actually tests is essential for any practitioner evaluating coding agents. The gap between benchmark performance and production utility is wide, and it runs in both directions.
What Is SWE-bench Verified?
SWE-bench launched in October 2023 as a project from the Princeton NLP group, evaluating language models on 2,294 real GitHub issues sourced from 12 popular Python repositories.2 The core premise was elegantly practical: instead of synthetic coding puzzles, test models on the same issues real open-source maintainers had to resolve.
The repositories represent a cross-section of mature Python projects: astropy, django, flask, matplotlib, scikit-learn, sympy, pytest, and others. These codebases are large, well-tested, and representative of production-grade software, not toy examples.
SWE-bench Verified arrived in August 2024, created in collaboration with OpenAI.3 Human annotators reviewed 500 instances from the full dataset and verified three properties for each:
- Problem clarity — the issue description is unambiguous and actionable
- Test correctness — the designated tests actually verify the fix described
- Solvability — a competent software engineer can resolve it without external context
That third criterion is critical. The original SWE-bench contained instances where even expert human annotators couldn’t verify whether a model’s solution was correct—because the ground truth tests were flawed or the problem statements were ambiguous. Verified cleans this up, making the benchmark more reliable as a comparison tool.
How Scoring Actually Works
The evaluation mechanism is simpler than most practitioners assume. Each instance has two test categories:
- FAIL_TO_PASS: Tests that fail against the base commit and must pass after the agent’s patch
- PASS_TO_PASS: Tests that pass on the base commit and must continue passing after the patch
An instance is “resolved” only if every test in both categories passes after the patch is applied. This is all-or-nothing: a patch that fixes seven of eight failing tests scores zero.
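The all-or-nothing rule can be sketched as a small predicate over per-test results. This is an illustrative sketch, not the official harness API; the field names and status strings are assumptions:

```python
def is_resolved(report: dict) -> bool:
    """All-or-nothing resolution check for one SWE-bench instance.

    `report` maps each designated test to its post-patch status, split into
    the two categories. Field names here are illustrative; the real harness
    produces a richer report object.
    """
    fail_to_pass = report["FAIL_TO_PASS"]  # must flip from failing to passing
    pass_to_pass = report["PASS_TO_PASS"]  # must keep passing (no regressions)
    return (all(status == "PASSED" for status in fail_to_pass.values())
            and all(status == "PASSED" for status in pass_to_pass.values()))


report = {
    "FAIL_TO_PASS": {"test_fix": "PASSED", "test_edge_case": "FAILED"},
    "PASS_TO_PASS": {"test_existing": "PASSED"},
}
print(is_resolved(report))  # False: one of two FAIL_TO_PASS tests still fails
```

A patch that fixes most, but not all, of the designated tests therefore contributes exactly zero to the score.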
The evaluation runs inside Docker containers, providing reproducibility at the cost of real-world fidelity. There’s no IDE, no compiler feedback loop, no ability to run the full test suite interactively. Agents typically receive only the repository files and the issue description. Git history beyond the base commit is removed to prevent information leakage.
Typical resource constraints per instance during evaluation: 45 minutes of wall-clock time, with successful runs often consuming 100,000+ tokens and hundreds of model turns.4
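For scale, a submission to the evaluation harness is essentially a JSONL file of patches keyed by instance ID. A minimal sketch of writing one (field names follow the SWE-bench predictions format as documented upstream; the instance ID is a real SWE-bench identifier, but the model label and patch are placeholders, and the schema should be verified against the harness version you actually run):

```python
import json

# One prediction per instance: an identifier plus the agent's unified diff
# against the base commit. "my-agent-v1" and the patch body are placeholders.
predictions = [
    {
        "instance_id": "django__django-11099",  # repo + issue identifier
        "model_name_or_path": "my-agent-v1",    # label shown in reports
        "model_patch": "diff --git a/django/core/mail/message.py ...",  # truncated
    },
]

with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```

The harness then replays each patch in its container and runs the designated tests against it.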
Current Leaderboard Landscape
The SWE-bench Verified leaderboard has moved fast. Understanding the progression provides context for what current scores mean:
| Model / Agent | Score | Date |
|---|---|---|
| Claude 2 (oracle retrieval) | 4.8% (full SWE-bench) | Oct 2023 |
| Devin (Cognition) | 13.86% (full SWE-bench) | Mar 2024 |
| Amazon Q Developer | 20.3% (SWE-bench Lite) | May 2024 |
| Claude 3.5 Sonnet (old) | 33% (Verified) | Oct 2024 |
| Claude 3.5 Sonnet (new) | 49% (Verified) | Nov 2024 |
| Claude 4.5 Opus (mini-SWE-agent v2) | 76.8% (Verified) | Feb 2026 |
| Gemini 3 Flash (mini-SWE-agent v2) | 75.8% (Verified) | Feb 2026 |

Note that the entries before August 2024 predate the Verified subset and were reported on the full dataset or the Lite subset, so they are not directly comparable to later rows.
The jump from 49% to 76% over roughly 15 months reflects both improved model capability and improved scaffolding. Mini-SWE-agent v2 uses tool calling rather than parsing actions from output strings—a significant scaffolding improvement that alone accounts for some of the score increase.5
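The scaffolding difference is concrete: a string-parsing agent must fish the action out of free-form model text, while a tool-calling agent receives it as structured data. A simplified sketch (the delimiter format and field names are illustrative, not mini-SWE-agent’s actual protocol):

```python
import re

# v1-style scaffolding: extract the command from free-form model output.
# Fragile: any prose that mimics the delimiters can corrupt the parse.
v1_output = ("I'll rerun the failing test.\n"
             "<command>\npytest tests/test_io.py -x\n</command>")
match = re.search(r"<command>\n(.*?)\n</command>", v1_output, re.DOTALL)
v1_command = match.group(1) if match else None

# v2-style scaffolding: the model emits a structured tool call; no parsing.
v2_output = {"tool": "bash", "arguments": {"command": "pytest tests/test_io.py -x"}}
v2_command = v2_output["arguments"]["command"]

print(v1_command == v2_command)  # same action, recovered two different ways
```

Every parse failure in the v1 style wastes a turn or derails a rollout, so eliminating that failure mode alone lifts scores before any model improvement.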
Commercial agent systems (Devin, Cursor, GitHub Copilot Workspace) and research frameworks (OpenHands, AutoCodeRover) appear in a separate full-system leaderboard, while the bash-only leaderboard isolates raw model capability. This distinction matters when reading claims: a product demo using proprietary scaffolding, RAG, and multi-agent review is a fundamentally different system than a single model in a bash shell.
What SWE-bench Verified Does Well
The benchmark’s design choices hold up under scrutiny. Using real GitHub issues avoids the synthetic-task problem that plagues benchmarks like HumanEval, where models may be tested on code similar to their training data. The FAIL_TO_PASS structure provides ground truth: the original developers wrote these tests to verify exactly the behavior that was broken.
The human verification process meaningfully improves signal quality. Early SWE-bench results were noisy partly because some instances had broken test suites or poorly specified problems. The 500-instance Verified subset removes this confound, giving more interpretable scores.
The difficulty distribution across four tiers—from straightforward single-file fixes to complex multi-component issues—allows finer-grained analysis beyond the headline number. A model scoring 76% overall might resolve 95% of easy instances and 40% of hard ones, information that matters more than the aggregate.
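A per-tier breakdown is straightforward to compute from per-instance results. A minimal sketch, where the results are fabricated for illustration and the tier labels assume the Verified annotations’ estimated-fix-time buckets:

```python
from collections import defaultdict

# Hypothetical per-instance results: (difficulty tier, resolved?).
results = [
    ("<15 min", True), ("<15 min", True), ("<15 min", True), ("<15 min", True),
    ("15 min - 1 hr", True), ("15 min - 1 hr", True), ("15 min - 1 hr", False),
    ("1-4 hrs", True), ("1-4 hrs", False),
    (">4 hrs", False),
]

tally = defaultdict(lambda: [0, 0])  # tier -> [resolved, total]
for tier, resolved in results:
    tally[tier][0] += int(resolved)
    tally[tier][1] += 1

for tier, (resolved, total) in tally.items():
    print(f"{tier}: {resolved}/{total} resolved ({resolved / total:.0%})")
```

Two models with identical headline scores can have very different tier profiles, which is exactly the information the aggregate hides.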
The Benchmark’s Critical Blind Spots
Understanding what SWE-bench Verified omits is where the real practitioner insight lives.
Python-Only, 12 Repos
The entire benchmark draws from 12 Python repositories. TypeScript, Go, Rust, Java—absent. A model’s SWE-bench score tells you nothing about its performance on your Node.js backend or your Kotlin Android app. The repos also skew toward scientific computing and developer tooling (astropy, sympy, matplotlib, sphinx), which differs substantially from the web services and microservices work that dominates most engineering organizations.
Bug Fixes, Not Feature Development
The issues in SWE-bench are predominantly bug fixes identified through GitHub issue trackers—something broke, and there’s an existing test that proves it. This excludes the substantial portion of real engineering work that involves:
- Designing and implementing new features from scratch
- Writing tests before implementation (test-driven development)
- Large-scale refactors affecting multiple subsystems
- Architecture decisions where no single correct answer exists
No Test Authoring
Perhaps the most significant gap: SWE-bench never asks an agent to write tests. In practice, a substantial part of professional software development involves authoring tests alongside implementation. A model that resolves 76% of issues against pre-specified test suites might perform very differently when tasked with deciding what to test for a new feature.
No Code Quality Signal
A patch that makes tests pass scores the same as a patch that makes tests pass while introducing a subtle N+1 query, duplicating business logic, or using a deprecated API. The benchmark provides zero signal on:
- Code readability and maintainability
- Performance characteristics
- Security implications of the change
- Whether the approach aligns with the project’s conventions
Single-Turn Resolution
Real software development is iterative. A developer submits a PR, receives review feedback, revises the implementation, and repeats. SWE-bench measures single-shot resolution: either the agent’s patch works or it doesn’t. This favors agents that generate conservative, narrowly-scoped changes and penalizes exploratory approaches that might be more appropriate for genuinely ambiguous problems.
Training Data Contamination
The 12 repositories in SWE-bench are among the most widely-referenced Python codebases on the internet—and therefore likely well-represented in model training data. While the benchmark uses specific historical commits as base states, there’s no strong guarantee that a model hasn’t seen both the issue and the solution in its pretraining corpus. This concern is particularly acute for older issues in popular repositories.
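There is no clean fix, but a cheap triage step is to flag instances whose ground-truth fix predates a model’s training cutoff. A sketch, where the cutoff, the second instance ID, and both merge dates are illustrative rather than real metadata:

```python
from datetime import date

TRAINING_CUTOFF = date(2024, 4, 1)  # hypothetical model training cutoff

# Hypothetical instances: (instance_id, merge date of the ground-truth fix).
instances = [
    ("django__django-11099", date(2019, 4, 3)),
    ("example__repo-12345", date(2024, 9, 1)),
]

# Anything merged before the cutoff could appear in the training corpus.
at_risk = [iid for iid, merged in instances if merged < TRAINING_CUTOFF]
print(at_risk)  # only the pre-cutoff instance is flagged
```

This only bounds the risk; it cannot prove a post-cutoff instance is clean, nor that a pre-cutoff one was memorized.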
Mapping Scores to Your Use Case
The practical question for engineering teams isn’t “what did this model score on SWE-bench?” but “what does that score predict about its usefulness in our environment?”
| Your Use Case | SWE-bench Relevance | Gaps to Test Separately |
|---|---|---|
| Debugging Python services | High | Your codebase conventions, test quality |
| TypeScript/Node.js development | Low | Run your own evals |
| Feature development | Moderate | Test authoring, architecture decisions |
| Code review assistance | Low | Review quality isn’t measured |
| Test generation | Very Low | Benchmark never tests this |
| Legacy codebase modernization | Low | Multi-file refactors underrepresented |
| Scientific Python work | High | Closest to benchmark distribution |
The key insight: SWE-bench Verified is a strong signal for one narrow capability—debugging and patching existing Python code against a known test oracle. It’s a weak-to-irrelevant signal for everything else.
Alternative Benchmarks and What They Add
SWE-bench doesn’t operate in a vacuum. A more complete picture comes from triangulating across several evaluations:
SWE-bench Multilingual (2025) extends the original to non-Python repositories, addressing the language bias. Still early, with limited model coverage.
EvalPerf focuses on code efficiency rather than correctness—testing whether generated code runs fast, not just whether it runs correctly. Orthogonal to SWE-bench and increasingly relevant as LLM-generated code enters performance-sensitive paths.6
METR Task Suite evaluates agents on longer-horizon software engineering and research tasks with human baselines. It captures the difference between resolving a GitHub issue and designing a system—and finds that current agents plateau around 200,000 tokens of context, regardless of budget.7
LiveCodeBench provides contamination-resistant evaluation using problems released after model training cutoffs, mitigating the data leakage concern inherent in any static benchmark.
What “Verified” Changes About the Methodology
The human verification step isn’t cosmetic. The original SWE-bench had a meaningful false-difficulty rate: issues where the designated tests were incorrect, the problem statement was misleading, or the expected patch didn’t actually fix the described behavior. Verified addresses this by requiring human annotators to confirm solvability.
This creates a more reliable upper bound. A score of 76% on Verified means 76% of 500 instances that a competent human engineer confirmed were solvable, with correctly-specified tests. The original benchmark couldn’t make this guarantee.
The tradeoff: the 500 Verified instances don’t cover the full difficulty distribution of the 2,294-instance dataset. The most ambiguous and complex issues tend to get filtered out. This makes Verified easier on average—and means that scores are not directly comparable between the full dataset and the Verified subset.
Frequently Asked Questions
Q: What does a 49% SWE-bench Verified score actually mean? A: It means the model correctly patched 245 of 500 human-verified Python bug-fix tasks, with all designated tests passing in a containerized environment. It says nothing about code quality, performance, security, or the model’s ability to work on non-Python codebases or write new features.
Q: Why do different agent systems report such different scores for the same model? A: The scaffolding—how the agent is structured, what tools it has, whether it can do multi-rollout review, whether it uses RAG—accounts for a large portion of the variance. The same underlying model can score 30 percentage points differently depending on the agent framework. This is why swebench.com maintains separate leaderboards for bare-model (mini-SWE-agent) and full-system evaluations.
Q: Is SWE-bench Verified subject to training data contamination? A: Potentially, yes. The 12 repositories are widely-referenced Python projects likely present in most model training corpora. While base commits are historical states and specific patches are harder to memorize than functions, there’s no strong contamination guarantee. Models released after the benchmark’s publication have stronger contamination concerns.
Q: Should I use SWE-bench scores when choosing a coding agent for my team? A: Use it as one input, weighted by how closely your work resembles the benchmark tasks. If you write Python and primarily deal with bug fixing in well-tested libraries, SWE-bench scores are meaningful signal. For TypeScript frontends, new feature development, or test authoring, the score has limited predictive value. Supplement with internal evaluation on your actual codebase.
Q: What benchmark better captures real-world engineering work? A: No single benchmark does. METR’s task suite handles longer-horizon work; LiveCodeBench addresses contamination; EvalPerf addresses efficiency. For production decisions, nothing substitutes for running candidate agents against a curated set of real issues from your own repositories.
Footnotes
1. SWE-bench leaderboard, mini-SWE-agent v2.0.0 category. swebench.com, accessed March 2026. ↩
2. Jimenez et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv:2310.06770, October 2023. ↩
3. SWE-bench Verified description. swebench.com/verified.html. “A human-filtered subset of 500 instances from SWE-bench, created in collaboration with OpenAI.” ↩
4. Anthropic. “Claude 3.5 Sonnet achieves 49% on SWE-bench Verified.” anthropic.com, November 2024. ↩
5. SWE-bench methodology notes. swebench.com. “Version 2.x uses tool calling; 1.x parses actions from output strings. Results across versions aren’t directly comparable.” ↩
6. EvalPlus team. “EvalPerf: Evaluating Code Efficiency via Differential Performance.” evalplus.github.io. ↩
7. METR. “Update on Evaluations.” metr.org, August 2024. “Agent performance on our task suite seems to plateau at around 200 thousand tokens.” ↩