A March 2026 note from AI safety research organization METR contains an uncomfortable finding: roughly half of code pull requests generated by frontier AI agents—the ones that pass all automated tests on SWE-bench Verified—would be rejected by actual repository maintainers for quality reasons. By at least one practical measure, LLM capability in software engineering has not improved since early 2025. The plateau is real. But diagnosing whether it reflects a fundamental ceiling, an instrumentation failure, or a natural transition between scaling regimes requires looking at the evidence carefully.
What Is the Merge Rate Problem?
The concept of “merge rate” as a capability metric comes from METR’s ongoing effort to track AI progress in ways that matter beyond test-passing. In their March 2026 analysis,1 METR examined AI-generated PRs across SWE-bench Verified and found that roughly half of solutions accepted by the automated grader would not pass human code review. The primary failure modes were not functional gaps—the code worked—but code quality issues: excessive complexity, violations of project conventions, and regressions in adjacent functionality.
This is not a minor benchmarking footnote. The finding implies that the measured “50% solve rate” on SWE-bench Verified reflects something closer to an 8-minute task horizon under strict maintainer standards, compared to the 50-minute horizon under automated-only evaluation.2 The gap between what models can do to satisfy an evaluator and what they can do to satisfy a human collaborator is substantial—and has not narrowed in over a year.
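One way to read these numbers is as a simple discount on the headline score: only tasks that both pass the automated grader and would survive human review count as solved. The sketch below is illustrative arithmetic using the figures cited above; the multiplicative model is an assumption of this article, not METR's methodology.

```python
# Illustrative arithmetic only: discount an automated solve rate by the
# fraction of passing PRs a human maintainer would actually merge.
# The multiplicative model is an assumption, not METR's methodology.

def maintainer_adjusted_solve_rate(solve_rate: float, merge_rate: float) -> float:
    """Fraction of tasks that both pass automated tests AND would be merged."""
    return solve_rate * merge_rate

automated = 0.50  # headline SWE-bench Verified solve rate cited above
merged    = 0.50  # share of passing PRs maintainers would accept (METR)

print(maintainer_adjusted_solve_rate(automated, merged))  # 0.25
```

Under these figures, a "50% solve rate" shrinks to an effective 25% once maintainer standards are applied, which is the intuition behind the shorter task horizon.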
The Hacker News thread responding to METR’s data3 surfaced a telling observation: practitioners who noted that “something happened in 2025 that made Claude Code and similar tools much better” were unable to point to corresponding improvements in merge-rate data. The subjective improvement in developer experience is real; the measurable improvement in code quality, as defined by maintainer standards, is not clearly documented.
The Measurement Problem: Benchmarks That Can No Longer Measure
Part of the apparent plateau is genuine. Part of it is an artifact of how the industry has been measuring progress.
Traditional benchmarks—MMLU, GSM8K, HumanEval—are now largely saturated. According to Stanford’s 2025 AI Index Report,4 frontier models have exceeded 88% accuracy on MMLU, with recent models approaching 93%, making the benchmark useless for differentiating top systems. The industry’s response has been to keep inventing harder benchmarks, but this creates a treadmill effect: models master each new benchmark faster than the next one can be designed and validated.
The saturation problem extends to SWE-bench itself. As OpenAI noted when it retired SWE-bench Verified from its primary evaluation suite,5 the benchmark’s limitations became apparent once frontier models began scoring above 70%. More critically, analysis published in early 2026 found that approximately 60% of “successfully resolved” SWE-bench issues involved some form of solution leakage—where the answer was partially visible in the issue description or comments.6 Benchmark contamination and structural flaws mean that headline scores overstate real-world capability.
The industry has responded by pushing evaluation toward genuinely hard frontiers: “Humanity’s Last Exam” (HLE) and FrontierMath, where current state-of-the-art models score around 8-9% and 2% respectively. These benchmarks remain unsaturated precisely because they require genuine reasoning rather than pattern completion.
Why Pre-training Scaling Is Genuinely Slowing
Beyond measurement artifacts, there are structural reasons why the original scaling recipe (more compute, more data, and bigger models yield better performance) is delivering diminishing returns.
The primary bottleneck is high-quality training data. Research from Epoch AI7 estimates that current large models have consumed most of the genuinely useful text available on the public web. While approximately 510 trillion tokens are theoretically available, the subset that is high-quality, non-repetitive, and legally usable is significantly smaller. Projections for exhaustion of high-quality human-generated data range from 2026 to 2032, depending on overtraining tolerance and data filtering quality.
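The shape of these exhaustion projections follows from straightforward exponential arithmetic: if training sets grow by a fixed factor per year, the time until they overrun the usable stock is a logarithm. The sketch below uses placeholder numbers, not Epoch AI's published estimates.

```python
import math

# Sketch of a data-exhaustion projection in the spirit of the Epoch AI
# analysis. All three numbers below are placeholders for illustration,
# not Epoch's published estimates.

def years_until_exhaustion(current_tokens: float,
                           usable_stock: float,
                           annual_growth: float) -> float:
    """Years t until current_tokens * annual_growth**t exceeds usable_stock."""
    return math.log(usable_stock / current_tokens) / math.log(annual_growth)

# Placeholders: 15T tokens per frontier run today, 150T usable, 2x/year growth.
t = years_until_exhaustion(15e12, 150e12, 2.0)
print(round(t, 1))  # ~3.3 years under these assumptions
```

The wide 2026-2032 range in the text reflects how sensitive this logarithm is to the assumed usable stock and growth rate.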
This creates a compounding problem: as labs push into lower-quality data to maintain training scale, pre-training returns diminish faster. The dynamic is visible in the broader industry conversation. As Sebastian Raschka’s 2025 state-of-LLMs analysis observed,8 improvements in 2024 were primarily driven by post-training and inference-time techniques rather than pre-training advances—a structural shift from where gains came in 2021-2023.
Dense transformer scaling also hit an architectural ceiling. Scaling a single dense model—more parameters, proportionally more compute per token—delivers sub-linear returns past a certain threshold. The move to Mixture of Experts (MoE) architectures is the direct architectural response to this limit.
The Densing Law: Efficiency Is Still Scaling
The pre-training plateau does not mean LLMs stopped improving. A different axis of progress—capability density—has continued to advance according to a separate scaling law.
Research published in Nature Machine Intelligence in late 2025 by Xiao et al. introduced the “densing law”:9 the observation that capability per parameter doubles approximately every 3.5 months. The metric, called “capability density,” measures how much performance a model achieves relative to its actual parameter count compared to a reference baseline. By this measure, the industry is not plateauing—it is producing equivalent capability with exponentially fewer parameters over time.
In practical terms, a 3.5-month doubling implies that a model which required 70 billion parameters to achieve a given benchmark score could be matched by a roughly 35-billion-parameter model 3.5 months later, and by one of 17-18 billion parameters after seven months. The gains come from improvements in training data quality, architectural efficiency (particularly MoE designs), and post-training alignment techniques.
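The halving schedule above is just the densing law applied to a fixed capability target: required parameters halve every doubling period. A minimal sketch of that projection, written for this article rather than taken from the paper's code:

```python
# Parameters needed to match a fixed capability level under the densing
# law: capability density doubles every ~3.5 months, so the parameter
# count required for a fixed score halves on the same schedule.
# Illustrative sketch; not code from the Xiao et al. paper.

DOUBLING_MONTHS = 3.5

def params_needed(initial_params_b: float, months_elapsed: float) -> float:
    """Parameter count (billions) matching the original model's score."""
    return initial_params_b / 2 ** (months_elapsed / DOUBLING_MONTHS)

for months in (0, 3.5, 7, 10.5):
    print(f"after {months} months: ~{params_needed(70, months):.1f}B params")
```

Starting from 70B, this yields 35B after one period and 17.5B after two, matching the figures above.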
Three Scaling Frontiers Have Replaced One
The original scaling law—compute, data, parameters—has fractured into three distinct and partially independent scaling axes:
| Scaling Axis | Mechanism | Representative Models | Current Status |
|---|---|---|---|
| Pre-training | More compute + data during training | GPT-4, early LLaMA generations | Diminishing returns on dense models; shifted to MoE |
| Post-training | RLHF, DPO, instruction tuning, synthetic data | Claude 3.x series, Gemini 2.x | Active gains; data quality is the constraint |
| Test-time compute | More inference steps, chain-of-thought, verifier models | o1, o3, DeepSeek-R1, Gemini Flash 2.0 | High gains on reasoning tasks; limited benefit on hardest problems |
Research published on OpenReview10 found that optimally scaling test-time compute can be more effective than scaling model parameters for easy-to-medium difficulty tasks. However, the same research found that for genuinely hard problems, pre-training compute remains more effective than inference-time scaling. The practical implication: test-time compute is a partial substitute for pre-training, not a full replacement.
The MoE architectural transition represents the industry’s primary response to the dense scaling ceiling. As NVIDIA documented,11 leading frontier models including GPT-5, Meta’s Llama 4, and Gemini 3 now use MoE designs. DeepSeek-V4’s architecture—1.5 trillion total parameters with only ~30 billion activated per token—demonstrates the economic logic: knowledge capacity scales with total parameters, but inference cost scales with activated parameters. MoE breaks the historically fixed relationship between model capability and per-token compute cost.
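The economic logic of MoE can be made concrete with the common rule of thumb that a forward pass costs roughly 2 FLOPs per active weight per token. The comparison below reuses the DeepSeek-V4 figures cited above; the 2x approximation and the hypothetical dense counterpart are assumptions for illustration.

```python
# Why MoE decouples capability from per-token cost: a rough FLOPs-per-token
# comparison using the common ~2 FLOPs per active weight approximation.
# The 1.5T / 30B figures repeat those cited in the text; the dense
# counterpart is hypothetical.

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token (~2 FLOPs per active weight)."""
    return 2 * active_params

dense_1_5t = flops_per_token(1.5e12)  # hypothetical dense 1.5T-param model
moe_30b    = flops_per_token(30e9)    # MoE: 1.5T total, ~30B activated/token

print(f"{dense_1_5t / moe_30b:.0f}x cheaper per token")  # 50x cheaper per token
```

Knowledge capacity tracks the 1.5T total parameters, while serving cost tracks only the ~30B activated ones, a 50x per-token saving under this approximation.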
What the Data Means for Practitioners
The merge rate plateau exposed by METR is a credible signal that practical LLM utility—as measured by code quality a human would accept—has not improved meaningfully in roughly 12 months, even as benchmark scores continued to rise. This gap has direct implications for how teams should calibrate expectations and workflows.
For AI-assisted software development, this means that higher benchmark scores do not reliably translate to fewer human review cycles or higher code merge rates. The quality gap between what passes automated evaluation and what passes human review remains substantial and appears sticky.
The more actionable finding is that the industry’s center of gravity has shifted from raw capability (what the model knows) to reasoning quality (how it uses what it knows). Models with strong test-time scaling—particularly those with explicit chain-of-thought reasoning trained via reinforcement learning—show differentiated performance on complex tasks that older benchmark scores do not capture.
For teams evaluating models, the MIT Technology Review’s February 2026 analysis12 of what it called “the most misunderstood graph in AI” makes a useful point: METR’s time horizon metric shows continued exponential growth in task completion capability, but the Y-axis measures duration of correctly completed autonomous tasks, not code quality. Progress continues—it just isn’t uniformly distributed across all capability dimensions.
What Comes Next
The structural trajectory points toward several near-term shifts:
Synthetic data at scale. With high-quality human-generated text approaching practical exhaustion, frontier labs are investing heavily in synthetic data pipelines—using models to generate and verify training data. The quality ceiling on synthetic data is still being probed, but early evidence from post-training improvements suggests it can extend useful training data substantially.
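The generate-and-verify pattern is essentially rejection sampling: keep only candidate examples a verifier accepts. A toy sketch under stated assumptions; `generate` and `verify` are stand-ins for model calls, and the arithmetic task is purely illustrative.

```python
# Minimal sketch of a generate-and-verify synthetic-data loop.
# `generate` and `verify` are hypothetical stand-ins for model calls;
# real pipelines use an LLM generator and a learned or programmatic verifier.

import random
random.seed(0)  # deterministic for the example

def generate() -> str:
    # Stand-in for sampling a candidate training example from a model;
    # some fraction of candidates are deliberately wrong.
    a, b = random.randint(1, 9), random.randint(1, 9)
    claimed = a + b if random.random() < 0.7 else a + b + 1
    return f"{a}+{b}={claimed}"

def verify(example: str) -> bool:
    # Stand-in for a verifier; here, exact arithmetic checking.
    expr, claimed = example.split("=")
    a, b = map(int, expr.split("+"))
    return a + b == int(claimed)

# Rejection sampling: keep only examples the verifier accepts.
dataset = [ex for ex in (generate() for _ in range(1000)) if verify(ex)]
print(len(dataset) / 1000)  # roughly 0.7 under these assumptions
```

The quality ceiling the text mentions lives in `verify`: the filtered data is only as clean as the verifier is reliable.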
Architectural diversification. MoE is now standard; the next architectural bets involve better expert routing, multi-modal efficiency, and long-context architectures that avoid the quadratic compute penalty of full attention.
Reasoning models for complex tasks. The performance gap between standard instruction-following models and reinforcement-learning-trained reasoning models (o-series, R1-style) is meaningful on complex tasks and essentially zero on simple ones. Deployment patterns are converging on routing: cheap fast models for common cases, reasoning models for hard cases.
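The routing pattern described above can be sketched as a difficulty-gated dispatcher. Everything here is a placeholder: the model names are hypothetical, and a real difficulty estimator would be a learned classifier rather than a keyword heuristic.

```python
# Sketch of the routing pattern: send easy requests to a cheap fast model
# and escalate complex ones to a reasoning model. Model names and the
# keyword-based difficulty heuristic are placeholder assumptions.

CHEAP_MODEL = "fast-instruct"     # hypothetical
REASONING_MODEL = "reasoner-pro"  # hypothetical

def estimate_difficulty(prompt: str) -> float:
    # Placeholder heuristic: multi-step cue words suggest a hard task.
    cues = ("prove", "refactor", "debug", "multi-step", "plan")
    return sum(cue in prompt.lower() for cue in cues) / len(cues)

def route(prompt: str, threshold: float = 0.2) -> str:
    """Pick a model name based on estimated task difficulty."""
    if estimate_difficulty(prompt) >= threshold:
        return REASONING_MODEL
    return CHEAP_MODEL

print(route("Summarize this changelog"))             # fast-instruct
print(route("Debug and refactor this async queue"))  # reasoner-pro
```

Because the reasoning-model premium buys nothing on simple tasks, even a crude router like this captures most of the cost savings.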
Evaluation infrastructure catching up. The measurement problem is now clearly identified. Expect new evaluation frameworks that incorporate maintainer-quality standards, adversarial robustness checks, and longitudinal consistency testing rather than single-pass benchmark scores.
The plateau in merge rate improvements is a real and important finding. But it coexists with continued progress in model efficiency, reasoning capability, and architectural design. The scaling era is not ending—it is reorganizing around different axes, with different constraints and different beneficiaries.
Frequently Asked Questions
Q: What is an LLM merge rate, and why does it matter? A: Merge rate measures the fraction of AI-generated code pull requests that human repository maintainers would accept into production—not just whether they pass automated tests. It matters because it reflects real-world code quality rather than benchmark optimization, and METR’s March 2026 data shows no clear improvement in this metric since early 2025.
Q: Are pre-training scaling laws definitively broken? A: Not definitively, but they are delivering diminishing returns on dense transformer architectures, primarily due to high-quality training data constraints. Frontier labs have responded by shifting to Mixture of Experts architectures, post-training scaling, and test-time compute scaling—three distinct levers that partially substitute for pre-training.
Q: What does “capability density” mean and why does it matter? A: Capability density, as defined in the Nature Machine Intelligence densing law paper, measures model performance relative to parameter count. It doubles approximately every 3.5 months—meaning equivalent capability is achievable with exponentially fewer parameters over time, which translates directly to lower inference costs and faster deployment.
Q: Should I still use SWE-bench scores to evaluate AI coding tools? A: Use them with caution. OpenAI retired SWE-bench Verified from its primary evaluation suite after frontier models exceeded 70%, and analysis found roughly 60% solution leakage in the dataset. For meaningful differentiation between frontier models, use SWE-bench Pro (where top models score ~23%) or domain-specific tasks from your actual codebase.
Q: What scaling approach delivers the most practical value in 2026? A: For most production applications, post-training improvements (instruction quality, alignment, RLHF) and test-time compute (reasoning models for complex tasks) are delivering more actionable gains than raw pre-training scale. The optimal choice depends on task complexity: use fast standard models for routine tasks and explicit reasoning models for multi-step problems.
Footnotes
1. METR. “Many SWE-bench-Passing PRs Would Not Be Merged into Main.” March 10, 2026. https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/
2. METR. “Research Update: Algorithmic vs. Holistic Evaluation.” August 12, 2025. https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/
3. Hacker News. “Are LLM merge rates not getting better?” Discussion thread. https://news.ycombinator.com/item?id=47349334
4. Stanford HAI. “Technical Performance.” 2025 AI Index Report. https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance
5. OpenAI. “Why SWE-bench Verified no longer measures frontier coding capabilities.” https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
6. DEV Community. “SWE-bench Scores Are Lying to You: Half of Passing PRs Wouldn’t Be Merged.” https://dev.to/benriemer/swe-bench-scores-are-lying-to-you-half-of-passing-prs-wouldnt-be-merged-8h2
7. Epoch AI. “Will we run out of data to train large language models?” https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
8. Raschka, Sebastian. “The State Of LLMs 2025: Progress, Progress, and Predictions.” https://magazine.sebastianraschka.com/p/state-of-llms-2025
9. Xiao et al. “Densing Law of LLMs.” Nature Machine Intelligence, Volume 7, November 2025. https://www.nature.com/articles/s42256-025-01137-0
10. Snell et al. “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.” arXiv:2408.03314. https://arxiv.org/abs/2408.03314
11. NVIDIA Blog. “Mixture of Experts Powers the Most Intelligent Frontier AI Models.” https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/
12. MIT Technology Review. “This is the most misunderstood graph in AI.” February 5, 2026. https://www.technologyreview.com/2026/02/05/1132254/this-is-the-most-misunderstood-graph-in-ai/