#benchmarks
10 articles exploring benchmarks. Expert insights and analysis from our editorial team.
Articles
DPrivBench Exposes a Blind Spot: LLMs Can't Reliably Verify Their Own Differential Privacy Guarantees
A new benchmark tests 11 LLMs on 720 DP verification tasks. Top models ace textbook questions — then fall apart on the algorithms that actually appear in production privacy code.
ml-intern's 32% GPQA Gain on a Single H100 Exposes the Assumption That Post-Training Still Needs a Human ML Researcher
ml-intern hit 32% on GPQA in under 10 hours, beating Claude Code's 22.99% on the same task. But the 51% ceiling set by instruction-tuned models marks the gap the autonomous loop cannot close.
Stanford's 2026 AI Index: Frontier Model Transparency Scores Collapsed 31% in One Year
The 2025 FMTI found average transparency scores dropped from 58 to 40 in a single year. Here's what that means for auditors and responsible deployment.
Cursor vs Windsurf vs GitHub Copilot: Real-World Benchmark on a 50k-Line Codebase
Beyond synthetic benchmarks — Cursor, Windsurf, and GitHub Copilot tested on production refactor tasks. Which tool earns its subscription?
Qwen 2.5 vs Llama 3.3: The Open-Weight Showdown Nobody Is Talking About
Alibaba's Qwen 2.5 beats Meta's Llama 3.3 on math, multilingual tasks, and structured data — yet gets a fraction of the Western press coverage.
SWE-bench Verified Explained: What the Coding Agent Leaderboard Actually Measures (and What It Misses)
SWE-bench Verified tests AI agents on 500 real GitHub bug fixes. Learn what 'resolved 49%' means, how scoring works, and the benchmark's critical blind spots.
SWE-bench's Dirty Secret: AI-Passing PRs That Real Engineers Would Reject
New research from METR shows roughly half of SWE-bench-passing AI-generated PRs would be rejected by actual project maintainers — exposing a 24-percentage-point gap between benchmark scores and real-world code acceptability.
Gemini 3.1 Pro: Google's New Reasoning Model Explained
Gemini 3.1 Pro is Google's latest reasoning-focused AI model, achieving 77.1% on the ARC-AGI-2 benchmark — more than double the performance of its predecessor. Here's how it compares to Claude and GPT.
AI Code Generation Benchmarks 2026: Which Model Actually Writes Better Code?
Claude 3.5 Sonnet, GPT-4o, Gemini 2.5 Pro, and open-source models like Qwen2.5-Coder and DeepSeek show competitive performance on benchmarks, but real-world coding tasks reveal significant gaps between benchmark scores and practical utility.
Claude Code /fast Mode: Is 6x Pricing Worth It?
Anthropic's new fast mode for Claude Opus 4.6 promises 2.5x faster responses at 6x the cost. We analyze the speed vs. cost tradeoff, real-world use cases, and optimization strategies to help you decide when the premium is worth paying.