#benchmarks

10 articles exploring benchmarks. Expert insights and analysis from our editorial team.

Articles

Security

DPrivBench Exposes a Blind Spot: LLMs Can't Reliably Verify Their Own Differential Privacy Guarantees

A new benchmark tests 11 LLMs on 720 DP verification tasks. Top models ace textbook questions — then fall apart on the algorithms that actually appear in production privacy code.

· 6 min read
Agents & Frameworks

ml-intern's 32% GPQA Gain on a Single H100 Exposes the Assumption That Post-Training Still Needs a Human ML Researcher

ml-intern hit 32% on GPQA in under 10 hours, beating Claude Code's 22.99% on the same task, but the 51% ceiling set by instruction-tuned models marks the gap the autonomous loop has yet to close.

Ethics, Policy & Safety

Stanford's 2026 AI Index: Frontier Model Transparency Scores Collapsed 31% in One Year

The 2025 FMTI found average transparency scores dropped from 58 to 40 in a single year. Here's what that means for auditors and responsible deployment.

· 6 min read
Developer Tools

Cursor vs Windsurf vs GitHub Copilot: Real-World Benchmark on a 50k-Line Codebase

Beyond synthetic benchmarks — Cursor, Windsurf, and GitHub Copilot tested on production refactor tasks. Which tool earns its subscription?

· 9 min read
Models & Research

Qwen 2.5 vs Llama 3.3: The Open-Weight Showdown Nobody Is Talking About

Alibaba's Qwen 2.5 beats Meta's Llama 3.3 on math, multilingual tasks, and structured data — yet gets a fraction of the Western press coverage.

· 8 min read
Developer Tools

SWE-bench Verified Explained: What the Coding Agent Leaderboard Actually Measures (and What It Misses)

SWE-bench Verified tests AI agents on 500 real GitHub bug fixes. Learn what 'resolved 49%' means, how scoring works, and the benchmark's critical blind spots.

· 8 min read
Agents & Frameworks

SWE-Bench's Dirty Secret: AI-Passing PRs That Real Engineers Would Reject

New research from METR shows roughly half of SWE-bench-passing AI-generated PRs would be rejected by actual project maintainers, exposing a 24-percentage-point gap between benchmark scores and real-world code acceptability.

· 9 min read
Models & Research

Gemini 3.1 Pro: Google's New Reasoning Model Explained

Gemini 3.1 Pro is Google's latest reasoning-focused AI model, achieving 77.1% on ARC-AGI-2 benchmarks, more than double the performance of its predecessor. Here's how it compares to Claude and GPT.

· 8 min read
Models & Research

AI Code Generation Benchmarks 2026: Which Model Actually Writes Better Code?

Claude 3.5 Sonnet, GPT-4o, Gemini 2.5 Pro, and open-source models like Qwen2.5-Coder and DeepSeek show competitive performance on benchmarks, but real-world coding tasks reveal significant gaps between benchmark scores and practical utility.

· 8 min read
Developer Tools

Claude Code /fast Mode: Is 6x Pricing Worth It?

Anthropic's new fast mode for Claude Opus 4.6 promises 2.5x faster responses at 6x the cost. We analyze the speed vs. cost tradeoff, real-world use cases, and optimization strategies to help you decide when the premium is worth paying.

· 7 min read