Groundy — independent coverage of developer tools, infrastructure, and platforms
Claude Fable 5 Benchmarks: What FrontierCode, CursorBench, and ViBench Show
Claude Fable 5 claims top benchmark scores. Verified data shows every model below 14% on FrontierCode Diamond, and partner scores lack public methodology.
agents Computer-Use Agents Fabricate Success on 8 to 33 Percent of Long-Horizon Tasks
June 2026 research finds computer-use agents fabricate success on 8 to 33 percent of long-horizon tasks, a failure class invisible to single-action benchmarks.
Running RAG on a Snapdragon NPU: The On-Device Retrieval Tradeoff
End-to-end RAG on the Snapdragon X Elite Hexagon NPU delivers 4x lower latency and 4x less energy than CPU with no quality loss, but soldered memory caps your index size.
models Does Attribution Patching Lie? A Fix for a Common Interpretability Shortcut
A June 2026 paper traces attribution patching's errors to downstream non-linearities and proposes a Hessian-vector-product correction that costs one extra backward pass.
models Can You Make a Multimodal Model Unlearn With Activation Steering?
Steering vectors suppress behavior at runtime without editing weights. Two 2026 papers show they transfer between models, so suppression alone is not unlearning.
models Why Pruning a Model Can Raise Its Out-of-Distribution Accuracy
Task-aware layer pruning removes distortion-amplifying layers and improves out-of-distribution accuracy, which means standard in-distribution benchmarks miss the real effect.
industry Vercel's Turborepo: Build Speed Becomes a Hosting-Vendor Feature
Vercel's Turborepo ownership routes monorepo CI caching through its hosting by default. A $9.3B valuation intensifies incentives to keep that coupling in place.
security OpenAI Frames Instruction Hierarchy as an Open Challenge, Not a Prompt-Injection Fix
OpenAI's IH-Challenge frames instruction hierarchy as an open benchmark, not a shipped defense, shifting prompt-injection protection to orchestration-layer filtering.
- infra MiniMax M3 Ships 1M Context and Desktop Control as Open Weights
- agents When AI Agents Delegate Work, Your Observability Stack Goes Blind
- models Claude Fable 5 vs Opus 4.8: When 2x Pricing Is Worth It
- models Opus 4.8 vs Opus 4.7: What Changed and What Did Not
- models Project Glasswing One Month In: AI Bug Discovery Has Outpaced the Patch Pipeline
- models AI Code Generation Benchmarks 2026: Which Model Actually Writes Better Code?
- devtools GitHub Copilot vs Cursor vs Claude Code: The 2026 AI Coding Showdown
- industry Cursor's Meteoric Rise: Inside the AI Editor Hitting $300M ARR
- infra MLX vs llama.cpp on Apple Silicon: Which Runtime to Use for Local LLM Inference
- models Chinese AI Models Compared: DeepSeek, Qwen, Kimi, Doubao, and Ernie
- devtools Claude Code in GitHub Actions: A Complete Guide to Automated PR Fixes
- devtools Claude Code Plugins: Anthropic's Official Plugin Ecosystem Explained
- industry OpenAI Offers Two Months of Free Codex to Enterprises Switching From Claude Within 30 Days
- infra Prefill-Decode Disaggregation: The Architecture Shift Redefining LLM Serving at Scale
- devtools GitHub Copilot's Opus 4.7 Multiplier: 7.5x to 15x to 27x in 60 Days
- jun 10 models Claude Fable 5 Benchmarks: What FrontierCode, CursorBench, and ViBench Show
- jun 11 agents Computer-Use Agents Fabricate Success on 8 to 33 Percent of Long-Horizon Tasks
- jun 10 infra Running RAG on a Snapdragon NPU: The On-Device Retrieval Tradeoff
- jun 10 models Does Attribution Patching Lie? A Fix for a Common Interpretability Shortcut
- jun 11 models Can You Make a Multimodal Model Unlearn With Activation Steering?
- jun 11 models Why Pruning a Model Can Raise Its Out-of-Distribution Accuracy
- jun 11 industry Vercel's Turborepo: Build Speed Becomes a Hosting-Vendor Feature
- jun 10 security OpenAI Frames Instruction Hierarchy as an Open Challenge, Not a Prompt-Injection Fix
- jun 10 devtools JetBrains Mellum2: A 12B Open-Weights Code Model for Self-Hosted Completion
- jun 09 models Do Unified Multimodal Models Actually Interleave Understanding and Generation?
- jun 09 agents Can AI Agents Share Context Without a Central Coordinator?
- jun 09 agents Why Skill Creation and Reward Optimization Collide in Agentic RL
- jun 09 infra GraphRAG vs VectorRAG: Does the Graph Index Earn Its Cost?
- jun 09 models How LLMs Track Who Did What: The Entity Rebinding Circuit
- jun 09 devtools Vercel's Chat SDK Targets Every Chat Platform From One Codebase
- jun 09 infra MiniMax M3 Ships 1M Context and Desktop Control as Open Weights
- jun 09 devtools NPM v12 Breaking Changes: Auditing Your Lockfiles Before the Upgrade
- jun 09 infra DeepSeek-V4 FlashMemory: Sparse Attention for Million-Token Context
- jun 09 agents When AI Agents Delegate Work, Your Observability Stack Goes Blind
- jun 09 models Claude Fable 5 vs Opus 4.8: When 2x Pricing Is Worth It
- jun 09 models Claude Mythos 5 Access Rules: Who Gets Project Glasswing and Why
- jun 09 policy Fable 5 Biology Classifiers: How Flagged Prompts Fall Back to Opus 4.8
- jun 09 industry Fable 5 Credit Cliff: What the June 23 Billing Shift Means for Teams
- jun 09 models Fable 5 Distillation Protection: How Anthropic Blocks Model Copying
- jun 09 models Skip Fable 5 or Upgrade? When Opus 4.8 and Sonnet 4.6 Are Still Enough
- jun 08 security Skill Injection: Hiding Undetectable Instructions in What an AI Agent Loads
- jun 08 models LLM Steganography: Can Defenders Detect Payloads Hidden in Model Output?
- jun 08 policy Who Gets to Audit Your Health Chatbot? Almost No One
- jun 08 policy Do Word-Subset Explanations Satisfy the EU AI Act's Transparency Rule?
- jun 08 infra Is Cloudflare's Bot Traffic Surge Real? The Measurement Dispute
- jun 08 industry OpenAI Pushes ChatGPT Into Compensation Data, Pressuring Mercer and Radford
- jun 08 policy Bit-Exact Inference Verification Gives AI Audits a Proof Mechanism
- jun 08 models Do Privacy Defenses Actually Protect Fine-Tuned LLMs? A New Benchmark
- jun 08 models Can You Reconstruct an LLM's System Prompt From Its Activations?
- jun 08 policy Can a Robot's Own Attention Flag Its Unsafe Actions Before They Run?
- jun 08 devtools Can a CLI Replace Screenshots for GUI Automation Agents?
- jun 08 agents Bloomberg's Pomona Makes Small Automated Code Changes, Not Big Agent PRs
- jun 08 agents Agent Tool-Gating Moves From Prompt Rules to Learned Policies
- jun 08 culture Does Debate Quality Survive When LLMs Argue Outside English?
- jun 08 security Splitting a Malicious Task Across Tool Calls Slips Past LLM Agent Guardrails
- jun 08 agents More Capable LLMs Cooperate Less in Zero-Cost Collaboration Tests
- jun 08 policy Can One Safety Adapter Realign Every Fine-Tuned LLM?
- jun 08 industry Bending Spoons Files to IPO: The App Roll-Up Playbook Goes Public
- jun 08 devtools How Cursor Uses GPT-5: What OpenAI's Writeup Tells Coding Teams
- jun 08 oss DuckDB Queries Hugging Face Parquet Files Over HTTP Without Downloads
- jun 08 models Does Softmax Normalization Limit What Attention Can Represent?
- jun 08 infra Huawei's KVarN Puts KV-Cache Quantization Inside vLLM's Backend
- jun 07 policy Can AI Be Aligned Without Modeling Human Cognitive Diversity?
- jun 07 models Can an Attacker Steal Your Model's Last Layer From Its Outputs?
- jun 07 policy Is the Pentagon's Software Pathway Ready to Buy AI Systems?