Groundy — independent coverage of developer tools, infrastructure, and platforms
Can Self-Evolving AI Agents Drift Without a Human in the Loop?
Self-evolving AI agents drift without checkpoints: 94% of reviewers miss agent sabotage, safety hardening does not transfer across domains, and stale memory degrades tasks.
culture A Covert LLM Persuasion Experiment Was Shut Down: How Far Did the Bots Get?
A 2026 analysis of the bot comment archive from a halted Reddit experiment catalogs fabricated identities and bias triggers, but early shutdown leaves harm unmeasurable.
Indexing Images for RAG: kapa.ai's Approach to Multimodal Retrieval
kapa.ai's data shows indexing image captions at ingestion adds 1-6% query overhead versus 27-51% for raw query-time vision, shifting recall risk to caption fidelity.
models Can LLMs Leak Training Data? A New Test Splits Capacity From Intent
PropMe splits memorization audits into capability and propensity, showing that single-metric leakage reports understate what targeted prompts can extract from LLMs.
policy GDPR Rectification Rights Have No Clear Owner in ML Supply Chains
A 2026 arXiv paper shows GDPR rectification and erasure rights become unenforceable across ML supply chains where no party can trace a subject's data inside trained weights.
security Benchmarking RAG Over Cyber Threat Intelligence: Where Retrieval Breaks
CTIConnect, a KDD 2026 benchmark of 1,860 QA pairs across five CTI feeds, shows retrieval quality, not model size, determines copilot accuracy across ten LLMs.
models When an AI Agent's Tools Break, Can It Recover? A New Benchmark
ToolMaze, a new arXiv benchmark, shows LLM agents' recovery rates drop 37% when tools return corrupted data, exposing a gap in how agent reliability is measured.
industry US Hyperscale Data Centers: A Carbon Audit That Recasts AI Power Costs
A facility-level audit of 403 US hyperscale centers finds 545 gCO2/kWh, 48% above the grid average. Siting in fossil-heavy regions, not PPAs, determines actual emissions.
- models Opus 4.8 vs Opus 4.7: What Changed and What Did Not
- agents Claude Code, Cursor, Copilot: How Agentic Coding Assistants Get Weaponized as Attacker Shells
- devtools Anthropic Buys Stainless: OpenAI and Google Now Depend on a Rival for SDK Tooling
- agents A New Trust Schema Exposes Why Agent Skill Registries Fail Enterprise Audit Requirements
- policy FTC's TAKE IT DOWN Act Lands May 19: 48-Hour Deepfake NCII Takedowns and No Safe Harbor
- devtools GitHub Copilot vs Cursor vs Claude Code: The 2026 AI Coding Showdown
- models AI Code Generation Benchmarks 2026: Which Model Actually Writes Better Code?
- infra MLX vs llama.cpp on Apple Silicon: Which Runtime to Use for Local LLM Inference
- devtools Claude Code in GitHub Actions: A Complete Guide to Automated PR Fixes
- models Chinese AI Models Compared: DeepSeek, Qwen, Kimi, Doubao, and Ernie
- devtools Claude Code Plugins: Anthropic's Official Plugin Ecosystem Explained
- devtools GitHub Copilot's Opus 4.7 Multiplier: 7.5x to 15x to 27x in 60 Days
- devtools GitHub Copilot Replaces Premium Request Units With Token-Metered AI Credits on June 1
- culture EU's 2027 Replaceable Battery Mandate: What It Means for Phone Buyers and Repairers Right Now
- industry Cursor's Meteoric Rise: From $300M to $3B ARR in a Year
- jun 06 agents Can Self-Evolving AI Agents Drift Without a Human in the Loop?
- jun 06 culture A Covert LLM Persuasion Experiment Was Shut Down: How Far Did the Bots Get?
- jun 06 infra Indexing Images for RAG: kapa.ai's Approach to Multimodal Retrieval
- jun 06 models Can LLMs Leak Training Data? A New Test Splits Capacity From Intent
- jun 06 policy GDPR Rectification Rights Have No Clear Owner in ML Supply Chains
- jun 06 security Benchmarking RAG Over Cyber Threat Intelligence: Where Retrieval Breaks
- jun 06 models When an AI Agent's Tools Break, Can It Recover? A New Benchmark
- jun 06 industry US Hyperscale Data Centers: A Carbon Audit That Recasts AI Power Costs
- jun 05 infra The RTX Spark Bet on Unified Memory for Local LLMs: Where Bandwidth Caps It
- jun 05 infra Reading Vercel's Fluid Compute vs Cloudflare Workers Benchmark
- jun 05 agents Fine-Tuning Multi-Agent LLM Systems: RL Enters Where Prompt Tweaks Stall
- jun 05 security Stronger Safety Alignment Made LLMs Easier to Jailbreak, Not Harder
- jun 05 security SAML Signature Bypass Is Back: Inside the SAMLStorm Vulnerability Class
- jun 05 policy When LLM Safety Lives at Inference, Not Training: A Certification Gap
- jun 05 culture Do LLMs Understand Idioms in Low-Resource Languages?
- jun 05 infra Does CUDA Tile Match Hand-Tuned Kernels on Hopper and Blackwell?
- jun 05 security SAMLStorm: The SAML Signature Bug That Forges Valid SSO Logins
- jun 05 models MiniMax M3 Bets on Sparse Attention for 1M Context. Does the Math Hold?
- jun 05 models Can One Model Handle Every CAD Task? UniCAD Tests It
- jun 05 models Do Foundation Models Actually Learn Relational Structure In-Context?
- jun 05 models Can LLMs Write Better Research Paper Titles Than Authors?
- jun 05 models Does Information-Theoretic Example Selection Beat kNN for In-Context Learning?
- jun 05 infra Pod-Level Remote Attestation in Kubernetes: Confidential Workloads on dstack
- jun 05 models Do Concept Bottleneck Model Benchmarks Measure Interpretability or Dataset Bias?
- jun 05 agents Cascading Hallucination in Agentic RAG: When One Bad Retrieval Poisons the Chain
- jun 05 security Vercel's Flags SDK Exposed Feature-Flag Definitions via CVE-2025-46332
- jun 05 models Continuous Bit-Width Quantization vs Fixed INT4: Does LiftQuant Beat Discrete?
- jun 04 models Federated Learning for Industrial IoT Anomaly Detection: The Data-Locality Tradeoff
- jun 04 infra Generating GPU Kernels for Moore Threads Silicon: Can LLMs Break CUDA Lock-In?
- jun 04 devtools Alibaba's Open Code Review Moves AI Review Into the CLI, Not the PR
- jun 04 infra Microsoft's Azure Linux Goes General-Purpose: The Container Base-Image Play
- jun 04 models Reading Failed LLM Reasoning Traces Won't Tell You Which Ones RL Can Fix
- jun 04 agents Can AI Agents Build Other Agents? The Meta-Agent Challenge Says Mostly Not Yet
- jun 04 models Can You Stitch Two Foundation Models Together Without Retraining?
- jun 04 infra Cloudflare Acquires VoidZero, the Company Behind Vite's Rust Toolchain
- jun 04 security Jailbreak Suffixes Hit Harder at Specific Token Positions, New GCG Variant Shows
- jun 04 policy When Should an LLM Forget You? A Benchmark for Deciding What Memory to Drop
- jun 04 security OpenAI Adds Lockdown Mode to ChatGPT, Shifting Prompt-Injection Risk to Users
- jun 04 policy When RL Training Rewards Capability-Seeking: A New Alignment Risk
- jun 04 models Do Reasoning LLMs Waste Tokens? OckBench Tries to Measure It
- jun 04 security Activation Steering Was Sold as LLM Control. New Work Makes It an Attack Surface
- jun 04 culture Can Teaching Logical Fallacies Inoculate People Against AI Misinformation?
- jun 04 devtools Vercel Ships Experimental Native CLI Binaries to Cut the Node Startup Tax
- jun 04 security Catching LLM Agents Leaking Credentials From Their Own Activations
- jun 04 policy Refusal Steering Targets Individual Experts in MoE LLMs
- jun 04 infra Putting a Datacenter V100 in a Gaming PC: The Local LLM Math
- jun 04 devtools Vercel Rebuilds Its Marketplace CLI for Agents Instead of Humans
- jun 04 security The 2026 npm Attacks Proved AI Coding Assistants Are a Supply-Chain Target
- jun 03 security ChatGPT's New Lockdown Mode Borrows Apple's Name for a Prompt-Injection Kill Switch
- jun 03 agents When MCP Tool Descriptions Don't Match the Code, Agents Trust the Lie