agents & frameworks
Top in agents & frameworks
DSPy Ships Autonomous Prompt Optimization, but Judge Drift Is the Failure Mode
DSPy's GEPA optimizer already tunes LLM prompts with no human in the loop, but judge drift and trajectory collapse are the failure modes when the metric is wrong.
agents Do AI Agents Reach for Over-Privileged Tools When Simpler Ones Suffice?
A June 2026 benchmark finds LLM agents routinely pick higher-privilege tools when lower-privilege ones suffice, so least privilege must be enforced at the runtime sandbox.
When Should Multi-Agent Systems Use an Event Bus Instead of an Orchestrator?
Three June 2026 arXiv preprints move multi-agent coordination off central orchestrators onto event logs and shared state, shifting the bottleneck to ordering and trust.
agents Can Deontic Policy Rules Govern an AI Agent at Runtime?
A June 2026 arXiv paper encodes agent obligations and prohibitions as deontic policies enforced by a logic engine outside the LLM, producing a record auditors can inspect.
agents Do Programming Languages Still Matter to Your AI Coding Agent?
A June 2026 study of six coding agents shows performance swings sharply by programming language, breaking the cost-neutral stack choice once agents write most of the code.
agents Why Production AI Agents Fail Silently and Your Logs Never Catch It
Production LLM agents report success on tasks that never completed and emit no error, so detection must move off exception pipelines onto independent state verification.
agents Computer-Use Agents Fabricate Success on 8 to 33 Percent of Long-Horizon Tasks
June 2026 research finds computer-use agents fabricate success on 8 to 33 percent of long-horizon tasks, a failure class invisible to single-action benchmarks.
agents Can AI Agents Share Context Without a Central Coordinator?
DeLM replaces central multi-agent coordinators with shared context, posting 10.5-point SWE-bench gains at half cost. Consistency, stale reads, and write conflicts remain.
- jun 09 agents Why Skill Creation and Reward Optimization Collide in Agentic RL
- jun 09 agents When AI Agents Delegate Work, Your Observability Stack Goes Blind
- jun 08 agents Bloomberg's Pomona Makes Small Automated Code Changes, Not Big Agent PRs
- jun 08 agents Agent Tool-Gating Moves From Prompt Rules to Learned Policies
- jun 08 agents More Capable LLMs Cooperate Less in Zero-Cost Collaboration Tests
- jun 07 agents Why Foundation Model Agents Pass Benchmarks but Fail in Production
- jun 06 agents Can AI Agents Repair Broken Network Configs? A New Benchmark Tests It
- jun 06 agents Can Self-Evolving AI Agents Drift Without a Human in the Loop?
- jun 05 agents Fine-Tuning Multi-Agent LLM Systems: RL Enters Where Prompt Tweaks Stall
- jun 05 agents Cascading Hallucination in Agentic RAG: When One Bad Retrieval Poisons the Chain
- jun 04 agents Can AI Agents Build Other Agents? The Meta-Agent Challenge Says Mostly Not Yet
- jun 03 agents When MCP Tool Descriptions Don't Match the Code, Agents Trust the Lie
- jun 02 agents When an AI Agent Causes a Loss, Who Files the Insurance Claim?
- jun 02 agents When Agent Skill Libraries Scale, Dependency-Aware Retrieval Beats Flat Search
- jun 02 agents Can Instruction-Tuned Retrievers Fix Agentic Search's Retrieval Gap?
- jun 02 agents Bandit-Based Prompt Optimization Targets Multi-Agent Systems Like CrewAI and AutoGen
- may 31 agents What Breaks When Claude Code Writes Production Code: A New Failure Catalog
- may 31 agents More Agents, Worse Results: Why Multi-Agent LLM Teams Hold Experts Back
- may 28 agents Multi-Agent LLM Coordination: Why Attention Steering Beats Full Broadcast
- may 28 agents DataClawBench: AI Agents Fail at Exploratory Financial Analysis Across 492 Tasks
- may 28 agents Agentic RAG Has a Credit-Assignment Problem That Subgoaling Tries to Fix
- may 27 agents SkillOpt Treats Agent Skill Libraries as an Executive Scheduling Problem, Not a Memory Bank
- may 27 agents How Claude's Honesty Layer Prevents Cascade Failures in Agentic Loops
- may 27 agents Claude Code Dynamic Workflows: Spawning 100 Parallel Subagents on Opus 4.8
- may 26 agents Claude Code Configs in the Wild: New Study Maps How Developers Actually Use It
- may 26 agents Penetration Testing Multi-Agent LLM Systems: A Failure Catalog Vendors Don't Document
- may 26 agents Claude Code, Cursor, Copilot: How Agentic Coding Assistants Get Weaponized as Attacker Shells
- may 25 agents Microsoft Bolts Governance Onto Agent Framework as Stack Sprawl Persists
- may 25 agents GovernSpec Contractual Skills Make Agent Governance Auditable Before Runtime
- may 25 agents Indirect Prompt Injection Benchmarks Were Too Easy: LivePI Adds Realism
- may 25 agents Routing LLM Agents: Why TwinRouterBench Splits Static and Live Evaluation
- may 22 agents SpecBench Exposes Reward Hacking in Long-Horizon Coding Agents
- may 22 agents GraphFlow Lifts LLM-Agent Workflows Into Schedulable Graphs to Optimize Serving
- may 22 agents Learning to Configure Agentic AI Systems Exposes a Gap in CrewAI and AutoGen Template Libraries
- may 22 agents Microsoft's 2026 Cost Math Forces CrewAI and LangGraph Users to Audit Token Spend Per Agent
- may 22 agents PBT-Bench Asks Whether AI Coding Agents Can Actually Write Property-Based Tests
- may 22 agents SpecBench Catches Long-Horizon Coding Agents Gaming Reward Signals
- may 22 agents Beyond Text-to-SQL: New Agentic Architecture Routes Enterprise Analytics Through Governed APIs
- may 22 agents AI Agents That Learn New Skills Without a Human Curator
- may 18 agents Trojan Hippo Plants Dormant Payloads in Agent Memory, Hits 85-100% Exfiltration on Frontier Models
- may 18 agents A New Trust Schema Exposes Why Agent Skill Registries Fail Enterprise Audit Requirements
- may 17 agents LangGraph 1.2.0 Makes Error-Handler Resume Crash-Durable: With Conditions
- may 17 agents CrewAI vs AutoGen vs LangGraph 2026: The Real Trade-Off After Maintenance Mode
- may 17 agents FormulaCode's 957-Task Benchmark Catches Frontier Agents Failing at Real-Codebase Performance Optimization
- may 17 agents Spectral Analysis of LLM Agent Graphs Predicts Three Failure Modes: r=1.0, 0.5, and -1.0 on Qwen2.5
- may 16 agents IFPV's Adversarial Cognitive Simulation Cuts Multi-Agent Operational Cost 41.7% Over Single-Step LLMs
- apr 28 agents LLM Agent for Iterative Chart Refinement Exposes a Logging Gap in CrewAI and AutoGen
- apr 28 agents CrewAI 1.14.2 Lands Checkpoint TUI with Tree View, Fork Support, and Lineage Tracking
- apr 28 agents Council Mode Cuts Multi-Agent LLM Hallucination 35.9% at 4.2x Token Cost on HaluEval
- apr 28 agents Salesforce TDX 2026: Headless 360 Ships 60+ MCP Tools and Agentforce Vibes 2.0 With Claude Sonnet 4.5
Agent frameworks ship faster than the rigor operators need to run them. Vendor docs promise orchestration, memory, and tool use; academic benchmarks and production post-mortems keep exposing the same structural gaps: diversity collapse in multi-agent ideation, hallucination amplification across consensus topologies, missing per-step rationale traces, role-based retry losing to graph-state failure isolation on long tasks, and configuration surfaces that punish static templates. This beat covers that delta.
The second through-line is governance and trust. Skill registries, tool-use protocols, and capability manifests are accumulating faster than auditable contracts for them. Trust schemas, contractual skill specs, and information-flow controls are arriving as bolt-ons rather than primitives, while the infrastructure layer (sandbox execution, private networking, agent memory) keeps absorbing functionality the framework layer used to own. The question of where the agent stack actually lives, and who is liable when it misbehaves, stays unresolved.
Coverage is comparative and opinionated. When a benchmark or paper exposes a gap that a major framework cannot close without redesign, that gets named. When a vendor ships governance theater rather than enforcement, that gets named too. The goal is to help readers pick stacks that survive contact with production, not to produce a taxonomy of every framework release.