agents & frameworks
Top in agents & frameworks
Multi-Agent LLM Coordination: Why Attention Steering Beats Full Broadcast
Multi-agent LLM systems that broadcast every message to every peer waste tokens and lose accuracy. Agent-Radar steers attention by relevance for 7.64-point gains.
DataClawBench: AI Agents Fail at Exploratory Financial Analysis Across 492 Tasks
DataClawBench finds eight frontier AI agents reliably fail at exploratory financial analysis across 492 tasks, breaking at hypothesis generation rather than query execution.
Agentic RAG Has a Credit-Assignment Problem That Subgoaling Tries to Fix
APEX-Searcher splits agentic RAG into separate planning and retrieval training stages so teams can pinpoint whether a wrong answer came from a bad plan or a bad fetch.
SkillOpt Treats Agent Skill Libraries as an Executive Scheduling Problem, Not a Memory Bank
SkillOpt treats agent skills as trainable state with deletion and budgeted edits, sweeping 52 of 52 benchmarks. Append-only registries in agent frameworks are a design error.
Claude Code Dynamic Workflows: Spawning 100 Parallel Subagents on Opus 4.8
Dynamic workflows lets Claude Code run hundreds of parallel subagents in one session. Here is how map-reduce and fan-out patterns work on Opus 4.8.
How Opus 4.8 Honesty Prevents Cascade Failures in Agentic Loops
Opus 4.8 flags uncertainties more often and makes fewer unsupported claims, reducing hallucinated API calls and memory drift in 100+ turn autonomous workflows.
Penetration Testing Multi-Agent LLM Systems: A Failure Catalog Vendors Don't Document
The first independent pen tests of proprietary agent deployments found preventable classical vulnerabilities, not novel AI flaws, compounding across multi-agent topologies.
Claude Code, Cursor, Copilot: How Agentic Coding Assistants Get Weaponized as Attacker Shells
Indirect prompt injection through repo artifacts turns coding agents into attacker shells, exploiting the file-write and shell privileges agents already hold.
- may 26 agents Claude Code Configs in the Wild: New Study Maps How Developers Actually Use It
- may 25 agents Microsoft Bolts Governance Onto Agent Framework as Stack Sprawl Persists
- may 25 agents GovernSpec Contractual Skills Make Agent Governance Auditable Before Runtime
- may 25 agents Indirect Prompt Injection Benchmarks Were Too Easy: LivePI Adds Realism
- may 25 agents Routing LLM Agents: Why TwinRouterBench Splits Static and Live Evaluation
- may 22 agents SpecBench Exposes Reward Hacking in Long-Horizon Coding Agents
- may 22 agents GraphFlow Lifts LLM-Agent Workflows Into Schedulable Graphs to Optimize Serving
- may 22 agents Learning to Configure Agentic AI Systems Exposes a Gap in CrewAI and AutoGen Template Libraries
- may 22 agents Microsoft's 2026 Cost Math Forces CrewAI and LangGraph Users to Audit Token Spend Per Agent
- may 22 agents PBT-Bench Asks Whether AI Coding Agents Can Actually Write Property-Based Tests
- may 22 agents SpecBench Catches Long-Horizon Coding Agents Gaming Reward Signals
- may 22 agents Beyond Text-to-SQL: New Agentic Architecture Routes Enterprise Analytics Through Governed APIs
- may 22 agents AI Agents That Learn New Skills Without a Human Curator
- may 18 agents Trojan Hippo Plants Dormant Payloads in Agent Memory, Hits 85-100% Exfiltration on Frontier Models
- may 18 agents A New Trust Schema Exposes Why Agent Skill Registries Fail Enterprise Audit Requirements
- may 17 agents LangGraph 1.2.0 Makes Error-Handler Resume Crash-Durable: With Conditions
- may 17 agents CrewAI vs AutoGen vs LangGraph 2026: The Real Trade-Off After Maintenance Mode
- may 17 agents FormulaCode's 957-Task Benchmark Catches Frontier Agents Failing at Real-Codebase Performance Optimization
- may 17 agents Spectral Analysis of LLM Agent Graphs Predicts Three Failure Modes: r=1.0, 0.5, and -1.0 on Qwen2.5
- may 16 agents IFPV's Adversarial Cognitive Simulation Cuts Multi-Agent Operational Cost 41.7% Over Single-Step LLMs
- apr 28 agents LLM Agent for Iterative Chart Refinement Exposes a Logging Gap in CrewAI and AutoGen
- apr 28 agents CrewAI 1.14.2 Lands Checkpoint TUI with Tree View, Fork Support, and Lineage Tracking
- apr 28 agents Council Mode Cuts Multi-Agent LLM Hallucination 35.9% at 4.2x Token Cost on HaluEval
- apr 28 agents Salesforce TDX 2026: Headless 360 Ships 60+ MCP Tools and Agentforce Vibes 2.0 With Claude Sonnet 4.5
- apr 23 agents Cloudflare Agents Week Moved Sandbox Execution, Private Networking, and Memory to Network Primitives
- apr 22 agents Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling, Not Topology, Breaks Open-Ended Ideation
- apr 22 agents Nous Research's Hermes Ships Persistent Memory and Auto-Skill Capture: CrewAI and AutoGen Must Reconsider
- apr 21 agents ml-intern's 32% GPQA Gain on One H100 Exposes the Assumption That Post-Training Still Needs a Human Researcher
- mar 14 agents AI Agents That Actually Learn: The Architecture Behind Hindsight Memory
- mar 26 agents InsForge: The Backend Framework Built for Agentic Applications
- feb 27 agents Superpowers: The Agentic Framework Replacing Your Dev Process
- feb 26 agents How AI Agents Remember: Memory Architectures That Work
- feb 11 agents CrewAI vs AutoGen: A Developer's Guide to Multi-Agent AI Frameworks
- feb 17 agents Function Calling Best Practices: LLMs That Actually Use APIs Correctly
- feb 10 agents How to Build Your First Autonomous Coding Agent with OpenHands SDK
- feb 10 agents Pydantic AI vs LangChain: A Developer's Guide to the New Generation of Agent Frameworks
- feb 11 agents Are AI-Generated PRs Killing Open Source?
Agent frameworks ship faster than the rigor operators need to run them. Vendor docs promise orchestration, memory, and tool use; academic benchmarks and production post-mortems keep exposing the same structural gaps: diversity collapse in multi-agent ideation, hallucination amplification across consensus topologies, missing per-step rationale traces, role-based retry losing to graph-state failure isolation on long tasks, and configuration surfaces that punish static templates. This beat covers that delta.
The second through-line is governance and trust. Skill registries, tool-use protocols, and capability manifests are accumulating faster than auditable contracts for them. Trust schemas, contractual skill specs, and information-flow controls are arriving as bolt-ons rather than primitives, while the infrastructure layer (sandbox execution, private networking, agent memory) keeps absorbing functionality the framework layer used to own. The question of where the agent stack actually lives, and who is liable when it misbehaves, stays unresolved.
Coverage is comparative and opinionated. When a benchmark or paper exposes a gap that a major framework cannot close without redesign, that gets named. When a vendor ships governance theater rather than enforcement, that gets named too. The goal is to help readers pick stacks that survive contact with production, not to catalog every framework release.