agents & frameworks

45 articles

Multi-Agent LLM Coordination: Why Attention Steering Beats Full Broadcast

Multi-agent LLM systems that broadcast every message to every peer waste tokens and lose accuracy. Agent-Radar steers attention by relevance for 7.64-point gains.

may 28

agents

DataClawBench: AI Agents Fail at Exploratory Financial Analysis Across 492 Tasks

DataClawBench finds eight frontier AI agents reliably fail at exploratory financial analysis across 492 tasks, breaking at hypothesis generation rather than query execution.

may 28

agents

Agentic RAG Has a Credit-Assignment Problem That Subgoaling Tries to Fix

APEX-Searcher splits agentic RAG into separate planning and retrieval training stages so teams can pinpoint whether a wrong answer came from a bad plan or a bad fetch.

agents

SkillOpt Treats Agent Skill Libraries as an Executive Scheduling Problem, Not a Memory Bank

SkillOpt treats agent skills as trainable state with deletion and budgeted edits, sweeping 52 of 52 benchmarks. Append-only registries in agent frameworks are a design error.

agents

Claude Code Dynamic Workflows: Spawning 100 Parallel Subagents on Opus 4.8

Dynamic workflows lets Claude Code run hundreds of parallel subagents in one session. Here is how map-reduce and fan-out patterns work on Opus 4.8.

agents

How Opus 4.8 Honesty Prevents Cascade Failures in Agentic Loops

Opus 4.8 flags uncertainties more often and makes fewer unsupported claims, reducing hallucinated API calls and memory drift in 100+ turn autonomous workflows.

agents

Penetration Testing Multi-Agent LLM Systems: A Failure Catalog Vendors Don't Document

The first independent pen tests of proprietary agent deployments found preventable classical vulnerabilities, not novel AI flaws, compounding across multi-agent topologies.

agents

Claude Code, Cursor, Copilot: How Agentic Coding Assistants Get Weaponized as Attacker Shells

Indirect prompt injection through repo artifacts turns coding agents into attacker shells, exploiting the file-write and shell privileges agents already hold.


about this beat: editorial framing

Agent frameworks ship faster than the rigor operators need to run them. Vendor docs promise orchestration, memory, and tool use; academic benchmarks and production post-mortems keep exposing the same structural gaps: diversity collapse in multi-agent ideation, hallucination amplification across consensus topologies, missing per-step rationale traces, role-based retry losing to graph-state failure isolation on long tasks, and configuration surfaces that punish static templates. This beat covers that delta.

The second through-line is governance and trust. Skill registries, tool-use protocols, and capability manifests are accumulating faster than auditable contracts for them. Trust schemas, contractual skill specs, and information-flow controls are arriving as bolt-ons rather than primitives, while the infrastructure layer — sandbox execution, private networking, agent memory — keeps absorbing functionality the framework layer used to own. The question of where the agent stack actually lives, and who is liable when it misbehaves, stays unresolved.

Coverage is comparative and opinionated. When a benchmark or paper exposes a gap that a major framework cannot close without redesign, that gets named. When a vendor ships governance theater rather than enforcement, that gets named too. The goal is to help readers pick stacks that survive contact with production, not to build a taxonomy of every framework release.
