Multi-Agent LLM Coordination: Why Attention Steering Beats Full Broadcast

Multi-agent LLM systems have a scaling problem, and it is not the one most people assume. The bottleneck is not how capable each agent is, but how much irrelevant context each one ingests from every other agent in the roster. Two papers released in the same week of May 2026, plus a March submission whose findings point in the same direction, converge on the same diagnosis: broadcast-everything message topologies waste tokens, degrade accuracy, and break coherence in ways that more capable individual agents cannot fix.

Why full broadcast breaks under load

The default architecture in most multi-agent frameworks is simple: every agent sends every message to every peer. This works for three-agent demos. At five or more agents with multiple interaction rounds, the history each agent must process grows with the product of roster size and rounds, and the total volume delivered across the system grows with the square of the roster size. Agent-Radar identifies the core symptom: relevant information gets diluted by irrelevant context as conversations lengthen, shifting the bottleneck from agent capability to message routing.
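To make the growth concrete, here is a minimal counting sketch. It assumes each agent posts one message per round and that every message goes to every other agent; the function name and parameters are illustrative, not taken from any framework.

```python
def broadcast_load(num_agents: int, rounds: int, msgs_per_agent_per_round: int = 1):
    """Count messages under full broadcast (every agent sends to every peer)."""
    # Peer messages sitting in one agent's history after `rounds` rounds.
    per_agent_history = (num_agents - 1) * msgs_per_agent_per_round * rounds
    # Message deliveries summed over every agent in the roster.
    total_deliveries = num_agents * per_agent_history
    return per_agent_history, total_deliveries


for n in (3, 5, 10):
    history, total = broadcast_load(n, rounds=10)
    print(f"{n} agents: {history} peer messages per agent, {total} deliveries")
```

Going from three agents to ten over the same ten rounds raises per-agent history from 20 to 90 messages and total deliveries from 60 to 900. This counts messages only; token cost also depends on message length, which the sketch ignores.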

The token cost is obvious. The accuracy cost is less discussed but likely larger. When an agent must reason over hundreds of messages from peers, most of which are irrelevant to its current task, signal-to-noise ratio drops and task performance follows.

RedundancyBench quantifies one dimension of this waste: even the best-performing method achieves only 24.88% accuracy at detecting redundant steps in agent trajectories. Some methods score worse than random guessing. Agents are not just ingesting irrelevant peer messages; they cannot reliably identify which of their own operations are unnecessary.

Agent-Radar: attention decay as routing

Agent-Radar, submitted May 28, 2026 by Hongxiang Zhang, Yuan Tian, and Tianyi Zhang, proposes a training-free fix. The method applies a combined temporal and spatial decay mechanism to each agent’s context window, steering attention toward messages that are both recent and relevant to the current task. The result: up to 7.64 absolute-point gains over state-of-the-art methods across five benchmarks.

Both decay axes do distinct work. Temporal decay down-weights messages by how many rounds ago they arrived, on the assumption that stale exchanges matter less once the conversation has moved on. Spatial decay down-weights messages by semantic distance from the agent’s current sub-task, so a message about a different branch of the problem gets less weight even if it just arrived. The reported ablation shows that removing either axis degrades the result, and that the two are most effective combined rather than stacked. That is the load-bearing claim: this is not just recency filtering with extra steps, because pure recency throws away an early message that is still the most relevant one in the buffer. [Updated June 2026]
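A rough sketch of how two decay axes can interact follows. It is not the paper's formula: the exponential temporal term, the cosine-similarity spatial term, the choice to multiply them, the `temporal_rate` value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    round_sent: int
    embedding: list[float]  # hypothetical: any sentence-embedding vector


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def decay_weight(msg, current_round, task_embedding, temporal_rate=0.3):
    # Temporal axis: exponential decay in rounds elapsed (assumed form).
    temporal = math.exp(-temporal_rate * (current_round - msg.round_sent))
    # Spatial axis: similarity to the current sub-task, clipped at zero.
    spatial = max(0.0, cosine(msg.embedding, task_embedding))
    # Multiplying means a message needs some score on both axes to keep weight.
    return temporal * spatial


task = [1.0, 0.0]
old_relevant = Message("early constraint", round_sent=0, embedding=[1.0, 0.0])
recent_offtopic = Message("other branch", round_sent=5, embedding=[0.1, 1.0])

w_old = decay_weight(old_relevant, current_round=5, task_embedding=task)
w_recent = decay_weight(recent_offtopic, current_round=5, task_embedding=task)
print(w_old > w_recent)
```

In this toy setup the five-round-old message that matches the task keeps more weight than the fresh message about another branch, which is the case a pure recency filter would get wrong.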

Two properties make Agent-Radar worth tracking. First, it requires no additional training; the decay weights are applied at inference time. Second, the reported gains hold as agent count and interaction rounds increase, which is exactly the regime where broadcast topologies collapse.

AMRO-S: routing agents like ants

AMRO-S approaches the same problem from a different angle. Rather than filtering context within an agent’s history, AMRO-S models multi-agent routing as a semantic-conditioned path selection problem. It uses a supervised fine-tuned small language model for intent inference and task-specific pheromone specialists, borrowed from ant colony optimization, to route queries to the right agent.

The ant-colony framing sounds gimmicky, but the mechanism is concrete: pheromone trails encode which agents have historically performed well on which task types, and semantic conditioning prevents cross-task interference where routing decisions for one task type bleed into another. AMRO-S, submitted March 13, 2026, improves the quality-cost trade-off over strong baselines on five benchmarks and reports up to a 4.7x inference speedup, though the specific baselines are not detailed in the available abstract. [Updated June 2026]
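The pheromone idea can be sketched in a few lines. This is a generic ant-colony-style table, not the AMRO-S implementation: the class name, the evaporation rate, the reward scale, and the greedy selection rule are assumptions, and `task_type` stands in for whatever the intent-inference model would output.

```python
from collections import defaultdict


class PheromoneRouter:
    """Hypothetical router keeping one pheromone table per task type."""

    def __init__(self, agents, evaporation=0.1, initial=1.0):
        self.agents = list(agents)
        self.evaporation = evaporation
        # Separate tables per task type, so feedback on one task type
        # never changes routing for another.
        self.trails = defaultdict(lambda: {a: initial for a in self.agents})

    def route(self, task_type):
        # Greedy pick for reproducibility; classic ant colony optimization
        # samples in proportion to pheromone strength instead.
        trails = self.trails[task_type]
        return max(self.agents, key=lambda a: trails[a])

    def update(self, task_type, agent, reward):
        trails = self.trails[task_type]
        # Evaporate all trails for this task type, then reinforce the one used.
        for a in trails:
            trails[a] *= 1.0 - self.evaporation
        trails[agent] += reward


router = PheromoneRouter(["coder", "searcher"])
router.update("code", "coder", reward=1.0)
router.update("search", "searcher", reward=1.0)
print(router.route("code"), router.route("search"))
```

Keying the table by task type is the simplest way to get the isolation the paper attributes to semantic conditioning; a real system would also need a policy for task types it has never seen.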

The convergence with Agent-Radar is the point. One paper filters context at the agent level; the other routes queries at the system level. Both reject the broadcast-everything default.

When broadcast is fine, and what steering costs

Steering is not free, and the case for it is weakest exactly where most demos run. With three agents and one or two rounds, full broadcast costs almost nothing and skips an entire failure surface: a relevance filter that drops the one message an agent actually needed. Attention steering trades a token-and-noise problem for a routing-error problem, and a misrouted message is harder to debug than a verbose context window because the information was present and then thrown away before the agent ever saw it. The break-even point depends on the product of roster size and rounds. Below it, broadcast wins on simplicity; above it, the quadratic growth in pairwise context makes the filter pay for itself.

There is also a coordination tax that steering does not address and can worsen. Council Mode cut multi-agent hallucination by 35.9% on HaluEval, but at 4.2x the token cost of a single pass, which is the opposite trade from Agent-Radar’s token savings. The two are not in conflict so much as they answer different questions: steering reduces the context each agent ingests, while deliberation protocols add structured rounds to catch errors. Stack them naively and the deliberation rounds regenerate the broadcast volume that steering was meant to suppress. Whether the relevance filter survives several rounds of cross-examination is exactly the question neither paper tests.

Topology is the deeper variable, and it is not binary. Full broadcast and point-to-point routing are two ends of a spectrum that includes blackboard architectures, hierarchical supervisors, and gossip protocols. The question of whether agents even need a shared message bus is itself contested: work on decentralized context sharing shows agents can coordinate through local exchanges without a central coordinator, which moves the routing decision from a global broadcast graph to a set of pairwise edges. Agent-Radar’s decay function is one way to weight those edges; AMRO-S’s pheromone trails are another. Neither paper claims its topology is optimal, and the honest reading of this cluster is that the field has agreed broadcast is wrong without yet agreeing on what replaces it.

One caution applies to all of these methods. The decay weights, pheromone trails, and deliberation schedules are themselves hyperparameters, and the papers report results from the teams that tuned them. A relevance threshold that helps on multi-hop QA may hurt on a task where late context is decisive. Until these methods are stress-tested by groups that did not design them, the gains should be read as upper bounds for a well-tuned deployment, not defaults you can drop into a pipeline and expect to hold.

Agents that cannot audit themselves

The RedundancyBench results deserve a closer look, because they describe a failure mode that selective routing alone may not fix.

RedundancyBench tests whether agent systems can identify which steps in their own trajectories are redundant. The benchmark is built from roughly 200 agent trajectories with more than 8,000 human-annotated steps, each labeled redundant or necessary. The 24.88% top score means the best current method gets three out of four redundancy judgments wrong. If agents cannot recognize wasted work in their own execution traces, adding more agents to the system multiplies waste rather than dividing it.

This has direct implications for anyone building multi-agent pipelines. If your architecture assumes agents will self-prune unnecessary steps, RedundancyBench suggests that assumption is wrong. External mechanisms for trajectory auditing are necessary, and current methods for building those mechanisms are themselves unreliable.

What framework builders should change

The practical takeaway from this cluster of papers is straightforward. Framework builders, specifically those shipping multi-agent orchestration layers like CrewAI and AutoGen, should treat inter-agent message routing as a first-class design surface. The broadcast topology that ships by default is convenient for demos and expensive in production.

Two specific changes follow from the evidence:

Make routing topology configurable. Users should be able to specify which agents receive which message types, rather than relying on global fan-out. Agent-Radar’s temporal-spatial decay shows that even a simple relevance filter helps.
Add trajectory auditing. RedundancyBench indicates that agents cannot reliably self-audit. Frameworks should expose hooks for external redundancy detection, even if current methods reach at best 24.88% accuracy.

These do not require new research. They require treating the message graph between agents as an engineering concern rather than an afterthought.
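As a sketch of what both changes could look like at the API level, here is a hypothetical message bus with per-type subscriptions and an audit hook. It does not reflect the CrewAI or AutoGen APIs; every class and method name below is invented for illustration.

```python
from collections import defaultdict


class MessageBus:
    """Hypothetical bus: agents subscribe to message types instead of
    receiving every message by global fan-out."""

    def __init__(self):
        self.subscriptions = defaultdict(set)  # message type -> agent names
        self.inboxes = defaultdict(list)       # agent name -> payloads
        self.audit_hooks = []                  # callables(sender, type, payload)

    def subscribe(self, agent, msg_type):
        self.subscriptions[msg_type].add(agent)

    def add_audit_hook(self, hook):
        self.audit_hooks.append(hook)

    def publish(self, sender, msg_type, payload):
        # External auditors see every message, whatever the routing.
        for hook in self.audit_hooks:
            hook(sender, msg_type, payload)
        # Deliver only to subscribers of this message type, never the sender.
        for agent in sorted(self.subscriptions[msg_type]):
            if agent != sender:
                self.inboxes[agent].append(payload)


bus = MessageBus()
bus.subscribe("planner", "plan")
bus.subscribe("coder", "plan")
bus.subscribe("reviewer", "diff")

trace = []
bus.add_audit_hook(lambda sender, msg_type, payload: trace.append((sender, msg_type)))
bus.publish("planner", "plan", "split into two tasks")
print(bus.inboxes["coder"], bus.inboxes["reviewer"], trace)
```

The hook receives the full message stream, so a redundancy detector can sit outside the agents and be swapped out as detection methods improve.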

What the MINDGAMES arena revealed about agent brittleness

The MINDGAMES arena, which evaluated 944 submitted agents from 76 teams across games like Colonel Blotto, Iterated Prisoner’s Dilemma, Codenames, and Secret Mafia, provides a complementary data point. The top-performing systems relied on explicit structural scaffolding rather than emergent reasoning from the agents themselves. Brittle rule adherence remained a major bottleneck across submissions.

This reinforces the routing argument from a different direction. If agents struggle to follow rules in structured game environments, adding more agents and broadcasting more messages is unlikely to produce emergent coordination. The scaffolding matters more than the agent count.

What to watch next

Agent-Radar and RedundancyBench landed on the same day (May 28, 2026), while AMRO-S has been available since March. The field is converging on inter-agent communication efficiency as a research priority, which suggests practitioner tooling will follow.

The open question is whether routing improvements get absorbed into the major frameworks or remain separate plug-in layers. Agent-Radar’s training-free approach makes it cheap to adopt. AMRO-S’s reliance on a fine-tuned routing model makes it harder to integrate but potentially more effective for domain-specific deployments. RedundancyBench points to a parallel gap: until agents can identify redundant steps in their own trajectories with better than 25% accuracy, even well-routed multi-agent systems will carry unnecessary computational overhead.

Frequently Asked Questions

Does selective routing fix the problem of agents producing contradictory outputs?

No. A separate study, Locally Coherent, Globally Incoherent, found that across 1,876 ensemble cliques on a four-LLM panel, 33 to 94 percent violated basic probability axioms even when each individual agent was locally coherent. Three intuitive LLM-side fixes (retrieval augmentation, partition-aware prompting, and a separate aggregator LLM) each failed or made things worse. The fixes that did work were not prompt-level at all: a deterministic hierarchical projection that snaps the joint distribution back onto the coherent set, and a sequential e-process for monitoring coherence over time. That is the lesson for routing. Routing decides which messages reach which agent. It does not reconcile logical contradictions among jointly incoherent conclusions; that requires a separate, non-LLM reconciliation step. [Updated June 2026]
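A toy example of joint incoherence, and of a deterministic non-LLM repair, is sketched below. It is far simpler than the hierarchical projection the study describes: it only checks that estimates over a mutually exclusive, exhaustive set of outcomes sum to one, and repairs them by renormalizing. The outcome names and numbers are made up.

```python
def partition_violation(estimates: dict[str, float]) -> float:
    """Distance from 1.0 of the summed estimates over a partition of outcomes."""
    return abs(sum(estimates.values()) - 1.0)


def renormalize(estimates: dict[str, float]) -> dict[str, float]:
    """Deterministic stand-in for a projection: rescale so the sum is 1."""
    total = sum(estimates.values())
    if total <= 0:
        raise ValueError("estimates must have a positive sum")
    return {outcome: p / total for outcome, p in estimates.items()}


# Each value comes from a different agent and is a valid probability on its
# own, but together they claim more than 100% of the outcome space.
panel = {"win": 0.6, "draw": 0.3, "loss": 0.4}

print(round(partition_violation(panel), 3))
repaired = renormalize(panel)
print(round(partition_violation(repaired), 3))
```

No routing decision changes this outcome: all three estimates could have been delivered perfectly and the panel would still be incoherent until a reconciliation step runs.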

Agent-Radar is training-free. Does that make it easier to adopt than AMRO-S?

Agent-Radar applies decay weights at inference time with no fine-tuning, so it can slot into an existing pipeline without retraining. AMRO-S requires a supervised fine-tuned small language model for intent inference plus per-task pheromone specialists, which adds a model to train and maintain. The tradeoff: AMRO-S can specialize to domain-specific deployments where a generic recency-and-relevance decay function underperforms, particularly in workflows where context usefulness depends on domain semantics rather than how recently a message was sent.

What does the 24.88 percent RedundancyBench score imply for production token budgets?

If the best method gets only about one in four redundancy judgments right, and some methods score below random, any pipeline that relies on agents self-pruning has no dependable way to remove wasted steps. The score measures detection accuracy, not the share of steps that are redundant, so it does not by itself say how many extra steps a pipeline executes. The operational consequence is that token budgets should be set assuming significant waste, and external trajectory auditing should be treated as a dedicated cost center rather than a capability agents will develop on their own.

Could the routing layer itself become a bottleneck as agent count grows?

Agent-Radar computes relevance scores across message pairs and AMRO-S updates pheromone trails per routing decision. Both add a step whose cost rises with roster size. The MINDGAMES arena found that top systems relied on hand-engineered scaffolding rather than emergent reasoning, which suggests routing logic will also resist self-optimization. If the routing computation grows faster than linearly with agent count, the meta-orchestration layer becomes the new bottleneck, and the same poor redundancy detection that plagues agents could apply to routing decisions themselves.
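A back-of-envelope comparison shows why the scaling shape matters. The two cost models below are assumptions rather than measurements of either system: one scores every pair of messages, the other scores each message once against a single task vector, and both assume one message per agent per round.

```python
def pairwise_scoring_ops(num_agents: int, rounds: int) -> int:
    """Score every unordered pair of messages in the shared history."""
    messages = num_agents * rounds
    return messages * (messages - 1) // 2


def per_message_scoring_ops(num_agents: int, rounds: int) -> int:
    """Score each message once against one task vector."""
    return num_agents * rounds


def growth_when_doubling_agents(cost_fn, num_agents: int, rounds: int) -> float:
    # Ratio of cost at 2n agents to cost at n agents, rounds held fixed.
    return cost_fn(2 * num_agents, rounds) / cost_fn(num_agents, rounds)


print(growth_when_doubling_agents(pairwise_scoring_ops, 5, 10))
print(growth_when_doubling_agents(per_message_scoring_ops, 5, 10))
```

Under the pairwise model, doubling the roster roughly quadruples the scoring work, while the per-message model only doubles it. Which shape a given routing layer actually has is worth measuring before scaling the roster.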