Multi-agent LLM systems have a scaling problem, and it is not the one most people assume. The bottleneck is not how capable each agent is, but how much irrelevant context each one ingests from every other agent in the roster. Two papers released in the same week of May 2026, plus a March submission, converge on one diagnosis: broadcast-everything message topologies waste tokens, degrade accuracy, and break coherence in ways that more capable individual agents cannot fix.
Why full broadcast breaks under load
The default architecture in most multi-agent frameworks is simple: every agent sends every message to every peer. This works for three-agent demos. At five or more agents with multiple interaction rounds, it stops working: if each agent posts once per round, the number of deliveries per round grows quadratically with roster size, and every agent’s history grows with each additional round. Agent-Radar identifies the core symptom: relevant information gets diluted by irrelevant context as conversations lengthen, shifting the bottleneck from agent capability to message routing.
The token cost is obvious. The accuracy cost is less discussed but likely larger. When an agent must reason over hundreds of messages from peers, most of which are irrelevant to its current task, signal-to-noise ratio drops and task performance follows.
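A quick back-of-the-envelope sketch makes the growth concrete. It assumes every agent posts exactly one message per round and that each message is delivered to every other agent; the function name and the 200-token message size are illustrative choices, not figures from any of the papers.

```python
def broadcast_load(num_agents: int, rounds: int, tokens_per_message: int = 200) -> dict:
    """Count deliveries and per-agent history under full broadcast.

    Assumption: one message per agent per round, delivered to all peers.
    tokens_per_message is a hypothetical average, used only for scale.
    """
    deliveries_per_round = num_agents * (num_agents - 1)
    # Peer messages one agent has received by the end of the final round.
    received_per_agent = (num_agents - 1) * rounds
    return {
        "deliveries_per_round": deliveries_per_round,
        "total_deliveries": deliveries_per_round * rounds,
        "received_per_agent": received_per_agent,
        "history_tokens_per_agent": received_per_agent * tokens_per_message,
    }


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, broadcast_load(n, rounds=10))
```

Under these assumptions, going from three to ten agents raises deliveries per round from 6 to 90, while each agent’s inbox over ten rounds grows from 20 to 90 peer messages.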
RedundancyBench quantifies one dimension of this waste: even the best-performing method achieves only 24.88% accuracy at detecting redundant steps in agent trajectories. Some methods score worse than random guessing. Agents are not just ingesting irrelevant peer messages; they cannot reliably identify which of their own operations are unnecessary.
Agent-Radar: attention decay as routing
Agent-Radar, submitted May 28, 2026, proposes a training-free fix. The method applies a combined temporal and spatial decay mechanism to each agent’s context window, steering attention toward messages that are both recent and relevant to the current task. The result: up to 7.64 absolute-point gains over state-of-the-art methods across five benchmarks.
Two properties make Agent-Radar worth tracking. First, it requires no additional training; the decay weights are applied at inference time. Second, an ablation study in the paper confirms that both the temporal and spatial components contribute to the gains. The gains hold as agent count and interaction rounds increase, which is exactly the regime where broadcast topologies collapse.
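The idea of weighting context by recency and relevance can be illustrated with a small sketch. This is not Agent-Radar’s implementation: the paper applies decay to attention, and its exact formulation is not given here. The version below instead ranks whole messages before they enter the prompt, using an exponential age decay and a word-overlap score as a stand-in for relevance. The names `Message`, `relevance`, `score`, `select_context`, and the `decay_rate` parameter are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    round_sent: int
    text: str


def relevance(text: str, task: str) -> float:
    """Jaccard word overlap; a crude placeholder for a real relevance score."""
    a, b = set(text.lower().split()), set(task.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(msg: Message, task: str, current_round: int, decay_rate: float = 0.5) -> float:
    """Multiply an exponential recency decay by the relevance placeholder."""
    age = current_round - msg.round_sent
    return math.exp(-decay_rate * age) * relevance(msg.text, task)


def select_context(history, task, current_round, top_k=2):
    """Keep only the top_k highest-scoring messages for the next prompt."""
    ranked = sorted(history, key=lambda m: score(m, task, current_round), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    history = [
        Message("planner", 1, "parse the invoice totals"),
        Message("critic", 4, "weather looks fine today"),
        Message("coder", 4, "invoice totals parsed into rows"),
    ]
    kept = select_context(history, "sum the invoice totals", current_round=5)
    print([m.sender for m in kept])
```

Because the two factors are multiplied, a message has to be both recent and on-topic to rank highly; a recent but off-topic message scores zero, and an old but relevant one is discounted rather than dropped.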
AMRO-S: routing agents like ants
AMRO-S approaches the same problem from a different angle. Rather than filtering context within an agent’s history, AMRO-S models multi-agent routing as a semantic-conditioned path selection problem. It uses a supervised fine-tuned small language model for intent inference and task-specific pheromone specialists, borrowed from ant colony optimization, to route queries to the right agent.
The ant-colony framing sounds gimmicky, but the mechanism is concrete: pheromone trails encode which agents have historically performed well on which task types, and semantic conditioning prevents cross-task interference where routing decisions for one task type bleed into another. AMRO-S improves the quality-cost trade-off over strong baselines on five benchmarks, though the specific baselines and benchmarks are not detailed in the available abstract.
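A toy version of pheromone-based routing shows the mechanism in miniature. It is a generic ant-colony-style sketch, not AMRO-S itself: the task type is passed in directly rather than inferred by a fine-tuned model, and keeping one trail table per task type is a simplified stand-in for semantic conditioning. The class name, the `evaporation` rate, and the reward values are assumptions made for illustration.

```python
import random
from collections import defaultdict


class PheromoneRouter:
    """Route tasks to agents in proportion to per-task-type pheromone levels."""

    def __init__(self, agents, evaporation=0.1, seed=0):
        self.agents = list(agents)
        self.evaporation = evaporation
        self.rng = random.Random(seed)
        # One trail table per task type, so feedback on one task type
        # never changes routing for another.
        self.trails = defaultdict(lambda: {a: 1.0 for a in self.agents})

    def route(self, task_type):
        """Sample an agent with probability proportional to its trail strength."""
        trail = self.trails[task_type]
        weights = [trail[a] for a in self.agents]
        return self.rng.choices(self.agents, weights=weights, k=1)[0]

    def feedback(self, task_type, agent, reward):
        """Evaporate every trail for this task type, then reinforce the agent used."""
        trail = self.trails[task_type]
        for name in trail:
            trail[name] *= 1.0 - self.evaporation
        trail[agent] += reward


if __name__ == "__main__":
    router = PheromoneRouter(["sql_agent", "summarizer"])
    for _ in range(20):
        router.feedback("query", "sql_agent", reward=1.0)
    print(router.trails["query"])
    print(router.route("query"))
```

Evaporation keeps old successes from dominating forever, and the per-task tables mean that strong feedback on "query" tasks leaves routing for any other task type untouched.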
The convergence with Agent-Radar is the point. One paper filters context at the agent level; the other routes queries at the system level. Both reject the broadcast-everything default.
Agents that cannot audit themselves
The RedundancyBench results deserve a closer look, because they describe a failure mode that selective routing alone may not fix.
RedundancyBench tests whether agent systems can identify which steps in their own trajectories are redundant. The 24.88% top score means the best current method gets three out of four redundancy judgments wrong. If agents cannot recognize wasted work in their own execution traces, adding more agents to the system multiplies waste rather than dividing it.
This has direct implications for anyone building multi-agent pipelines. If your architecture assumes agents will self-prune unnecessary steps, RedundancyBench suggests that assumption is wrong. External mechanisms for trajectory auditing are necessary, and current methods for building those mechanisms are themselves unreliable.
What framework builders should change
The practical takeaway from this cluster of papers is straightforward. Framework builders, specifically those shipping multi-agent orchestration layers like CrewAI and AutoGen, should treat inter-agent message routing as a first-class design surface. The broadcast topology that ships by default is convenient for demos and expensive in production.
Two specific changes follow from the evidence:
Make routing topology configurable. Users should be able to specify which agents receive which message types, rather than relying on global fan-out. Agent-Radar’s temporal-spatial decay shows that even a simple relevance filter helps.
Add trajectory auditing. RedundancyBench demonstrates that agents cannot reliably self-audit. Frameworks should expose hooks for external redundancy detection, even though the best current method reaches only 24.88% accuracy.
Neither change requires new research. Both require treating the message graph between agents as an engineering concern rather than an afterthought.
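Both changes can be sketched in a few lines. The `MessageBus` below is a hypothetical illustration, not the API of CrewAI, AutoGen, or any other framework: agents subscribe to message types instead of receiving everything, and audit hooks observe every published message. The exact-repeat detector is a deliberately trivial placeholder for a real redundancy check.

```python
from collections import defaultdict


class MessageBus:
    """Deliver messages by subscription and expose hooks for external auditing."""

    def __init__(self):
        self.subscriptions = defaultdict(set)  # message type -> agent names
        self.inboxes = defaultdict(list)       # agent name -> delivered messages
        self.audit_hooks = []                  # callables that receive each message

    def subscribe(self, agent, message_type):
        self.subscriptions[message_type].add(agent)

    def add_audit_hook(self, hook):
        self.audit_hooks.append(hook)

    def publish(self, sender, message_type, body):
        """Run audit hooks, then deliver only to subscribers of this type."""
        message = {"sender": sender, "type": message_type, "body": body}
        for hook in self.audit_hooks:
            hook(message)
        recipients = sorted(self.subscriptions[message_type] - {sender})
        for agent in recipients:
            self.inboxes[agent].append(message)
        return recipients


def make_repeat_detector(flagged):
    """Return a hook that records messages whose sender and body repeat exactly."""
    seen = set()

    def hook(message):
        key = (message["sender"], message["body"])
        if key in seen:
            flagged.append(message)
        seen.add(key)

    return hook


if __name__ == "__main__":
    bus, flagged = MessageBus(), []
    bus.add_audit_hook(make_repeat_detector(flagged))
    bus.subscribe("coder", "plan")
    bus.subscribe("reviewer", "code")
    print(bus.publish("planner", "plan", "split the task"))
    print(bus.publish("planner", "plan", "split the task"))
    print(len(flagged))
```

Keeping the audit hooks outside the agents is the design point: the detector can be swapped for a better one later without touching agent code.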
What the MINDGAMES arena revealed about agent brittleness
The MINDGAMES arena, which evaluated 944 submitted agents from 76 teams across games like Colonel Blotto, Iterated Prisoner’s Dilemma, Codenames, and Secret Mafia, provides a complementary data point. The top-performing systems relied on explicit structural scaffolding rather than emergent reasoning from the agents themselves. Brittle rule adherence remained a major bottleneck across submissions.
This reinforces the routing argument from a different direction. If agents struggle to follow rules in structured game environments, adding more agents and broadcasting more messages is unlikely to produce emergent coordination. The scaffolding matters more than the agent count.
What to watch next
Agent-Radar and RedundancyBench landed on the same day (May 28, 2026), while AMRO-S has been available since March. The field is converging on inter-agent communication efficiency as a research priority, which suggests practitioner tooling will follow.
The open question is whether routing improvements get absorbed into the major frameworks or remain separate plug-in layers. Agent-Radar’s training-free approach makes it cheap to adopt. AMRO-S’s reliance on a fine-tuned routing model makes it harder to integrate but potentially more effective for domain-specific deployments. RedundancyBench points to a parallel gap: until agents can identify redundant steps in their own trajectories with better than 25% accuracy, even well-routed multi-agent systems will carry unnecessary computational overhead.
Frequently Asked Questions
Does selective routing fix the problem of agents producing contradictory outputs?
No. A separate study on compositional incoherence found that 33 to 94 percent of multi-agent ensemble cliques violate basic probability axioms even when each individual agent is locally coherent. Three intuitive fixes (retrieval augmentation, partition-aware prompting, and a separate aggregator LLM) each failed or made things worse. Routing decides which messages reach which agent. It does not reconcile logical contradictions that emerge when multiple agents produce jointly incoherent conclusions.
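One simple instance of such a violation can be checked mechanically. In the sketch below, each agent assigns a probability to one outcome of a mutually exclusive, exhaustive partition. Every estimate is a valid probability on its own, yet the set is jointly incoherent unless the values sum to 1. This is an illustration of the general idea, not the metric used in the study; the function name and tolerance are assumptions.

```python
def partition_incoherence(estimates: dict, tolerance: float = 1e-6) -> float:
    """Return how far a set of partition probabilities is from summing to 1.

    estimates maps each outcome of a mutually exclusive, exhaustive
    partition to the probability one agent assigned it.
    """
    for outcome, p in estimates.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{outcome!r} has an invalid probability: {p}")
    gap = abs(sum(estimates.values()) - 1.0)
    return 0.0 if gap <= tolerance else gap


if __name__ == "__main__":
    # Three agents, each locally plausible, jointly over-committed.
    print(partition_incoherence({"up": 0.6, "flat": 0.3, "down": 0.4}))
    print(partition_incoherence({"up": 0.5, "down": 0.5}))
```

A routing layer has no view of this gap, since each message it delivers is individually well-formed.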
Agent-Radar is training-free. Does that make it easier to adopt than AMRO-S?
Agent-Radar applies decay weights at inference time with no fine-tuning, so it can slot into an existing pipeline without retraining. AMRO-S requires a supervised fine-tuned small language model for intent inference plus per-task pheromone specialists, which adds a model to train and maintain. The tradeoff: AMRO-S can specialize to domain-specific deployments where a generic recency-and-relevance decay function underperforms, particularly in workflows where context usefulness depends on domain semantics rather than how recently a message was sent.
What does the 24.88 percent RedundancyBench score imply for production token budgets?
The score measures how well methods detect redundant steps, not how many steps are redundant, so it does not translate directly into a waste multiplier. What it does imply is that pruning cannot be relied on: if the best method gets only about one redundancy judgment in four right, and some methods score below random, a pipeline that depends on agents self-pruning will leave most of its redundant steps in place. The operational consequence is that token budgets should be set assuming significant waste, and external trajectory auditing should be treated as a dedicated cost center rather than a capability agents will develop on their own.
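A rough budgeting sketch shows how the pieces interact. It rests on two loud assumptions: `redundant_fraction` is a hypothetical input that this article does not report, and the 24.88% accuracy figure is treated as if it were the share of redundant steps a detector removes, which is a simplification of what an accuracy score measures.

```python
def steps_after_pruning(total_steps: int, redundant_fraction: float, detection_rate: float) -> float:
    """Expected steps still executed after a detector removes some redundant ones.

    Assumptions: redundant_fraction is a guess supplied by the caller, and
    detection_rate is the share of redundant steps correctly removed.
    """
    redundant = total_steps * redundant_fraction
    removed = redundant * detection_rate
    return total_steps - removed


def overhead_ratio(total_steps: int, redundant_fraction: float, detection_rate: float) -> float:
    """Executed steps divided by the steps that were actually necessary."""
    necessary = total_steps * (1.0 - redundant_fraction)
    return steps_after_pruning(total_steps, redundant_fraction, detection_rate) / necessary


if __name__ == "__main__":
    # Hypothetical pipeline: 1,000 steps, 40% assumed redundant.
    print(steps_after_pruning(1000, 0.4, 0.2488))
    print(round(overhead_ratio(1000, 0.4, 0.2488), 3))
```

With those illustrative inputs, the pipeline still runs roughly 900 of its 1,000 steps, about 1.5 times the necessary work. The multiplier is driven mainly by the assumed redundancy share, which is why it has to be measured per pipeline rather than read off the benchmark.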
Could the routing layer itself become a bottleneck as agent count grows?
Agent-Radar computes relevance scores across message pairs and AMRO-S updates pheromone trails per routing decision. Both add a step whose cost rises with roster size. The MINDGAMES arena found that top systems relied on hand-engineered scaffolding rather than emergent reasoning, which suggests routing logic will also resist self-optimization. If the routing computation grows faster than linearly with agent count, the meta-orchestration layer becomes the new bottleneck, and the same poor redundancy detection that plagues agents could apply to routing decisions themselves.