A paper accepted at ACL 2026 Findings (https://arxiv.org/abs/2604.18005) establishes that multi-agent LLM brainstorming systems converge toward homogeneous outputs primarily because of structural coupling — shared base models, identical system prompts, and overlapping context windows — not because the communication graph is too dense. Across five models and five group sizes, the study finds that thinning the topology changes the rate of convergence without changing where the system lands if coupling at the model and prompt layer remains intact. (https://arxiv.org/pdf/2604.18005.pdf)
The Three Coupling Layers
The paper analyzes diversity collapse at three levels of multi-agent architecture. At the model layer, agents running on the same backbone produce semantically correlated outputs from the start, regardless of what personas or instructions are layered on top. At the cognition layer, authority structures suppress divergence: when agents are configured with hierarchical roles, the higher-authority agent’s framing dominates subsequent turns. At the system layer, group size and communication topology modulate how quickly the other two layers push the group toward a shared attractor. (https://arxiv.org/pdf/2604.18005.pdf)
These layers compound rather than substitute for one another. An agent pool running the same backbone across all N positions, with a shared system prompt and a dense round-robin topology, stacks all three coupling effects simultaneously. (https://arxiv.org/pdf/2604.18005.pdf)
Measured Diversity Collapse
The experiment tested group sizes N ∈ {3, 4, 5, 6, 7}, using DeepSeek-V3 as the primary backbone with cross-model replication on GPT-5.1, o1-mini, GPT-4o, and Claude-Sonnet-4. Embeddings used OpenAI text-embedding-3-large with BGE-large sensitivity checks. Each configuration ran 50 independent sessions across 20 topics, producing 1,000 proposals per setting. (https://arxiv.org/pdf/2604.18005.pdf)
Diversity was measured via Vendi Score, then normalized per agent into a Diversity Utilization Ratio (DUR = Vendi Score ÷ N). DUR fell from 1.03 at N=3 to 0.47 at N=7 — a 54% efficiency loss as group size grew. (https://arxiv.org/pdf/2604.18005.pdf) A seven-agent group produces less than half the diversity per agent that a three-agent group does, at greater coordination overhead.
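The DUR arithmetic can be reproduced from the Vendi Score’s standard definition — the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix. The sketch below assumes a cosine-similarity kernel over proposal embeddings, which matches common usage of the score but is an assumption, not a quote of the paper’s pipeline:

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi Score: exp of the Shannon entropy of the eigenvalues
    of the normalized similarity matrix K / n."""
    # Cosine-normalize rows so K is a cosine-similarity Gram matrix.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

def diversity_utilization_ratio(embeddings: np.ndarray, n_agents: int) -> float:
    # DUR = Vendi Score / N, the paper's per-agent normalization.
    return vendi_score(embeddings) / n_agents

# Identical proposals collapse the Vendi Score to 1, so DUR = 1/N.
identical = np.ones((4, 8))
print(round(diversity_utilization_ratio(identical, 4), 3))  # 0.25
```

The normalization makes the collapse legible: a group of N agents producing effectively one idea scores 1/N, while fully independent agents score 1.0.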
Cognitive structure produced comparable effects on absolute Vendi Score. A flat, horizontal structure — agents configured with equal-authority, early-career personas — achieved the highest score at 8.080. An interdisciplinary structure scored lowest at 4.647, a 74% relative difference. (https://arxiv.org/pdf/2604.18005.pdf) The horizontal result is counterintuitive: uniform seniority outperforms the configuration explicitly designed to mix disciplinary backgrounds. The likely mechanism is authority suppression: in mixed-seniority groups, one voice comes to dominate the discussion, producing deference rather than genuine recombination.
On topology, the results follow the same independence-over-density logic. Dense communication accelerated premature convergence. A subgroups topology — agents organized into insulated clusters before a merge step — sustained the highest Constructive Conflict Ratio in later discussion turns. Standard round-robin showed the lowest late-stage diversity of all topologies tested. (https://arxiv.org/pdf/2604.18005.pdf)
Why Thinning the Graph Does Not Fix the Problem
The conventional response to premature convergence in a multi-agent system is to reduce connections: fewer edges, smaller N, or a shift from a dense graph to a ring or chain. The subgroups topology result shows that this framing is wrong. Subgroups wins on late-stage Constructive Conflict Ratio not because it has fewer connections per se, but because it creates periods of structural isolation — agents within a subgroup develop positions without seeing outputs from the other subgroup. By the time groups merge, positions are already formed and resist assimilation. The active ingredient is independence, not sparsity. (https://arxiv.org/pdf/2604.18005.pdf)
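The independence mechanism can be sketched as a turn-gated visibility rule; the function and message format below are illustrative names, not any framework’s real API:

```python
from itertools import chain

def visible_context(agent_id, turn, history, subgroups, merge_turn):
    """Return the prior messages this agent is allowed to read.

    Before `merge_turn`, agents see only their own subgroup (structural
    isolation); afterwards, visibility is global, but positions formed
    during isolation are already on the record.
    """
    if turn < merge_turn:
        # Exploration phase: read only subgroup peers.
        peers = next(group for group in subgroups if agent_id in group)
    else:
        # Merge phase: full visibility across all subgroups.
        peers = set(chain.from_iterable(subgroups))
    return [msg for msg in history if msg["agent"] in peers]
```

The key design choice is that the gate is temporal, not structural: the graph is eventually dense, but density arrives only after independent positions exist.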
If model and prompt coupling remain intact, topology changes rearrange the path to the same attractor. Stronger, highly aligned models amplify this: the paper’s model-level analysis finds that higher-capability models produce diminishing marginal diversity returns, because alignment training narrows the output distribution in ways that aggregate across homogeneous agent pools. (https://arxiv.org/abs/2604.18005)
What This Requires from Framework Developers
The three coupling layers map directly to configuration decisions users make in CrewAI, AutoGen, and LangGraph.
Model coupling is addressed by using heterogeneous backbones. The study’s cross-model replication across DeepSeek-V3, GPT-5.1, o1-mini, GPT-4o, and Claude-Sonnet-4 (https://arxiv.org/pdf/2604.18005.pdf) demonstrates that backbone heterogeneity is an achievable design choice. Frameworks that expose a single model parameter applied uniformly to all agents push users toward homogeneous configurations without signaling the diversity cost.
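In practice this means assigning backbones per agent rather than once globally. A framework-agnostic sketch follows; the `make_agent` factory is a stand-in for whatever agent constructor your framework exposes, and the model strings are drawn from the study’s replication set rather than any vendor’s exact identifiers:

```python
# Backbones named in the study's cross-model replication (identifiers
# here are illustrative shorthand, not exact API model strings).
BACKBONES = [
    "deepseek-v3",
    "gpt-4o",
    "claude-sonnet-4",
]

def make_agent(agent_id: int, model: str) -> dict:
    # Stand-in for a framework agent constructor (e.g. one that
    # accepts a per-agent model/llm parameter).
    return {"id": agent_id, "model": model}

# Round-robin the backbones across agents instead of passing one
# shared `model` parameter to the whole pool.
agents = [make_agent(i, BACKBONES[i % len(BACKBONES)]) for i in range(5)]
```

A five-agent pool built this way spans three model families; the single-parameter default would span one.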
Prompt coupling requires more than assigning different role names to agents who share a common underlying instruction set. The cognitive structure data is direct on this: an interdisciplinary crew where one agent holds senior authority suppresses divergent outputs from the others. (https://arxiv.org/pdf/2604.18005.pdf) Differentiation must extend to authority framing, not just persona labels.
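As a rough illustration of the distinction, consider two system-prompt framings that could sit under identical persona labels. The wording is invented for this example, not taken from the paper’s prompts:

```python
# Two authority framings for the same persona label. Per the paper's
# horizontal-structure result, the flat framing is the one that
# preserves divergence; the hierarchical framing invites deference.
FLAT_FRAMING = (
    "You are one of several equal collaborators. No proposal outranks "
    "another; disagree freely with any prior message."
)
HIERARCHICAL_FRAMING = (
    "You report to the lead researcher. Defer to their framing unless "
    "you have strong evidence otherwise."
)

def system_prompt(persona: str, authority_framing: str) -> str:
    # Persona label and authority framing are separate axes: changing
    # only the persona leaves the coupling intact.
    return f"{authority_framing}\nPersona: {persona}"
```

Two agents with different personas but the same hierarchical framing are still coupled on the axis that the cognitive-structure data says matters.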
Context coupling requires partitioning what agents see before they are expected to disagree. The subgroups topology result functions as a proof of concept: agents that cannot read each other’s outputs during an early exploration phase produce more durable positions that survive into a later merging step. (https://arxiv.org/pdf/2604.18005.pdf) For frameworks that pass a shared state object between agents at each turn, that shared state functions as a context-coupling mechanism, entangling agents that should be exploring independently before any convergence step.
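One way to break that coupling is to partition the state itself, merging only at an explicit convergence node. The sketch below assumes a LangGraph-style dict state passed between nodes; the key names and functions are illustrative, not LangGraph’s API:

```python
def write_proposal(state: dict, agent_id: int, group: str, text: str) -> dict:
    # During exploration, each subgroup writes to its own partition;
    # no agent node reads outside its partition.
    state.setdefault("partitions", {}).setdefault(group, []).append(
        {"agent": agent_id, "text": text}
    )
    return state

def merge_step(state: dict) -> dict:
    # Only the merge node flattens partitions into the shared view
    # that downstream convergence steps are allowed to read.
    state["shared"] = [
        msg for group in state["partitions"].values() for msg in group
    ]
    return state
```

The default in shared-state frameworks is the inverse: every write lands in one globally visible object, which is precisely the context coupling the paper identifies.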
The study’s scale — 1,000 proposals per configuration, replicated across five model families, with Vendi Score validated against BGE-large embeddings (https://arxiv.org/pdf/2604.18005.pdf) — makes the result harder to attribute to domain specificity or measurement noise. The practical reframe for pipeline designers: the question is not “how many agents, in what topology?” but “what do these agents share, and when?” Reducing shared structure at the model and prompt layer before addressing graph topology is the intervention the evidence supports.
Frequently Asked Questions
Does this diversity collapse finding apply to all multi-agent tasks, or only open-ended idea generation?
The paper focuses on open-ended idea generation and brainstorming. The structural coupling mechanism would likely behave differently in tasks where convergence on a single correct answer is the intended outcome.
How is structural coupling different from simply using too many agents?
Structural coupling is caused by shared models, prompts, and context windows, not group size alone. The paper’s 54% diversity efficiency loss from N=3 to N=7 shows that even small homogeneous groups exhibit coupling; growing the group amplifies the effect rather than causing it.
What specific changes should teams make to CrewAI or LangGraph pipelines to avoid diversity collapse?
Teams should use heterogeneous model backbones per agent, differentiate system prompts and authority framing beyond persona labels, and partition shared state so agents explore independently before any merging step.
Can changing the communication topology fix diversity collapse on its own?
No. Topology is a second-order lever: the subgroups configuration helps because it enforces temporary isolation during exploration, giving positions time to form before any cross-pollination occurs. Without addressing model and prompt coupling first, any topology choice — sparse or dense — converges on the same homogeneous attractor, just at different speeds.
Do more capable models reduce diversity collapse?
The paper finds the opposite. Alignment training compresses the output distribution of stronger models into a narrower band of plausible responses, and when multiple agents run the same aligned backbone, those compressed distributions reinforce each other. Using a mix of models from different providers is a more reliable lever than upgrading to a single more capable model.