More Agents, Worse Results: Why Multi-Agent LLM Teams Hold Experts Back

Adding another agent to a pipeline sounds like a free upgrade. The evidence now says it is a tax. arXiv:2602.01011, accepted at ICML 2026, reports that LLM teams consistently underperform their strongest member by 6 to 41.1 percentage points on ML benchmarks. A separate study on prediction-market oracles and a taxonomy of 1,600+ failure traces from production multi-agent systems converge on the same finding: the committee is not smarter than the expert sitting in it.

The accuracy penalty

arXiv:2602.01011 (“Multi-Agent Teams Hold Experts Back,” v4 updated 2026-05-28) measures what happens when you form a team of LLM agents and let them coordinate freely on ranking, multiple-choice, and short-answer tasks. The result is unambiguous: the team’s aggregate answer is worse than the best individual agent’s answer, and the gap scales with team size.

The headline result is a loss of 6 to 41.1 percentage points relative to the best team member across standard ML benchmarks. The authors tested both homogeneous teams (same model, different instances) and heterogeneous teams (mixed model families via OpenAI, Anthropic, and OpenRouter backends). The pattern held in both configurations. Teams were even told which agent was the expert. They still averaged the expert’s output with weaker agents’ outputs and landed at a worse answer.
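To make the metric concrete, here is a minimal sketch of how a gap "relative to the best team member" can be computed in percentage points. The function name and the accuracies are hypothetical and chosen only for illustration; they are not figures from the paper, and the paper's own toolkit may compute the gap differently.

```python
def expert_gap_pp(member_accuracies, team_accuracy):
    """Return how far the team falls below its best member, in percentage points.

    Accuracies are fractions in [0, 1]. A positive result means the team
    is worse than its strongest individual member.
    """
    if not member_accuracies:
        raise ValueError("need at least one member accuracy")
    best = max(member_accuracies)
    return round((best - team_accuracy) * 100, 2)


# Hypothetical accuracies for illustration only; not taken from the paper.
members = [0.82, 0.74, 0.69]
team = 0.76

gap = expert_gap_pp(members, team)
print(f"best member: {max(members):.0%}, team: {team:.0%}, gap: {gap} pp")
```

Note that the comparison is against the best member, not the team average, which is why a team can beat most of its members and still show a positive gap.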

The paper ships an open-source toolkit that supports replication across all three task types with configurable team size, information-distribution modes, and mixed backend selection.

Why the expert gets dragged down

The bottleneck is not identification. The teams generally figure out which member is strongest. The bottleneck is what the authors call “expert leveraging”: the integrative tendency to treat every agent’s contribution as partially valid and synthesize a compromise position rather than deferring to the agent with the best track record.

This compromise behavior intensifies as the team grows. A three-agent team averages more aggressively than a two-agent team, and the performance penalty correlates with that averaging. The mechanism is straightforward if you have watched a meeting of competent engineers who all feel obliged to contribute: the strongest signal gets diluted when it is averaged against weaker but confident signals.
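The dilution argument can be illustrated with a toy calculation. This is a sketch under strong assumptions, not the paper's method: the task is reduced to a single numeric estimate, the "compromise" is modeled as an equal-weight average, and every value below is hypothetical.

```python
def averaged_estimate(expert, others):
    """Equal-weight average of the expert's estimate and everyone else's."""
    values = [expert] + list(others)
    return sum(values) / len(values)


truth = 10.0
expert = 10.0   # hypothetical expert, exactly right
weaker = 16.0   # hypothetical weaker agents, each off by +6

# Under equal weighting the expert's share of the answer is 1/n,
# so the team error grows as more weaker agents join.
for team_size in (2, 3, 5):
    estimate = averaged_estimate(expert, [weaker] * (team_size - 1))
    print(f"team of {team_size}: estimate={estimate:.2f}, "
          f"error={abs(estimate - truth):.2f}")
```

In this toy model the expert alone has zero error, and each added agent shrinks the expert's weight. Real deliberation is not a literal arithmetic mean, so treat this only as intuition for why the penalty could grow with team size.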

There is no evidence in the paper that prompting strategies (instructing agents to weight expert input more heavily) close the gap. The integrative-compromise tendency appears to be a structural property of LLM deliberation under unconstrained coordination, not a surface-level behavior a system prompt can patch.

Deliberative consensus actively hurts accuracy

A separate line of evidence arrives at the same conclusion from a different domain. arXiv:2605.30802 evaluates multi-agent AI oracle systems for prediction-market resolution, a task where accuracy is directly measurable and the cost of confident error is high.

Deliberative multi-agent consensus, where models debate a resolution before settling on an answer, degraded accuracy to roughly 76%. That figure sits below every single-model baseline tested: GPT-5 Nano, DeepSeek V3, and Llama-3.3-70B all beat the committee when run independently. The mechanism is familiar from the ICML paper: confidently wrong models flipped correct ones during debate.

The prediction-market study did find one setup that beat single-model baselines: independent aggregation with confidence-weighted voting, which achieved 83.43% accuracy, a 1.01 percentage-point gain over the best single model. But even this ensemble ceiling is modest. Error correlations between models ranged from 0.529 to 0.689, placing a hard limit on ensemble gains well below the theoretical Condorcet ceiling. When models make the same mistakes, voting cannot average those mistakes away.
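A minimal sketch of independent aggregation with confidence-weighted voting follows. The study's exact weighting scheme is not described here, so summing each model's self-reported confidence per label is an assumption, and the function name and sample predictions are hypothetical.

```python
from collections import defaultdict


def confidence_weighted_vote(predictions):
    """Pick the label with the highest summed confidence.

    predictions: list of (label, confidence) pairs, one per model,
    produced independently (no model sees another model's output).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    totals = defaultdict(float)
    for label, confidence in predictions:
        totals[label] += confidence
    # Ties resolve to the label that was seen first.
    return max(totals, key=totals.get)


# Hypothetical outputs from three isolated models.
majority_wins = [("YES", 0.55), ("NO", 0.90), ("YES", 0.60)]
confident_minority_wins = [("YES", 0.40), ("YES", 0.40), ("NO", 0.95)]

print(confidence_weighted_vote(majority_wins))
print(confidence_weighted_vote(confident_minority_wins))
```

The key design point is that the models never exchange messages, so a confidently wrong model can outvote others only through its weight, not by persuading them to change their answers. Weighting does nothing about correlated errors: if all models are wrong on the same question, the vote is wrong too.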

Five MAD frameworks vs. a single prompt

An evaluation of five multi-agent debate frameworks (MAD, Multi-Persona, EoT, ChatEval, AgentVerse) across nine benchmarks found that none consistently outperformed simple single-agent Chain-of-Thought (CoT) or Self-Consistency. The numbers are blunt: GPT-4o-mini with CoT scored 80.73 on MMLU. The best multi-agent debate method scored 80.40. The committee cost more inference tokens and delivered a lower number.

This is not an isolated result. The pattern repeats across benchmarks in the evaluation. Multi-agent debate occasionally matches single-agent CoT on specific tasks but never establishes a reliable, cross-benchmark advantage that would justify the additional compute and orchestration overhead.

The MAST failure taxonomy

arXiv:2503.13657 (“Why Do Multi-Agent LLM Systems Fail?”) provides the diagnostic layer. The authors annotated 1,600+ execution traces from seven multi-agent system frameworks and identified 14 distinct failure modes, clustered into three categories:

System design issues: misconfigured agent roles, broken handoff protocols, missing termination conditions
Inter-agent misalignment: agents pursuing conflicting objectives, redundant work, communication failures
Task verification gaps: no validation step, incorrect acceptance criteria, silent output corruption

The inter-annotator agreement was kappa = 0.88, which is high enough to treat the taxonomy as reliable. The practical takeaway is that multi-agent failures are not random noise; they are reproducible, categorizable, and concentrated in design and alignment problems that single-agent architectures simply do not have.
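For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two annotators. The source does not say here which kappa variant the authors used, so the two-annotator Cohen form is an assumption, and the labels are hypothetical stand-ins for the three taxonomy categories.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six traces by two annotators.
annotator_1 = ["design", "design", "misalign", "verify", "verify", "misalign"]
annotator_2 = ["design", "design", "misalign", "verify", "misalign", "misalign"]

print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

Kappa subtracts the agreement two annotators would reach by chance, which is why it is a stricter measure than raw percent agreement.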

The one real upside: adversarial robustness

The ICML paper identifies a genuine trade-off. The same consensus-seeking behavior that drags down expert performance also makes teams more robust to adversarial agents. When one team member is compromised or explicitly trying to steer the group toward a wrong answer, the averaging effect acts as a dampener. The team’s output moves toward the center of the group’s distribution rather than snapping to the adversarial agent’s target.
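The dampening effect can be illustrated with a toy calculation. As a simplifying assumption, the team's consensus is modeled as an equal-weight average of numeric estimates; the values and the compromised-agent setup are hypothetical and not taken from the paper.

```python
def team_average(values):
    """Equal-weight consensus over the team's numeric estimates."""
    return sum(values) / len(values)


honest = [10.0, 10.5, 9.5, 10.0]   # hypothetical honest estimates
adversary_target = 50.0            # value a compromised agent pushes toward

clean = team_average(honest)
# Replace the last honest agent with the adversary.
attacked = team_average(honest[:-1] + [adversary_target])

print(f"clean consensus:    {clean:.1f}")
print(f"attacked consensus: {attacked:.1f}")
print(f"shift from one adversary: {attacked - clean:.1f} "
      f"(full capture would shift {adversary_target - clean:.1f})")
```

With four agents, one adversary moves the averaged output only a quarter of the way to its target. The same 1/n weighting is what limits a single expert's influence, which is the trade-off the paragraph above describes.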

This finding matters directly for safety and alignment work, where adversarial inputs are a primary threat model. It is not a good reason to use multi-agent teams for tasks where the goal is maximum accuracy on well-defined benchmarks. The two objectives, accuracy and robustness to adversarial manipulation, pull in opposite directions under current architectures.

Practical guidance

As of June 2026, the default assumption in CrewAI, AutoGen, and LangGraph documentation and tutorials is that adding agents to a pipeline is an improvement. The evidence from the independent research streams covered above says the opposite for unconstrained, deliberative team configurations.

A few concrete rules:

Benchmark before shipping a crew. Run the same task through your best single agent and through the multi-agent pipeline. If the crew does not win by a margin larger than the additional inference cost justifies, drop the crew.
Distinguish deliberative teams from fixed-pipeline chains. The ICML paper studies self-organizing teams with unconstrained coordination. A rigid planner-coder-reviewer pipeline with strict handoffs and no back-channel debate may behave differently. The failure modes from the MAST taxonomy still apply, but the expert-dragdown effect may be attenuated when agents cannot rewrite each other’s intermediate outputs.
If you use ensembles, use independent voting, not debate. The prediction-market study shows confidence-weighted independent aggregation beats every deliberative setup. Keep the agents isolated and aggregate their outputs post-hoc.
Treat multi-agent as opt-in, not default. The burden of proof should sit on the team topology: it needs to demonstrate a measurable per-task delta over a single strong agent. Anything else is paying for worse results.
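The first rule above can be sketched as a small comparison helper. Everything here is hypothetical: the `EvalResult` type, the `compare` function, the accuracy and cost figures, and especially the required-gain threshold, which is a judgment call that depends on your own cost structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    accuracy: float        # fraction correct on the shared eval set
    cost_per_task: float   # average inference cost, in any consistent unit


def compare(single, crew, required_gain_pp):
    """Compare a crew to the best single agent on the same eval set.

    required_gain_pp is the minimum accuracy gain, in percentage points,
    that you decide is worth the crew's extra cost.
    """
    gain_pp = (crew.accuracy - single.accuracy) * 100
    cost_multiple = crew.cost_per_task / single.cost_per_task
    return {
        "gain_pp": gain_pp,
        "cost_multiple": cost_multiple,
        "ship_crew": gain_pp >= required_gain_pp,
    }


# Hypothetical measurements from running both setups on the same tasks.
single = EvalResult(accuracy=0.81, cost_per_task=1.0)
crew = EvalResult(accuracy=0.83, cost_per_task=4.0)

report = compare(single, crew, required_gain_pp=5.0)
print(report)
```

In this example the crew is about two points more accurate at four times the cost, which fails a five-point bar. The value of the exercise is that the threshold is written down before the results come in.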

The research does not say multi-agent architectures are useless. It says the default configuration most practitioners reach for, a freely coordinating team of agents debating toward consensus, is the wrong default. The correct default is a single strong model with task-specific routing, and multi-agent coordination should be an explicit optimization that justifies itself with numbers.

Frequently Asked Questions

Do these results apply to rigid pipeline chains where agents hand off work sequentially?

The ICML paper measures self-organizing teams with unconstrained coordination, not planner-coder-reviewer chains with strict handoffs, so the 6 to 41.1 percentage-point dragdown may be attenuated in rigid pipelines. However, the MAST taxonomy’s 14 failure modes specifically include broken handoff protocols and missing termination conditions that apply to sequential chains. That taxonomy studied seven production-grade multi-agent system frameworks, so the failure catalog extends well beyond freely debating teams.

What inference-cost multiplier does a multi-agent setup carry versus a single model?

No single multiplier applies, because the cost depends on team size and the number of debate rounds. A freely debating team incurs multiple turns of inter-agent communication on top of each agent’s base inference. The prediction-market study found that deliberative consensus dropped accuracy to roughly 76% while requiring every model to process the full debate transcript at each round. Independent confidence-weighted voting achieved 83.43% with zero inter-agent tokens, making it both cheaper and more accurate.
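A rough token-count model shows why debate cost grows faster than team size. This is a back-of-the-envelope sketch: the function names, the token counts, and the assumption that every agent re-reads the full transcript each round and writes one fixed-length message are all hypothetical, and none of these numbers come from the study.

```python
def independent_tokens(n_agents, prompt_tokens, answer_tokens):
    """Each agent reads the prompt once and writes one answer."""
    return n_agents * (prompt_tokens + answer_tokens)


def debate_tokens(n_agents, rounds, prompt_tokens, message_tokens):
    """Each round, every agent reads the prompt plus the transcript so far,
    then writes one message that is appended to the transcript."""
    total = 0
    transcript = 0
    for _ in range(rounds):
        total += n_agents * (prompt_tokens + transcript + message_tokens)
        transcript += n_agents * message_tokens
    return total


# Hypothetical sizes: 3 agents, 500-token prompt, 200-token messages.
independent = independent_tokens(3, prompt_tokens=500, answer_tokens=200)
debate = debate_tokens(3, rounds=3, prompt_tokens=500, message_tokens=200)

print(f"independent voting: {independent} tokens")
print(f"three-round debate: {debate} tokens")
print(f"multiplier: {debate / independent:.1f}x")
```

Because the transcript grows by one message per agent per round, total debate tokens scale roughly with the square of the number of rounds and the square of the team size, while independent voting scales linearly with team size.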

Were generative tasks like code generation or retrieval-augmented pipelines tested?

The accuracy studies cover ranking, multiple-choice, and short-answer benchmarks, plus prediction-market resolution. None of them evaluates code generation, long-form prose, or tool-use workflows where a specialized retrieval agent might compensate for a weaker generator. The MAST taxonomy does annotate tool-use traces, but the expert-dragdown measurements are confined to evaluation-style tasks where correctness is cleanly measurable.

Could better prompting or a future model generation fix the integrative-compromise problem?

The ICML paper tested explicit prompting that identified the expert agent to the team, and the team still averaged the expert’s output with weaker contributions. The authors characterize the compromise tendency as a structural property of LLM deliberation under unconstrained coordination, not a surface behavior a system prompt can patch. This suggests the dragdown would persist across model generations unless architectures introduce explicit deference or calibration mechanisms.