Combining LLMs Doesn't Escape Shared Failures: A 67-Model Test

Josef Liyanjun Chen’s arXiv:2606.27288, submitted June 25, 2026, establishes a hard ceiling on what model ensembles can achieve: for any policy that selects one member model’s answer (routing, voting, cascades, mixture-of-agents), accuracy cannot exceed 1 − β, where β is the rate at which every model in the pool is wrong on the same query. Tested across 67 frontier models from 21 providers, the paper finds that standard correlation diagnostics consistently underprice this all-wrong rate, and that the main levers practitioners reach for (add more models, add diverse providers) do not reliably move β.

What is the co-failure ceiling, and why can’t routing or voting escape it?

The ceiling is structural rather than empirical. Any system that selects one model’s answer can only be right when at least one model in the pool is right on that specific query. The rate at which no model is right is β. As arXiv:2606.27288 formalizes it, accuracy ≤ 1 − β holds for any member-selection policy, regardless of how sophisticated the selection mechanism is. A router with perfect query-to-model matching would still be bounded by this ceiling.
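The bound can be made concrete with a small correctness grid. The sketch below is illustrative only: the function names and the 5-query, 3-model grid are hypothetical and are not taken from arXiv:2606.27288. It computes β as the share of queries on which no model is correct, then compares the resulting ceiling with the best single model.

```python
from typing import Sequence


def all_wrong_rate(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of queries on which no model in the pool is correct.

    correct[q][m] is True when model m answered query q correctly.
    """
    if not correct:
        raise ValueError("need at least one query")
    all_wrong = sum(1 for row in correct if not any(row))
    return all_wrong / len(correct)


def selection_ceiling(correct: Sequence[Sequence[bool]]) -> float:
    """Upper bound 1 - beta on the accuracy of any member-selection policy."""
    return 1.0 - all_wrong_rate(correct)


# Hypothetical grid: rows are queries, columns are models.
grid = [
    [True, False, False],
    [False, True, True],
    [False, False, False],  # every model is wrong here
    [True, True, True],
    [False, False, True],
]

beta = all_wrong_rate(grid)
ceiling = selection_ceiling(grid)
# Accuracy of the best single model, computed column by column.
best_single = max(sum(col) / len(grid) for col in zip(*grid))

print(f"beta={beta:.2f} ceiling={ceiling:.2f} best_single={best_single:.2f}")
```

In this toy grid a perfect router could lift accuracy from the best single model's level up to the ceiling, and no further: the third query is lost whichever model is selected.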

β is not a product of individual error rates in isolation. It is a joint property of the model pool on each specific query. When models fail on the same queries (the co-failure pattern the paper characterizes), β can run substantially higher than independent error rates would predict, and substantially higher than the diagnostic most practitioners use would indicate. Adding more models from more providers does not lower β if those models share a common failure mode on the queries that matter.

Why does pairwise error correlation (ρ) hide the all-wrong tail?

The standard diagnostic for ensemble quality is average pairwise error correlation (ρ): low ρ implies diverse failures, which implies the ensemble is safe to rely on. arXiv:2606.27288 shows this inference has a structural hole.

Error distributions with identical marginals and identical pairwise correlations can have different all-wrong rates. Low ρ is necessary but not sufficient for independence at the tail. The all-wrong rate β occupies a region of joint error distribution space that pairwise statistics do not index. A pool of models with genuinely low pairwise correlations can still share a significant set of queries where all of them are simultaneously wrong, and ρ will not flag it.
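A three-model toy case shows how this can happen. The construction below is a standard textbook example of pairwise-independent but jointly dependent binary variables, not the paper's data; the names `independent` and `coupled` are illustrative. Both joint distributions give every model a 50% error rate and zero correlation for every pair, yet their all-wrong rates differ by a factor of two.

```python
from itertools import product


def summarize(joint):
    """Return (marginal error rates, pairwise correlations, all-wrong rate).

    joint maps an error tuple (1 = wrong, 0 = right) to its probability.
    """
    n = 3
    marg = [sum(p * e[i] for e, p in joint.items()) for i in range(n)]
    corr = {}
    for i in range(n):
        for j in range(i + 1, n):
            both = sum(p * e[i] * e[j] for e, p in joint.items())
            cov = both - marg[i] * marg[j]
            sd = (marg[i] * (1 - marg[i]) * marg[j] * (1 - marg[j])) ** 0.5
            corr[(i, j)] = cov / sd
    return marg, corr, joint.get((1, 1, 1), 0.0)


# Case A: three fully independent models, each wrong half the time.
independent = {e: 1 / 8 for e in product((0, 1), repeat=3)}

# Case B: models 1 and 2 are independent; model 3 is wrong exactly when
# the first two agree with each other (both right or both wrong).
coupled = {(a, b, 1 - (a ^ b)): 1 / 4 for a, b in product((0, 1), repeat=2)}

marg_a, corr_a, beta_a = summarize(independent)
marg_b, corr_b, beta_b = summarize(coupled)

print(marg_a, corr_a, beta_a)
print(marg_b, corr_b, beta_b)
```

A diagnostic that looks only at marginals and pairwise correlations cannot tell the two cases apart, which is the gap the all-wrong rate is meant to close.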

The quantitative demonstration is in the mathematics results. The paper fits a tetrachoric-calibrated single-factor model to the 67-model pool and compares its implied β to the observed value. On open-ended mathematics, the single-factor model predicts β = 0.023, while the observed β is 0.052. The paper reports this as approximately 2.5x underpricing (90% CI 1.7 to 3.4, k = 17 benchmark sets); the ratio of the two quoted point estimates is about 2.3, inside that interval. A reliability framework calibrated to the single-factor model’s implied ceiling would capture less than half of the observed all-wrong rate.
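For readers who want to see what an implied β from a single-factor model looks like, here is a minimal sketch. It assumes an equicorrelated one-factor Gaussian latent-variable form (each model errs when a latent score falls below a threshold, and all scores share one common factor with latent correlation `rho`). That form, the function name, and the example inputs are assumptions for illustration; the paper's actual fitting procedure may differ.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal


def one_factor_beta(error_rates, rho, half_width=8.0, steps=4000):
    """All-wrong rate implied by an equicorrelated one-factor Gaussian model.

    Model i is wrong when sqrt(rho)*Z + sqrt(1-rho)*eps_i < t_i, where
    t_i = inv_cdf(error_rate_i). Conditional on the shared factor Z the
    errors are independent, so the all-wrong probability is a product
    that is then averaged over Z with the trapezoid rule.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    thresholds = [_STD.inv_cdf(p) for p in error_rates]
    load, resid = rho ** 0.5, (1.0 - rho) ** 0.5
    dz = 2 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * dz
        weight = 0.5 if k in (0, steps) else 1.0
        cond = 1.0
        for t in thresholds:
            cond *= _STD.cdf((t - load * z) / resid)
        total += weight * cond * _STD.pdf(z) * dz
    return total


# Hypothetical per-model error rates, not figures from the paper.
rates = [0.3, 0.2, 0.4]
beta_indep = one_factor_beta(rates, rho=0.0)   # reduces to the plain product
beta_corr = one_factor_beta(rates, rho=0.5)    # shared factor fattens the tail
print(beta_indep, beta_corr)
```

Comparing a number like `beta_corr` with the all-wrong rate actually counted on labeled data is the kind of check the paragraph above describes: if the observed rate is well above the implied one, a single shared factor is not capturing the tail.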

What do the 67-model numbers actually show?

arXiv:2606.27288 reports β across three task types, each with a distinct measurement approach.

On open-ended mathematics: observed β = 0.052 versus the single-factor (Gaussian copula) prediction of 0.023, a gap the paper reports as 2.5x (90% CI 1.7 to 3.4, k = 17 benchmark sets). On execution-graded code: observed β = 0.079. Code evaluation removes annotator ambiguity entirely; the test suite either passes or it does not. Even on this verifiable task type, roughly 8% of queries produced a simultaneous failure across all 67 models.

On GPQA-Diamond re-asked in free-response format: β = 0.127, evaluated by a five-judge panel with inter-annotator agreement κ = 0.73 to 0.92. GPQA-Diamond is the graduate-level science benchmark widely used to evaluate frontier models; finding β above 0.12 there is not a niche-task artifact.

All three are point estimates on a specific pool and task distribution. These numbers will shift as the model population changes. The paper is v1 as of June 25, 2026, and its β values are estimates with stated uncertainty, not settled consensus.

Does answer format matter more than subject domain?

The GPQA result points to a specific mechanism. The benchmark covers graduate-level biology, chemistry, and physics in both conditions: same subject matter, same 67 models, different format. In multiple-choice format, co-failure is structurally constrained: to fail together, models must converge on the same specific wrong option from a finite set. In free-response format, that structural constraint disappears.

The arXiv:2606.27288 finding that β reopens to 0.127 on free-response GPQA locates the driver in format, not subject. A multi-model system that runs the same question through N models in the same query format and calls that diverse verification may be constructing a pool with a substantially higher β than its provider diversity implies. Varying query format is at least as important as varying model selection.

This has a practical corollary for deployment. Systems that surface identical prompts to multiple models because they want independent assessment are likely measuring something closer to “do models converge on the same wrong answer in the same way” than “do models actually fail independently.” Format homogeneity is a co-failure amplifier that sits upstream of model selection.

Why does adding more models stop helping on checkable tasks?

On tasks where correctness is verifiable, combining models rarely beats the single best model in the pool without a query-level routing signal that identifies which model performs best on each query type. According to arXiv:2606.27288, gains from ensembling on checkable tasks come from models failing on different questions, not from expanding pool size once co-failure dominates the ceiling.

The paper compares ensemble strategies directly: at matched quality, low-ρ heterogeneous ensembles outperform high-ρ Self-MoA architectures (arXiv:2606.27288). Self-MoA produces high pairwise correlation by construction, because the same base model refines its own output through multiple passes. Running additional passes does not lower β; it adds inference cost while leaving the all-wrong rate roughly intact. The diversity that moves β is diversity in which questions a model gets wrong, not diversity in architecture labels or provider names.

For practitioners, the paper’s Clopper-Pearson procedure offers a usable tool: a confidence bound on β from a finite-sample calibration set yields a certificate on the maximum gain any router, vote, or cascade could deliver. That certificate is computable before training a router, which means the ceiling audit can precede the architecture commitment rather than arriving as a post-deployment surprise.
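A minimal sketch of such an audit follows, using only the standard library. It computes one-sided Clopper-Pearson (exact binomial) bounds on β by bisection and turns the lower bound into a ceiling. Reading the certificate as "accuracy ≤ 1 − lower bound on β" is an interpretation of the description above, and the counts (52 all-wrong queries out of 1,000) and the best-single-model accuracy of 0.90 are hypothetical; the paper's exact procedure may differ.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k < 0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(min(k, n) + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _first_true(pred, iters: int = 60) -> float:
    """Bisection for the smallest p where a monotone predicate turns True."""
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pred(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """One-sided exact bounds (lower, upper) on a binomial rate, each at level alpha."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    lower = 0.0 if k == 0 else _first_true(
        lambda p: 1.0 - binom_cdf(k - 1, n, p) >= alpha)
    upper = 1.0 if k == n else _first_true(
        lambda p: binom_cdf(k, n, p) <= alpha)
    return lower, upper


# Hypothetical calibration set: 52 all-wrong queries out of 1,000.
beta_lo, beta_hi = clopper_pearson(52, 1000)
ceiling = 1.0 - beta_lo          # certified accuracy ceiling at level alpha
best_single = 0.90               # hypothetical best single-model accuracy
max_gain = ceiling - best_single
print(f"beta in [{beta_lo:.4f}, {beta_hi:.4f}], max gain <= {max_gain:.4f}")
```

If the certified maximum gain is smaller than the cost of building and serving a router, the audit has answered the architecture question before any router is trained.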

What does this mean for governance frameworks built on multi-model agreement?

A common pattern in AI governance proposals is redundant model checking: send a high-risk output through multiple models, require agreement before proceeding, treat disagreement as a veto. The implicit assumption is that diverse models provide independent oversight. arXiv:2606.27288 makes that assumption expensive to defend.

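On a task domain with β = 0.127 for the full pool, any three-model subset is all-wrong on at least 12.7% of queries, because a query that defeats every model in the pool also defeats any three of them. On those queries an agreement check can only return a unanimous wrong answer or a veto; it never surfaces a correct one. The governance framework gets to claim three checks ran; it does not get to claim they were independent. That is correlated failure expressed through a voting layer, not redundancy.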
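The possible outcomes of a unanimity check are easy to enumerate. The sketch below uses a hypothetical `check_outcome` helper and four made-up queries; it is not drawn from either paper. It shows that when every model in the check is wrong, the check either approves a wrong answer or vetoes, and never approves a correct one.

```python
from collections import Counter


def check_outcome(answers, truth):
    """Classify a unanimity check: approve only if all answers match."""
    if len(set(answers)) != 1:
        return "veto"
    return "approved_correct" if answers[0] == truth else "approved_wrong"


# Hypothetical (answers from three models, ground truth) pairs.
queries = [
    (("4", "4", "4"), "4"),  # all right, unanimous
    (("7", "7", "7"), "9"),  # all wrong, same wrong answer
    (("2", "3", "5"), "1"),  # all wrong, different wrong answers
    (("6", "6", "8"), "6"),  # one dissenter blocks a correct answer
]

tally = Counter(check_outcome(answers, truth) for answers, truth in queries)
print(dict(tally))
```

The second query is the dangerous case for governance: the log shows three checks and full agreement, and the output is still wrong.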

A parallel preprint, arXiv:2606.26298 on institutional attestation (submitted June 24, 2026), takes a structurally different approach. Under that model, an agent retains full autonomy over planning and reasoning but holds no execution authority over high-risk actions. Execution requires preconditions independently attested by separate authoritative sources, cryptographically bound to intent and logged tamper-evidently. The independence is enforced architecturally, not inferred from model diversity. Whether a model agrees with another model is beside the point; the attestation layer does not consult it.

Mainstream multi-agent frameworks including AutoGen, CrewAI, LangGraph, and MetaGPT use fixed or learned topologies optimized for accuracy or throughput. Recent work like Differentiable Mixture-of-Agents (arXiv:2605.15706) adds dynamic per-step routing for performance gains. Neither line minimizes β as an explicit design objective or characterizes the all-wrong rate as a reliability constraint. Adding more capable models to these frameworks can raise average accuracy; it does not lower the co-failure rate, and so does not raise the 1 − β ceiling, unless those models succeed on queries the existing pool gets wrong.

What should practitioners measure and demand instead?

Measure β directly on a representative task distribution, using the Clopper-Pearson procedure from arXiv:2606.27288 to bound it from finite samples. The ρ-based diagnostic cannot identify β; the paper proves they are not interchangeable. Using ρ to infer tail independence is the specific failure mode the paper characterizes and quantifies.

Vary query format alongside model selection. The GPQA result identifies format as a co-failure driver at least as significant as provider diversity. A pool of models from different providers, all receiving the same query framing, may share a common failure pattern that provider diversity alone does not dissolve. Build held-out format variation into the calibration set used to estimate β.
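One way to build format variation into the calibration step is to estimate β separately for each format label. The sketch below assumes each calibration record carries a format tag alongside the per-model correctness row; the record layout, labels, and values are hypothetical.

```python
from collections import defaultdict


def beta_by_format(records):
    """All-wrong rate per format label.

    records is an iterable of (format_label, per_model_correct) pairs,
    where per_model_correct is a sequence of booleans, one per model.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [all_wrong, total]
    for fmt, row in records:
        counts[fmt][1] += 1
        if not any(row):
            counts[fmt][0] += 1
    return {fmt: wrong / total for fmt, (wrong, total) in counts.items()}


# Hypothetical calibration records for a three-model pool.
records = [
    ("multiple_choice", [True, False, True]),
    ("multiple_choice", [False, True, False]),
    ("multiple_choice", [True, True, True]),
    ("multiple_choice", [False, False, False]),
    ("free_response", [False, False, False]),
    ("free_response", [True, False, False]),
    ("free_response", [False, False, False]),
    ("free_response", [False, True, True]),
]

per_format = beta_by_format(records)
print(per_format)
```

A large gap between formats on the same questions suggests the pool's ceiling depends on how queries are posed, so a single pooled β would understate the risk in the harder format.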

Treat multi-model agreement as a correlated check, not an independent one, when the task domain has measurable co-failure. For decisions where correctness matters enough to require genuine oversight, the institutional attestation model from arXiv:2606.26298 represents a different design philosophy: independence enforced at the architecture level rather than assumed from model diversity.

The theorem in arXiv:2606.27288, accuracy ≤ 1 − β for any member-selection policy, does not depend on which 67 models were tested or which benchmarks were used. The specific β values will shift as the model landscape shifts. The constraint will not.

Frequently Asked Questions

Does the co-failure ceiling apply to ensemble methods that synthesize outputs rather than selecting one model’s answer?

No. The theorem bounds any policy that selects a single member model’s output: routing, majority vote, and cascade fallback. Synthesis approaches that merge logits, distill outputs, or run critic-refinement loops across models fall outside its formal scope. Those methods face different failure modes: a synthesizer trained on wrong model outputs can still produce wrong outputs, and its reliability depends on the synthesizer itself, not just the member pool’s β.

How does the β result relate to the Condorcet jury theorem, which predicts ensemble accuracy approaches 100% as pool size grows?

The Condorcet jury theorem requires voters to err independently, and each to be right more often than wrong. Under independence, the probability that every model fails on a query shrinks geometrically as the pool grows, so it would be vanishingly small at 67 models. A β that stays at several percent across a pool that size is evidence of shared blind spots, and adding more models does not reduce that floor if the newcomers share them. arXiv:2606.27288 reports a nonzero β on all three task types tested, so the Condorcet prediction does not carry over to these pools.
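The contrast can be sketched with a toy mixture model. This is an illustrative assumption, not the paper's model: with probability `beta` a query defeats every model, and otherwise models vote independently, each correct with probability `p_correct`. The function names and parameter values are hypothetical.

```python
from math import comb


def majority_accuracy_independent(p_correct: float, n: int) -> float:
    """P(strict majority is correct) for n independent voters (n odd)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p_correct ** k * (1 - p_correct) ** (n - k)
               for k in range(need, n + 1))


def majority_accuracy_with_floor(p_correct: float, n: int, beta: float) -> float:
    """Toy mixture: a beta share of queries defeats everyone; the rest
    behave like independent voters."""
    return (1 - beta) * majority_accuracy_independent(p_correct, n)


for n in (1, 7, 21, 67):
    indep = majority_accuracy_independent(0.7, n)
    floored = majority_accuracy_with_floor(0.7, n, beta=0.127)
    print(f"n={n:2d} independent={indep:.4f} with_floor={floored:.4f}")
```

The independent curve climbs toward 1 as the pool grows, while the floored curve flattens below 1 − β regardless of how many voters are added.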

What data does measuring β require that a pairwise correlation diagnostic does not?

The label-free shortcut often used in place of ρ, pairwise agreement between model outputs, can be computed on any test set without ground truth; error correlation proper needs to know which outputs are wrong. Computing β has no such shortcut: identifying the all-wrong cases means a verified ground-truth label for each query. For tasks without checkable answers, that requires a human annotation pass or a reliable judge. The practical implication: β measurement is only straightforward on tasks with verifiable correctness, which limits its use on open-ended generation.

What infrastructure does institutional attestation require that a multi-model vote does not?

A multi-model vote adds inference cost proportional to pool size and requires no new infrastructure. The institutional attestation model from arXiv:2606.26298 requires attestors that are separate authoritative sources, cryptographic signing infrastructure to bind preconditions to intent, and a tamper-evident audit ledger. The overhead is closer to deploying a PKI than calling a model API. That cost becomes easier to justify when co-failure evidence shows that multi-model agreement on high-risk actions is correlated, not independent.