Do LLM Agent Societies Develop Their Own Authority Hierarchies?

In at least one long-horizon simulation, yes. According to a June 2026 preprint (arXiv:2606.23764), LLM agent societies run under minimal interaction rules, with no authority specified, spontaneously produce stable labor specialization, guanxi-based economic ethics, relational decay of cooperation, emergent relational authority, and clan-based center-periphery stratification. The hierarchies appear without being designed into the protocol.

What did the CAREB-MAS paper actually find?

The CAREB-MAS framework, detailed in arXiv:2606.23764, runs long-horizon simulations of LLM agents and reports that they “spontaneously reproduce five core Differential Order phenomena: stable labor specialization, guanxi-based economic ethics, relational decay of cooperation, emergent relational authority, and clan-based center-periphery stratification.” Labor specialization means agents settle into complementary roles; relational decay means cooperation weakens as relational distance grows; the guanxi and clan terms denote favor-exchange and center-periphery structures drawn from the sociology of rural China. The underlying thesis is Fei Xiaotong’s mid-20th-century “Differential Order Pattern,” an account of how rural Chinese society organized itself around concentric circles of relational obligation rather than formal institutions. CAREB-MAS is a social-simulation exercise testing that thesis against LLM agents, not a production-systems failure report, and the distinction matters for how far to push the conclusions.

The notable engineering point is what the macro environment does not specify. CAREB-MAS “specifies only individual production, preference-based allocation, and minimal interaction protocols.” Authority and stratification emerge anyway. Agents reason through an emotion-ethics-belief chain grounded in Affect Control Theory, Social Identity Theory, and Durkheimian collective affect, and they maintain dynamically evolving egocentric identities across the run. The paper frames prior LLM simulations as having “mainly addressed short-term coordination rather than long-horizon social structure,” and positions itself as the long-horizon counterweight. That framing is the hook for the rest of this article: the phenomena it reports only show up over long horizons.

The result lands amid a burst of June 2026 submissions to arXiv’s Multiagent Systems (cs.MA) listing, which carried roughly eighty recent entries across early June, and alongside an established survey canon that already reserves a thread for “Agent Society: From Individuality to Sociality” (LLM-Agent-Paper-List). The subfield is publishing at high volume just as this result appears.

Does emergent authority threaten an explicit message bus?

The honest answer is that the paper does not say, and no source examined here discusses MCP, A2A, or any named coordination protocol. The claim that emergent dominance overrides your message bus or voting scheme is an extrapolation, and should be read as one.

The extrapolation runs like this. If agents in long-horizon runs defer along a relational gradient, then the explicit coordination topology a builder specifies is not the only coordination layer running. A deference gradient could shape which agent’s proposals get taken up, who initiates, who is overruled, and which node becomes a coordination bottleneck, all without any code ordaining it. “Emergent relational authority” denotes sociological stratification rather than a single de-facto authority node, so the failure mode is more diffuse than “one agent takes over.” But the systems-engineering implication holds: behavior a builder assumed was fully determined by the orchestration graph may be partly determined by interaction history instead.

If the extrapolation holds, it breaks a load-bearing assumption. Most coordination-protocol designs, and most of the testing written for them, proceed as though the protocol is the whole coordination story. If a deference layer forms underneath, two systems can share an identical topology and still behave differently, because their interaction histories have stratified along different lines. The protocol would still be running, technically intact. It just would not be the thing actually settling decisions.
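
A toy model makes the point concrete. The sketch below is not the CAREB-MAS mechanism and is not drawn from any of the cited papers. It assumes, purely for illustration, that an agent's chance of having its proposal adopted grows with how often it has been adopted before (a rich-get-richer rule). The function names and the reinforcement parameter are hypothetical. The topology is identical across runs; only the random seed, standing in for interaction history, differs.

```python
import random


def simulate_adoptions(n_agents, turns, seed, reinforcement=1.0):
    """Toy model (an assumption, not from the paper): each turn one agent's
    proposal is adopted with probability proportional to
    1 + reinforcement * (its past adoptions)."""
    rng = random.Random(seed)
    adopted = [0] * n_agents
    for _ in range(turns):
        weights = [1.0 + reinforcement * a for a in adopted]
        winner = rng.choices(range(n_agents), weights=weights, k=1)[0]
        adopted[winner] += 1
    return adopted


def shares(adopted):
    """Fraction of all adopted proposals attributed to each agent."""
    total = sum(adopted)
    return [a / total for a in adopted]


if __name__ == "__main__":
    # Same flat eight-agent setup, two different interaction histories.
    for seed in (1, 2):
        counts = simulate_adoptions(n_agents=8, turns=1000, seed=seed)
        print(f"seed={seed}", [round(s, 2) for s in shares(counts)])
```

Run with different seeds, the same flat eight-agent setup will typically concentrate adoptions on different agents. That is the sense in which history, rather than topology, can end up settling decisions in this toy.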

This is the part that matters for builders and the part the paper does not prove. Treat it as a hypothesis worth testing rather than a settled result.

Where is the testing gap in coordination-protocol design?

The testing gap is one of regime mismatch: most coordination protocols are validated against short functional demonstrations, while the stratification CAREB-MAS reports only surfaces across large populations and over many turns. It is a long-horizon effect built from accumulated interaction, not a property visible in a three-turn smoke test.

The consequence is that a demo passing in a short burst tells you nothing about whether one node silently accrues deference over a thousand-turn run. The conditions for accrual have not been met, so you are testing a different system than the one you ship. The collective-behaviour evaluation in arXiv:2602.16662 approaches the same problem from the population side, scaling its tests to “populations of hundreds of agents.”
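
One way to keep the regime mismatch visible in a harness is to enumerate the regimes explicitly, so the smoke test is one cell in a matrix rather than the whole test plan. The sketch below is a minimal illustration. RunConfig, build_matrix, and the default values (a two-agent stub, a three-turn smoke test, a thousand-turn long run) are hypothetical, chosen only to echo the figures used in this article.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class RunConfig:
    """One cell of the test matrix (hypothetical structure)."""
    population: int
    turns: int


def build_matrix(fleet_population: int, long_turns: int = 1000,
                 stub_population: int = 2, smoke_turns: int = 3) -> List[RunConfig]:
    """Cross small and fleet-size populations with short and long horizons."""
    populations = sorted({stub_population, fleet_population})
    horizons = sorted({smoke_turns, long_turns})
    return [RunConfig(p, t) for p, t in product(populations, horizons)]


if __name__ == "__main__":
    for cfg in build_matrix(fleet_population=200):
        print(cfg)
```

Only the last cell, fleet-size population over the long horizon, matches the conditions under which accumulated deference could plausibly form. The other cells remain useful as functional checks.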

What happens to collective behavior at hundreds of agents?

The hundreds-of-agents evaluation in arXiv:2602.16662 warns of “a significant risk of convergence to poor societal equilibria, particularly when the relative benefit of cooperation diminishes and population sizes increase.” Two things matter. The effect scales with population size, which reinforces the case for population-dependent testing. And newer models do worse than older ones on this axis, an inverted-progress result that should make a builder skeptical of assuming an upgrade fixes a coordination problem. The same logic applies to emergent authority: if pathologies strengthen with population, a small-population test is not a conservative proxy for a large-population deployment. It is a different regime.

The same work encodes LLM-generated strategies as algorithms “enabling inspection prior to deployment.” That is a probe pattern worth borrowing: render the strategy an agent claims to follow as something legible, then check whether the executed behavior matches it across large populations. If behavior diverges from the declared strategy, you have a direct measurement of the gap between protocol-as-specified and protocol-as-run, which is precisely the gap an emergent-authority effect would widen.
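
A minimal sketch of that probe, under stated assumptions: the declared strategy is written as a plain function, the harness logs (observation, action) pairs, and the divergence rate is the fraction of logged steps where the two disagree. The observation encoding, the tit_for_tat example, and divergence_rate are all hypothetical illustrations, not the representation used in arXiv:2602.16662.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical encoding: (own_last_action, partner_last_action),
# where 1 = cooperate and 0 = defect.
Observation = Tuple[int, int]
Action = int


def tit_for_tat(obs: Observation) -> Action:
    """Example declared strategy: copy the partner's last action."""
    _own_last, partner_last = obs
    return partner_last


def divergence_rate(declared: Callable[[Observation], Action],
                    log: Iterable[Tuple[Observation, Action]]) -> float:
    """Fraction of logged steps where the executed action differs from
    what the declared strategy would have chosen."""
    entries = list(log)
    if not entries:
        return 0.0
    mismatches = sum(1 for obs, action in entries if declared(obs) != action)
    return mismatches / len(entries)


if __name__ == "__main__":
    executed = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((0, 0), 0)]
    print(divergence_rate(tit_for_tat, executed))  # one mismatch in four steps
```

Computing the rate per agent and per population size would show whether the gap between declared and executed strategy widens as the fleet grows.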

How does the governed-memory parallel reinforce this?

The closest production-systems analogue in the recent literature is arXiv:2606.24535, “Governed Shared Memory for Multi-Agent LLM Systems.” Its core claim is that “long-context retrieval alone is insufficient for production multi-agent memory,” and that governed shared memory “demands explicit systems-level abstractions.” It formalizes four failure modes, including unauthorized leakage and contradiction persistence, the latter being the case where conflicting state persists because nothing reconciles it.

The sharpest line for this argument is methodological: “live evaluation is vital to expose enforcement and pipeline-ordering failures missed by design-only treatments.” That is the same claim the CAREB-MAS result implies for coordination protocols. Design-only reasoning, where you argue from the topology that the system will behave as intended, misses the failures that only appear when the system actually runs. The memory paper makes this point about data leakage and ordering; the authority result implies it about deference and stratification. The shared lesson is that design-time correctness is not runtime correctness in multi-agent systems, and the distance between the two grows with population and horizon.

Hierarchical multi-agent reinforcement learning in arXiv:2606.24010 provides the deliberate-design contrast: researchers there engineer explicit high-level and low-level authority for safety guarantees. The difference is instructive. When you engineer the hierarchy, you know where it sits and you can bound it. When it emerges, you do not, and the guarantees you can make about a designed hierarchy do not transfer to one you never specified.

What should builders change in their multi-agent test harnesses?

If you take the extrapolation seriously, the harness changes are specific.

Run long. Stratification is a long-horizon effect. A coordination-protocol test that only covers short bursts cannot see it. Add a long-run, many-turn configuration alongside the functional demo.
Run at fleet size. The poor-equilibria result scales with population size. Test at the population you expect in production, not a two-agent stub.
Measure deference, not just correctness. Track which agent’s proposals get adopted, who initiates, who is overruled, and whether that distribution stabilizes into a gradient. A flat topology that develops a persistent deference gradient is the signal.
Inspect declared strategy against executed strategy. Borrow the encode-strategy-as-algorithm probe from arXiv:2602.16662. If the strategy an agent claims to follow diverges from what it does across large populations, you have measured the gap directly.
Treat live evaluation as primary. Per arXiv:2606.24535, design-only treatments miss enforcement and ordering failures. The same applies to emergent authority: argue from the topology, then verify by running.
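
The deference measurement above can be sketched as a windowed concentration metric. The example below assumes the harness emits events of the form (turn, proposer, adopted) and uses the Gini coefficient of adopted-proposal counts as a stand-in for "deference gradient". The event format, the function names, and the choice of Gini are assumptions made for illustration; none of the cited papers prescribes them.

```python
from collections import Counter


def adoption_counts(events, agents):
    """Count adopted proposals per agent; agents with none stay at zero."""
    counts = Counter({a: 0 for a in agents})
    for _turn, proposer, adopted in events:
        if adopted:
            counts[proposer] += 1
    return counts


def gini(values):
    """Gini coefficient: 0 means evenly spread, (n-1)/n means one agent has all."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def windowed_gini(events, agents, window):
    """Gini of adoption counts for each consecutive block of `window` turns."""
    by_window = {}
    for turn, proposer, adopted in events:
        by_window.setdefault(turn // window, []).append((turn, proposer, adopted))
    return [gini(adoption_counts(evts, agents).values())
            for _, evts in sorted(by_window.items())]


if __name__ == "__main__":
    agents = ["a", "b", "c", "d"]
    events = [(0, "a", True), (1, "b", True), (2, "c", True), (3, "d", True),
              (4, "a", True), (5, "a", True), (6, "a", True), (7, "a", True)]
    print(windowed_gini(events, agents, window=4))  # even first window, concentrated second
```

A series that rises and then stays high across late windows, with the same agents at the top each time, is the kind of persistent gradient worth investigating. A single spike is more likely noise.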

None of this is proven by CAREB-MAS, which is a sociology simulation. What the paper does give you is a reason to suspect that the coordination layer you did not write can develop structure of its own, and a concrete set of phenomena to look for once you run long enough to see them.

Frequently Asked Questions

Where does engineered hierarchy in multi-agent RL fit against this?

The hierarchical RL study arXiv:2606.24010 uses constraint manifold control to build an explicit high-level/low-level authority split for safety guarantees, so designers know where the hierarchy sits and can bound it. CAREB-MAS reports stratification appearing under rules that specify only individual production and preference-based allocation, with no level structure ordained. Two June 2026 neighbors, ASALT (arXiv:2606.24601) on RL lateral transfer and OpenThoughts-Agent (arXiv:2606.24855) on training-data recipes, similarly presuppose a designed topology.

What is the strongest reason to doubt the production warning?

The source is a 37-page social-simulation preprint submitted by Zhiyuan Ji on 22 June 2026 that tests Fei Xiaotong’s Differential Order Pattern against prompted LLM agents, not a field report from a deployed fleet. Its ACL 2026 Findings acceptance is author-reported and unverified, and no source examined discusses MCP, A2A, or any named coordination protocol. The claim that emergent deference overrides an orchestration bus is the writer’s extrapolation, one the paper does not make.

Does the spontaneous-stratification result extend to RL-trained fleets?

CAREB-MAS confines itself to LLM agents reasoning through an emotion-ethics-belief chain with dynamically evolving egocentric identities, a prompted social-role regime. The June 2026 cs.MA entries that touch trained fleets, OpenThoughts-Agent on data recipes and ASALT on lateral policy transfer, operate on learned policies rather than prompted roles, so the stratification claim does not automatically carry over to RL-trained fleets.

Does the June 2026 multi-agent literature already cover emergent authority broadly?

Most current coverage of multi-agent LLM work sits at the benchmark or framework level: agent-training data recipes, RAG privacy rewriting, MARL transfer, and safety-constrained hierarchical RL. Long-horizon social structure, and the systems question of whether emergent deference destabilizes a specified topology, remain sparsely addressed. That is the gap CAREB-MAS and the governed-memory study begin to fill.