Distributed Training Breaks the Compute Thresholds Behind AI Regulation

Every major AI governance framework in effect today that uses a compute trigger ties it to a single number: cumulative training FLOP. The EU AI Act presumes “high impact capabilities” when a model is trained above 10^25 FLOP (Article 51). US Executive Order 14110 set a reporting threshold at 10^26 FLOP, but it was revoked on the first day of the Trump administration in January 2025. Its replacement order named no successor figure, so there is currently no federal compute-threshold trigger at all [Updated June 2026] (Removing Barriers to American Leadership in AI). California’s SB 53, the Transparency in Frontier Artificial Intelligence Act signed in September 2025, keeps the 10^26 FLOP bar but uses it to compel safety disclosures rather than to cap training, with civil penalties up to $1 million per violation [Updated June 2026]. A paper published on arXiv in May 2026 argues that distributed training algorithms have advanced far enough to make those thresholds an accounting fiction.

The argument is mechanical, not speculative: if you split a training run across enough small clusters, no individual cluster crosses the FLOP cap, even though the combined run produces a model that would have triggered the threshold if trained in one place. The paper models this under adversarial constraints and concludes that frontier-scale models are, in principle, trainable on dispersed nodes connected by consumer-grade internet (arXiv:2605.29359).
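The splitting arithmetic can be sketched in a few lines. This is a minimal illustration, not the paper's model: the 3.6e25 FLOP total is a hypothetical run size, the function name is invented here, and the sketch assumes the run divides into equal shares with no coordination overhead.

```python
import math

def min_equal_shares_below_cap(total_flop: float, cap_flop: float) -> int:
    """Smallest number of equal shares such that each share is strictly below the cap."""
    return math.floor(total_flop / cap_flop) + 1

TOTAL_FLOP = 3.6e25  # hypothetical run size, chosen only for illustration

# Caps taken from the thresholds discussed in the article.
for label, cap in (("EU 10^25", 1e25), ("California 10^26", 1e26)):
    shares = min_equal_shares_below_cap(TOTAL_FLOP, cap)
    print(f"{label}: {shares} share(s) of {TOTAL_FLOP / shares:.1e} FLOP each")
```

Under these assumptions the hypothetical run needs four equal shares to keep each one under 10^25 FLOP, and is already under 10^26 FLOP without splitting.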

How Communication-Efficient Training Shrinks the Bandwidth Wall

Standard distributed training assumes high-bandwidth, low-latency interconnects. Recent algorithmic advances cut inter-node communication to the point where frontier-scale training becomes theoretically possible on bandwidth orders of magnitude below datacenter standards (arXiv:2605.29359). The core technique is to perform more local gradient updates before synchronizing, which trades extra compute for less communication.
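A back-of-envelope sketch shows why local steps matter so much. Everything here is an assumption for illustration rather than a figure from the paper: the 10B parameter count, 2-byte values, 10-second step time, and the compression factor are hypothetical, and the estimate ignores all-reduce topology factors and any overlap of communication with compute.

```python
def avg_sync_bandwidth_gbps(params, bytes_per_value, local_steps,
                            step_seconds, compression=1.0):
    """Average per-node bandwidth (Gbit/s) to ship one model-sized update per sync round."""
    bytes_per_sync = params * bytes_per_value / compression
    seconds_per_round = local_steps * step_seconds  # time between synchronizations
    return bytes_per_sync * 8 / seconds_per_round / 1e9

# Hypothetical setup: 10B parameters, 2-byte values, 10 s per local step.
for local_steps, compression in [(1, 1.0), (500, 1.0), (500, 100.0)]:
    gbps = avg_sync_bandwidth_gbps(10e9, 2, local_steps, 10.0, compression)
    print(f"local steps={local_steps:>3}, compression={compression:>5.0f}x "
          f"-> {gbps * 1000:.2f} Mbit/s")
```

With these assumed numbers, synchronizing every step needs about 16 Gbit/s per node, 500 local steps between synchronizations bring that to about 32 Mbit/s, and a further 100x compression of the update brings it to about 0.32 Mbit/s.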

A separate empirical result reinforces the point. Work on distributed training under packet loss demonstrated that Llama 2 7B trained across 64 GPUs can tolerate 10% random packet loss with at most a 0.8% change in perplexity (arXiv:2507.07114). That finding addresses a different dimension of unreliable networks but supports the same conclusion: distributed training can stay robust on infrastructure that would have been considered unusable even a few years ago.

The Empirical Record Has Caught Up With the Model

When the paper was drafted, the strongest public result on training over unreliable networks was a 7-billion-parameter run with simulated packet loss. That floor has risen sharply since, and on real dispersed hardware rather than in simulation [Updated June 2026]. Prime Intellect trained INTELLECT-1, a 10B model, across nodes on three continents using DiLoCo, then released INTELLECT-2 in May 2025: a 32B reasoning model trained by globally distributed reinforcement learning over a heterogeneous, free-joining pool of contributed GPUs (Prime Intellect). Nous Research’s Psyche network, built on its DisTrO optimizer and coordinated through a Solana smart contract, drove a 40B base model to 20 trillion tokens entirely over the internet, which it reports as the largest publicly verifiable decentralized pre-training run to date. In March 2026, the Covenant-72B project pre-trained a 72-billion-parameter model across open, permissionless peers using the SparseLoCo optimizer, with peers joining and leaving mid-run.

Two caveats keep this short of vindicating the paper’s strongest claim. INTELLECT-2 was a reinforcement-learning post-train on an existing 32B base, not a from-scratch pre-train, and the over-the-internet pre-training runs (Consilience-40B at 20 trillion tokens, Covenant-72B at roughly 1.1 trillion) still sit well below the total training compute of a centrally trained frontier model like Llama 3.1 405B, which saw 15 trillion tokens. Consilience-40B saw more tokens than that but has about a tenth of the parameters, while Covenant-72B saw far fewer tokens. The decentralized runs also carry verification overhead the feasibility model does not fully price in. Psyche and Prime Intellect both spend compute proving that contributed work was done honestly, because a permissionless network cannot trust its own nodes. None of that rescues the threshold. The trend line points one way, and the optimizer lineage, from DiLoCo to DisTrO to SparseLoCo, is a record of the communication wall falling release by release.
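One way to put these runs on a common scale is approximate training compute. The sketch below uses the common 6 x parameters x tokens rule of thumb for dense transformers, which is an assumption introduced here and not a calculation from the paper; the parameter and token counts are the ones quoted above.

```python
def training_flop(params: float, tokens: float) -> float:
    """Rough dense-transformer training compute: 6 * parameters * tokens."""
    return 6.0 * params * tokens

# (parameters, training tokens) as quoted in the text.
runs = {
    "Consilience-40B": (40e9, 20e12),
    "Covenant-72B": (72e9, 1.1e12),
    "Llama 3.1 405B": (405e9, 15e12),
}

flop = {name: training_flop(n, d) for name, (n, d) in runs.items()}
baseline = flop["Llama 3.1 405B"]

for name, f in flop.items():
    print(f"{name:16s} {f:.2e} FLOP ({f / baseline:.1%} of Llama 3.1 405B)")
```

By this estimate the 40B run lands at roughly 13% of the Llama 3.1 405B compute budget and the 72B run at roughly 1%, which is why token count or parameter count alone can make the gap look either smaller or larger than it is.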

What the Feasibility Model Actually Shows (and Doesn’t)

The paper’s adversarial model is deliberately constrained: each node is limited in compute, bandwidth is capped at consumer-grade levels, and latency is set to reflect typical internet connections. These parameters were chosen because they approximate what a well-resourced actor could assemble without attracting the attention of compute-governance regimes that track large GPU clusters. Scher et al. (2025) proposed the most restrictive regime in the literature: banning pre-training above 10^24 FLOP (arXiv:2605.29359). The paper’s modeling suggests that models above that banned threshold could, in theory, be trained on sub-threshold nodes connected by consumer-grade internet.

What the paper does not show is a live demonstration. No one has trained a frontier-scale model across dispersed consumer-internet nodes. The feasibility window is also projected over a multi-year timeframe, meaning the technique may require further algorithmic improvements before it works at that scale.

The paper also acknowledges additional bypass vectors beyond distributed training. Model distillation and mixture-of-agents techniques could produce capable models without any single training run crossing a FLOP threshold. The paper flags these as complicating factors but does not model them quantitatively (arXiv:2605.29359).

The Regulatory Patch: Chips, Memory, and Forensic Accounting

If FLOP-based thresholds are bypassable, the paper recommends shifting the regulatory target. Its proposed countermeasures include chip-level tracking, forensic accounting of hardware purchases, whistleblowing incentives, and registration requirements triggered by both compute throughput and accelerator memory thresholds rather than FLOP counts alone (arXiv:2605.29359). The logic is straightforward: tracking physical hardware is harder to spoof than auditing a training run’s compute budget after the fact.

The EU’s current framework already contains hooks for amendment. Article 51 grants the Commission authority to adjust the 10^25 FLOP threshold via delegated acts to keep pace with algorithmic improvements (Article 51). The systemic-risk threshold is noted as “under review” in current EU guidance (EU AI Act GPAI obligations). EU guidelines also require providers to notify the Commission within two weeks when models meet or are expected to meet systemic-risk thresholds, extending to the planning phase (EU regulatory guidelines).

The problem is that lowering the FLOP threshold to catch distributed training also sweeps in smaller, legitimate actors whose combined compute is modest by any standard. Registration triggered by accelerator memory or cluster size, as the paper suggests, is a more targeted signal, but it requires a hardware-tracking infrastructure that does not currently exist at international scale.

Where the Thresholds Stand in Mid-2026

The regulatory picture has diverged sharply by jurisdiction since the paper appeared [Updated June 2026]. At the US federal level there is now no compute-threshold reporting trigger at all: Executive Order 14110’s 10^26 FLOP line was revoked in January 2025, and the replacement order set no successor. The only live FLOP threshold in US law is California’s, and SB 53 uses its 10^26 bar to require published safety frameworks and incident reporting from large developers rather than to gate a training run.

The EU is moving in the opposite direction. The Article 51 obligations for systemic-risk models became binding in August 2025, and the Commission’s enforcement powers activate in August 2026. By early 2026 around a dozen models had already crossed the 10^25 FLOP systemic-risk line, including systems from OpenAI, Google, Anthropic, Meta, and Mistral, so the threshold is no longer a hypothetical line in the text. The two-tier structure also remains intact: 10^23 FLOP triggers baseline GPAI classification, 10^25 triggers the systemic-risk presumption. That timing is the awkward part. Enforcement arrives precisely as the cheapest evasion route, splitting a run across sub-threshold sites, is being demonstrated at successively larger scales in public. A static FLOP count is the one variable a determined actor can engineer around, and the EU’s own provision to revise the number by delegated act is a standing admission that it is provisional.

What would survive distributed training is a trigger that does not depend on where the FLOPs are spent. Capability-based evaluation, accelerator-memory and interconnect registration, and chip-level attestation all aim at that target, but each trades the clean auditability of a single number for measurement problems of its own, and none is implemented across borders today. The same gap shows up in adjacent governance debates, from the EU AI Act’s transparency rule to the structural lessons aviation certification holds for AI oversight: the regulation is only as enforceable as the technical artifact it can actually inspect.

Why Enforcement Gets Harder, Not Easier, After the Fix

The structural problem is not that regulators picked the wrong FLOP number. It is that any static threshold will erode as training algorithms become more communication-efficient. Each improvement widens the gap between what a threshold was designed to catch and what a dispersed training run actually needs.

The enforcement problem compounds. In a world where consumer-grade nodes with a handful of accelerators can collectively train a frontier model, the observable surface area for enforcement expands from a few hundred data centers to potentially thousands of small installations. Chip-level provenance tracking, if it existed, would narrow that surface. But building that tracking system requires the same international coordination that has stalled every prior attempt at compute governance, and the technical capability to evade it improves on a shorter timeline than the regulatory process typically operates on.

Frequently Asked Questions

Does the EU’s 10^23 FLOP GPAI classification threshold also get bypassed by distributed training?

Yes. The EU operates two tiers: 10^23 FLOP triggers GPAI classification, while 10^25 FLOP triggers systemic-risk designation. Distributed training splits the run so no single node crosses either bar. The 10^23 FLOP tier is low enough that a single node with 16 H100 GPUs running a long training job could approach it on its own, meaning regulators cannot simply lower the threshold without pulling individual research workstations into the reporting regime.
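The 16-GPU claim can be checked with rough arithmetic. The peak throughput (about 1e15 FLOP/s per H100 for dense BF16) and the 40% sustained utilization are assumptions made for this sketch, not figures from the paper.

```python
SECONDS_PER_DAY = 86_400

def days_to_reach(target_flop, gpus, peak_flop_per_s, utilization):
    """Days of continuous training needed to accumulate target_flop."""
    sustained = gpus * peak_flop_per_s * utilization  # effective FLOP/s for the node
    return target_flop / sustained / SECONDS_PER_DAY

H100_PEAK_FLOP_PER_S = 1e15  # assumed approximate dense BF16 peak
UTILIZATION = 0.4            # assumed fraction of peak actually sustained

for label, target in (("10^23", 1e23), ("10^25", 1e25)):
    days = days_to_reach(target, 16, H100_PEAK_FLOP_PER_S, UTILIZATION)
    print(f"{label} FLOP on 16 GPUs: about {days:,.0f} days ({days / 365:.1f} years)")
```

Under these assumptions a 16-GPU node reaches 10^23 FLOP in roughly six months of continuous training, while 10^25 FLOP would take the same node on the order of fifty years.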

How do enforcement penalties differ between the EU and California frameworks?

The EU AI Act permits fines up to 3% of a provider’s global annual turnover or EUR 15 million, whichever is higher. California’s SB 53 caps civil penalties at $1 million per violation. For a large AI lab, EU exposure could be two to three orders of magnitude larger. Neither framework’s penalty structure currently accounts for whether a training run was distributed across sub-threshold nodes.
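The order-of-magnitude comparison can be sketched as follows. The turnover figures are hypothetical, the comparison treats EUR and USD as roughly interchangeable, and it assumes a single violation on each side.

```python
import math

def eu_max_fine_eur(global_turnover_eur: float) -> float:
    """EU ceiling as described above: 3% of global turnover or EUR 15 million, whichever is higher."""
    return max(0.03 * global_turnover_eur, 15e6)

SB53_CAP_USD = 1e6  # per-violation cap cited in the article

# Hypothetical annual turnovers, for illustration only.
for turnover in (1e9, 1e10, 1e11):
    fine = eu_max_fine_eur(turnover)
    ratio = fine / SB53_CAP_USD
    print(f"turnover {turnover:.0e}: EU ceiling {fine:.1e}, "
          f"{ratio:,.0f}x the SB 53 cap (about 10^{math.log10(ratio):.1f})")
```

For these assumed turnovers the EU ceiling is 30x, 300x, and 3,000x the California cap, so the "two to three orders of magnitude" range corresponds to labs with turnover in the billions to tens of billions.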

Is there a tool policymakers can use to model distributed training feasibility?

The paper’s authors published an interactive simulator at intelligence.org/research/distributed-training-simulator with open-source code on GitHub. Users specify bandwidth, latency, compute-per-node, and target model size, and the tool reports whether the configuration is feasible under current algorithmic assumptions. It is designed for governance bodies that need to set thresholds grounded in technical constraints rather than static FLOP counts.

How large is the gap between what was modeled and what has been empirically demonstrated?

The paper’s feasibility claim targets a Llama 3.1 405B-class model (roughly 405 billion parameters) trained across dispersed nodes, and that specific scale remains modeling rather than a completed run. But the empirical floor has moved well past the Llama 2 7B packet-loss study the original analysis leaned on [Updated June 2026]. Prime Intellect’s INTELLECT-2 (32B) was trained by globally distributed reinforcement learning over a permissionless GPU swarm, Nous Research’s Psyche network pushed a 40B base model to 20 trillion tokens over the internet, and Covenant-72B pre-trained a 72-billion-parameter model across trustless peers in March 2026. Measured in parameter count, the gap between demonstrated and frontier scale has narrowed from roughly 58x (405B against 7B) to under 6x (405B against 72B), and these runs used real dispersed hardware rather than a simulated network. The remaining distance is in token budget and from-scratch pre-training at the largest sizes, not in whether the approach works.