groundy

A Per-Neuron Sequence Model Was Withdrawn From arXiv as Coverage Hailed It

TND proposed per-neuron dynamics as a sequence-modeling primitive, then was withdrawn from arXiv for accuracy errors the day before coverage hailed it as the start of “a new era in sequence modeling.”


On June 19, 2026, a preprint called Topological Neural Dynamics (TND) proposed modeling sequences not with attention but with per-neuron dynamical updates over a directed graph. Four days later its author withdrew it from arXiv for accuracy errors. The day after that, Machine Brief published a column calling the same paper the start of “a new era in sequence modeling,” with no mention of the retraction. The gap between arXiv status and secondary coverage is the actual story.

What does TND actually propose?

TND’s central move is to relocate the unit of computation from the layer to the individual neuron, treating sequence processing as a dynamical system evolving over an explicit neuron-interaction graph rather than as a stack of layer transforms.

In layer-wise models, every neuron in a layer co-evolves through a shared operator. A vanilla RNN applies the same recurrence h_t = f(h_{t-1}, x_t) across all hidden units. An attention layer mixes token representations across positions with one operator. Continuous-time networks integrate a shared differential equation per layer. The unit of computation is the layer, and the dynamics are uniform across it.

TND breaks that symmetry. It represents the system as three pieces: a directed graph of neurons, an interaction operator over the edges, and a local dynamics function attached to each neuron. Each neuron carries its own update rule, and collective computation emerges from how those local rules couple through the graph. The wiring is the architecture; the layer, as a uniform transform, disappears.
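The three pieces can be made concrete with a toy sketch. Everything below is hypothetical: the names `step`, `leaky`, and `gate`, the choice of update rules, and the decision to feed the external input to every neuron are illustrative assumptions, not the formulation in the withdrawn paper. It shows only the structural idea of a directed graph, weighted edges as the interaction operator, and a different local rule per neuron.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical illustration only: not the TND paper's equations.
Edge = Tuple[int, int]                         # (source neuron, target neuron)
Rule = Callable[[float, float, float], float]  # (own state, incoming message, input) -> new state


def step(state: List[float], edges: List[Edge], weights: Dict[Edge, float],
         local_rules: List[Rule], x_t: float) -> List[float]:
    """One synchronous update over the neuron graph."""
    # Interaction operator: weighted sum of source states along each directed edge.
    incoming = [0.0] * len(state)
    for src, dst in edges:
        incoming[dst] += weights[(src, dst)] * state[src]
    # Local dynamics: each neuron applies its own rule to its own state.
    return [local_rules[i](state[i], incoming[i], x_t) for i in range(len(state))]


def leaky(decay: float) -> Rule:
    """Leaky integrator with a per-neuron decay constant."""
    def rule(h: float, msg: float, x: float) -> float:
        return decay * h + (1.0 - decay) * math.tanh(msg + x)
    return rule


def gate(h: float, msg: float, x: float) -> float:
    """Gated neuron: the incoming message decides how much old state to keep."""
    g = 1.0 / (1.0 + math.exp(-msg))
    return g * h + (1.0 - g) * x


# Three neurons wired in a directed cycle, each with a different rule.
edges = [(0, 1), (1, 2), (2, 0)]
weights = {e: 0.5 for e in edges}
rules = [leaky(0.9), leaky(0.5), gate]

state = [0.0, 0.0, 0.0]
for x in [1.0, 0.0, 0.0]:
    state = step(state, edges, weights, rules, x)
print(state)
```

Replacing `rules` with three copies of the same function recovers the layer-wise picture, where every unit shares one operator. The per-neuron framing is the case where they differ.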

The payoff, if the formulation held, would be a third sequence-modeling primitive alongside quadratic attention and structured recurrence. The design question would shift from “which attention variant” to “which neuron dynamics and which wiring.” There is also an interpretability angle: if each neuron’s role is explicit in its local function and its edges, what a neuron computes is, in principle, readable off the graph, which is hard to claim for a transformer where meaning is smeared across heads. Whether that promise survives scaling is exactly the question the submitted paper did not answer.

Why was the paper withdrawn?

TND was pulled from arXiv on 2026-06-23 by its author, Borui Cai, four days after it appeared, with a note that “the experiments have some errors regarding model accuracy and need to be updated.”

The v2 record is a 1 KB stub marked withdrawn. Cai did not specify which experiments were affected, which baselines were wrong, or how large the errors were. arXiv’s withdrawal mechanism does not delete a record; it leaves the abstract in place and appends a retraction comment. In this case the v1 PDF is also no longer downloadable, so what remains is the abstract text and the one-line admission.

The timing is what makes the coverage gap readable. TND appeared June 19. It was withdrawn June 23. Machine Brief’s column ran June 24. A same-day check of the arXiv landing page on the 24th would have shown the withdrawal comment at the top of the record. It was not reflected in the column, which instead framed TND as a provocation that “fundamentally challenges the conventions of sequence modeling.”

This is the failure mode that ages worst for secondary coverage. A preprint is a draft, and treating a draft as a finished result is a known sin. Treating a withdrawn draft as a finished result is a worse one, and it compounds: the column is now the artifact readers find when they search for the paper, and it tells a story the primary source no longer supports. The correction will travel slower than the original.

Can the Pong result be trusted?

No. The headline figure, a mean of 17.47 consecutive catches per round on a single-player Pong behavior-cloning task, comes from the now-withdrawn abstract, and the supporting PDF is no longer on arXiv.

Even before the retraction, the figure was weak evidence for the claim attached to it. The abstract described 17.47 catches as more than three times the strongest of five baselines: Vanilla RNN, Sparse RNN, LSTM, CfC, and a transformer. The comparison was not parameter-matched, and the transformer baseline was unspecified beyond the word. The only evaluation domain was Pong.

Pong behavior cloning tests whether an architecture can fit a short-horizon control policy from demonstrations. The state-to-action mapping has a temporal horizon of a few frames. There is no dependency spanning hundreds or thousands of steps, which is precisely the regime where attention and structured recurrence are argued to matter. A clean, parameter-matched win on Pong would be evidence about short-horizon function fitting, not about long-range sequence modeling. And the figure is no longer clean: it comes from a retracted abstract, repeated without retraction context in the one piece of secondary coverage that exists (Machine Brief, 2026-06-24).

The honest reading is that there is currently no trustworthy performance number for TND at all.

Where does per-neuron dynamics sit relative to attention and SSMs?

Conceptually, TND belongs to the dynamical-systems thread that treats sequence modeling as integration over time rather than as lookup-and-mix, and it sits alongside Neural SDEs rather than head-to-head with the transformer.

The dominant framing in architecture search has been a two-axis choice. Attention is quadratic in sequence length but fully parallel in training. Recurrence is linear but sequential. Structured state-space models re-entered the conversation by making recurrence structured enough to parallelize, narrowing the cost gap on long sequences while keeping the linear-in-length floor. That debate kept the unit of computation at the layer: attention is a layer operator, and an SSM cell is a layer operator.
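The two axes can be put in rough numbers. The sketch below uses the standard order-of-magnitude counts for one layer (about n²·d operations for full self-attention, about n·d² for a plain recurrence, with n the sequence length and d the width), ignoring constants and the gains from structured state-space tricks. The function names are illustrative.

```python
def attention_cost(seq_len: int, width: int) -> dict:
    """Full self-attention: every token scores every token; the layer runs in one parallel step."""
    return {"ops": seq_len ** 2 * width, "sequential_steps": 1}


def recurrence_cost(seq_len: int, width: int) -> dict:
    """Plain recurrence: one width-by-width state update per token, executed in order."""
    return {"ops": seq_len * width ** 2, "sequential_steps": seq_len}


for n in (1_000, 10_000, 100_000):
    print(n, attention_cost(n, 64), recurrence_cost(n, 64))
```

Under these counts the operation totals cross where sequence length equals width. Beyond that point attention does more work in total but keeps its single parallel step, while recurrence does less work but pays for it in sequential steps.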

The continuous-time lineage, which includes CfC (one of TND’s own baselines) and the Neural SDE line of work from January 2025, pushes on a different assumption. Neural SDEs reinterpret a sequence as discrete samples from an underlying continuous process parameterized by neural-network drift and diffusion terms. TND shares the impulse to replace the layer with dynamics, but it localizes the dynamics per neuron and wires them through a graph, where Neural SDEs keep the dynamics over a continuous state space. The common claim is that the right primitive is a dynamical system, not a layer transform.
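For readers who want the drift and diffusion terms written out, the generic form of a neural SDE is shown below, followed by its Euler-Maruyama discretization. This is the textbook formulation, not notation taken from the TND abstract or from any specific Neural SDE paper; the symbols f_θ, g_θ, and Δt are chosen here for illustration.

```latex
% Generic neural SDE: f_\theta (drift) and g_\theta (diffusion) are neural networks,
% W_t is a Wiener process.
\begin{align}
  dX_t &= f_\theta(X_t, t)\,dt + g_\theta(X_t, t)\,dW_t \\
  % Euler-Maruyama step of size \Delta t, which yields the discrete samples:
  X_{t+\Delta t} &\approx X_t + f_\theta(X_t, t)\,\Delta t
      + g_\theta(X_t, t)\,\sqrt{\Delta t}\;\varepsilon_t,
  \qquad \varepsilon_t \sim \mathcal{N}(0, I)
\end{align}
```

The state X_t here is a single continuous vector. The per-neuron framing would instead give each graph node its own update function and couple the nodes through edges.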

Two consequences matter for builders watching this space. The first is a scaling question the abstract never addresses. A directed graph over N neurons can carry up to N² edges, the same asymptotic cost as full attention. The interesting case is a sparse graph, where each neuron reads from a small neighborhood; but then the question is whether sparse per-neuron dynamics can route information across long distances, or whether the graph needs enough hops to reintroduce sequential cost. That is the attention-versus-recurrence tension restated at the neuron level, and it is unresolved.
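The trade-off can be checked on a toy wiring. The sketch below assumes a hypothetical sparse layout, a directed ring in which each neuron sends to its next k neighbors, and uses breadth-first search to count how many hops a signal needs to reach the farthest neuron. The ring is an assumption chosen for simplicity; the abstract does not say what graph TND used.

```python
from collections import deque
from typing import Dict, List


def dense_edge_count(n: int) -> int:
    """Upper bound on directed edges over n neurons (self-loops included)."""
    return n * n


def sparse_ring(n: int, k: int) -> Dict[int, List[int]]:
    """Hypothetical sparse wiring: neuron i sends to neurons i+1 .. i+k (mod n)."""
    return {i: [(i + j) % n for j in range(1, k + 1)] for i in range(n)}


def max_hops_from(graph: Dict[int, List[int]], source: int) -> int:
    """Breadth-first search: hops needed for a signal to reach the farthest neuron."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


n, k = 1024, 4
ring = sparse_ring(n, k)
print(dense_edge_count(n))                  # 1048576
print(sum(len(out) for out in ring.values()))  # 4096
print(max_hops_from(ring, 0))               # 256
```

With 1,024 neurons and a fan-out of 4, the edge count drops from about a million to 4,096, but a signal needs 256 sequential hops to cross the graph. Wirings with long-range shortcuts would lower that hop count, and choosing them is exactly the open design problem.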

The second is a track-record point. The dynamical-systems thread keeps producing provocative architectures and keeps struggling to show clean wins on the benchmarks that matter for language work. TND as submitted did not break that pattern. Until a corrected version lands parameter-matched results on a standard sequence benchmark, per-neuron dynamics is a design axis worth knowing about, not a result worth acting on.

What would a credible repost need to show?

A corrected TND would need parameter-matched baselines on at least one standard sequence benchmark before the per-neuron framing deserves to be taken as a serious alternative to attention, and the Pong case study alone cannot carry that weight.

Three things, specifically. Parameter-matched comparisons against a modern transformer and at least one structured-recurrence model, so any multiple reflects capacity rather than an undertuned baseline. A non-toy domain, language-modeling perplexity or long-context retrieval, because behavior cloning on Pong proves nothing about long-range dependency handling. And a scaling analysis on the interaction graph, because a neuron-wise primitive is only interesting as an attention alternative if its wiring cost stays sub-quadratic as the model grows.
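Parameter matching is mechanical once the counts are written down. The sketch below uses textbook parameter counts for a vanilla RNN and an LSTM with a single bias vector per gate (some frameworks store two, which changes the totals slightly), and finds the largest hidden size that fits a shared budget. The input width of 8 is a made-up placeholder, not a detail from the Pong setup.

```python
from typing import Callable


def rnn_params(d_in: int, hidden: int) -> int:
    """Vanilla RNN: input weights + recurrent weights + one bias vector."""
    return hidden * (d_in + hidden) + hidden


def lstm_params(d_in: int, hidden: int) -> int:
    """LSTM: four gates, each shaped like a vanilla RNN cell."""
    return 4 * (hidden * (d_in + hidden) + hidden)


def largest_hidden_under(budget: int, d_in: int,
                         count_fn: Callable[[int, int], int]) -> int:
    """Largest hidden size whose parameter count stays within the budget."""
    hidden = 0
    while count_fn(d_in, hidden + 1) <= budget:
        hidden += 1
    return hidden


d_in = 8                                   # hypothetical observation width
budget = rnn_params(d_in, 64)              # 4672 parameters
lstm_hidden = largest_hidden_under(budget, d_in, lstm_params)
print(budget, lstm_hidden, lstm_params(d_in, lstm_hidden))  # 4672 29 4408
```

A 64-unit RNN and a 64-unit LSTM differ by a factor of four in parameters. Matching the budget puts the LSTM at 29 units, which is the comparison a reader needs before a performance multiple means anything.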

The bar is not unreasonable, and the lineage has the tooling. CfC, which already appeared as a TND baseline, has published comparisons on real control tasks. The Neural SDE line of work carries the mathematical scaffolding for honest parameter accounting. A repost that engages those baselines would move the conversation. A repost that quietly fixes the accuracy errors and re-runs the same Pong comparison would not.

Until then, the honest position is to track the design axis and set the numbers aside. Per-neuron dynamics is an idea that earns a paragraph in any survey of sequence-modeling primitives. The specific preprint that put it in front of readers this week is, for the moment, evidence of nothing except how fast a withdrawn draft can become a headline.

Frequently Asked Questions

How can I verify whether TND has been reposted before citing its numbers?

The arXiv ID 2606.21295 follows the YYMM.NNNNN convention, where the prefix 2606 marks June 2026, and the landing URL for that ID stays the same across versions. A corrected repost would surface as a new version entry after the withdrawn v2 rather than as a silent edit, and the withdrawal comment at the top of the page would be replaced or extended by the author.
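The ID convention is simple enough to check in code. The sketch below parses a new-style arXiv identifier, with an optional version suffix, and builds the landing URL. The helper names are illustrative, and it performs no network request, so it cannot itself tell whether a new version exists.

```python
import re
from typing import NamedTuple, Optional


class ArxivId(NamedTuple):
    year: int
    month: int
    number: str
    version: Optional[int]


# New-style identifiers: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
_PATTERN = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})(?:v(\d+))?$")


def parse_arxiv_id(identifier: str) -> ArxivId:
    """Split an arXiv identifier into year, month, sequence number, and version."""
    match = _PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a new-style arXiv ID: {identifier!r}")
    yy, mm, number, version = match.groups()
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in arXiv ID: {identifier!r}")
    return ArxivId(2000 + int(yy), month, number, int(version) if version else None)


def abs_url(identifier: str) -> str:
    """Landing-page URL for the identifier (validated, not fetched)."""
    parse_arxiv_id(identifier)
    return f"https://arxiv.org/abs/{identifier}"


print(parse_arxiv_id("2606.21295"))
print(abs_url("2606.21295"))
```

Requesting the URL without a version suffix returns the latest version, which is where a corrected repost or a continuing withdrawal notice would show up.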

How does TND’s graph dynamics differ from the Neural SDE approach?

Neural SDEs treat a sequence as discrete samples from a continuous stochastic process with learned drift and diffusion terms, so noise is built into the formulation. TND attaches a deterministic local update rule to each neuron and routes signals through directed edges, making the wiring pattern the design variable rather than a stochastic term.

Why does the unspecified transformer baseline weaken the 3x catch claim?

The abstract named a transformer among the five baselines without specifying its size, training budget, or variant, so the comparison cannot be independently rechecked. If that single baseline was undertuned, both the threefold edge and the baseline ranking it produced could collapse, which is exactly the kind of error that tends to prompt a withdrawal note.

What is CfC, the continuous-time model TND listed as a baseline?

Closed-form Continuous-time (CfC) networks, introduced in 2022 by the MIT group behind Liquid Time-constant networks, replace numerical integration of a differential equation with a closed-form approximation, skipping the expensive ODE solver that earlier continuous-time models required. Their published benchmarks on physical control tasks are the reason a corrected TND would need to engage them on a non-toy domain.

What should a team watching this space actually do this week?

Hold off on touching any model stack. Add the TND arXiv landing page to a watchlist, and treat any late-June 2026 secondary coverage that quotes the 17.47 or threefold figure as unreliable unless it carries the withdrawal note from June 23.

sources · 4 cited

  1. Topological Neural Dynamics: A New Era in Sequence Modeling. machinebrief.com. Accessed 2026-06-25.
  2. Topology. Wikipedia, en.wikipedia.org. Accessed 2026-06-25.