There Will Be a Scientific Theory of Deep Learning: What arXiv 2604.21691 Argues and Where It Will Lose

Fourteen deep-learning theorists, several of them co-authors of the partial theories being synthesized, submitted a paper¹ on April 23, 2026. The paper proposes calling the emerging framework “learning mechanics,” points to five converging theoretical strands as evidence that the program is already underway, and catalogs nine open directions where it isn’t. It hit Hacker News within 24 hours: 350 points, 155 comments, mixed reception.² This is not a new theory. It is a coordinated naming bid.

The Pitch

The authorship is not incidental. Arthur Jacot co-authored the original NTK paper in 2018. Eric Michaud authored the quanta hypothesis, which attempts to ground scaling-law exponents in a discrete model of concept learning. Blake Bordelon contributed a formal model of scaling laws. The paper is therefore an insider synthesis: the authors of the cited work are arguing that their own portfolio constitutes a converging program.

The proposed framework is styled after classical mechanics. “Learning mechanics” is defined by its focus on training dynamics, coarse aggregate statistics, and falsifiable quantitative predictions. The companion site learningmechanics.pub³ aggregates essays and promises a forthcoming visual guide to progressive sharpening and edge of stability. Whether the community adopts the name is a different question from whether the underlying results are real.

The Five Strands, and Where Each Breaks

The paper identifies five convergent strands:

Solvable idealized settings. Deep linear networks are analytically tractable. They don’t learn features. Whether the linear case teaches you anything about the nonlinear case at practical scale is Open Direction 1. The paper lists it, which is honest; it doesn’t resolve it.

Tractable limits. The paper is unusually direct about NTK: the infinite-width linearized limit shows “no feature learning,” produces overly pessimistic predictions for sample complexity, and sidesteps “the intrinsically nonconvex optimization phenomena of deep learning.” The conclusion is that NTK “is not the right one to study.” The carveout: “recent evidence suggests that language model fine-tuning occurs in a near-linearized regime.” NTK is not dead. It describes a narrower slice than its advocates initially claimed.

Simple mathematical laws. Scaling laws exist and are empirically robust. The problem is that the paper concedes “no framework can robustly predict the observed exponents a priori from dataset and architectural properties across realistic settings.” Predicting those exponents from first principles is Open Direction 7. This is the deepest concession in the paper. Practitioners most want a formula for whether to scale data or parameters at a given compute budget, and that formula remains empirical.
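To make "that formula remains empirical" concrete, here is a minimal sketch of how a scaling exponent is typically obtained in practice: by fitting a straight line in log-log space to measured losses. The data points, the constants `true_a` and `true_alpha`, and the helper `fit_power_law` are all made up for illustration, and the sketch assumes a pure power law with no irreducible-loss term.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - alpha * log(x); returns (a, alpha)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Hypothetical measurements: loss = a * N^(-alpha), generated synthetically.
true_a, true_alpha = 5.0, 0.08
param_counts = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * n ** (-true_alpha) for n in param_counts]

a_hat, alpha_hat = fit_power_law(param_counts, losses)
print(f"fitted prefactor={a_hat:.3f}, fitted exponent={alpha_hat:.3f}")
```

The exponent here is recovered from training runs after the fact. Nothing in the fit says what it should have been before those runs existed, which is the gap Open Direction 7 names.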

Theories of hyperparameters. The paper treats learning-rate transfer and related hyperparameter scaling results as genuine theory: falsifiable predictions engineers can act on. This strand has the clearest practical payoff.
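As an illustration of what an actionable learning-rate transfer rule can look like, the sketch below assumes one rule from the maximal-update parameterization literature: under Adam, the hidden-layer learning rate is scaled inversely with width. That rule is an assumption brought in for illustration, not necessarily the form the paper endorses, and `transferred_lr`, `tuned_lr`, and `proxy_width` are hypothetical names and values.

```python
def transferred_lr(base_lr, base_width, target_width):
    """Scale a hidden-layer learning rate inversely with width.

    Assumed rule (muP-style, Adam, hidden weights): lr is proportional to 1 / width.
    """
    if base_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / target_width

# Hypothetical: a learning rate tuned on a small proxy model of width 256.
tuned_lr = 3e-3
proxy_width = 256

for width in (256, 1024, 4096):
    print(width, transferred_lr(tuned_lr, proxy_width, width))
```

The prediction is falsifiable in the sense the paper cares about: if the transferred learning rate is far from the optimum found by a sweep at the larger width, the rule has failed for that setup.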

Universal behaviors. Edge of stability is the flagship example. Full-batch gradient descent with learning rate η produces progressive sharpening, followed by stabilization of the largest Hessian eigenvalue near 2/η. Damian et al. (2022) explained the 2/η stabilization via third-order curvature; Cohen et al. (2025) decomposed the dynamics into a time-averaged gradient flow plus oscillations. What the theory cannot yet explain is progressive sharpening in nonlinear networks (Open Direction 8).
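The 2/η figure comes from a classical stability fact that is easy to check by hand. On a one-dimensional quadratic with curvature λ, each gradient-descent step multiplies the iterate by (1 − ηλ), so the iterates shrink only when λ < 2/η. The sketch below checks that threshold on a toy quadratic. It does not reproduce edge of stability itself, which needs a non-quadratic loss whose sharpness changes during training. The function name and the specific values are illustrative.

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2; return |x| at the end."""
    x = x0
    for _ in range(steps):
        grad = sharpness * x  # derivative of 0.5 * sharpness * x**2
        x -= lr * grad
    return abs(x)

lr = 0.1
threshold = 2.0 / lr  # curvature above which plain gradient descent diverges

below = gd_on_quadratic(0.9 * threshold, lr)  # per-step multiplier is about -0.8
above = gd_on_quadratic(1.1 * threshold, lr)  # per-step multiplier is about -1.2

print(f"sharpness below 2/lr: |x| = {below:.2e}")
print(f"sharpness above 2/lr: |x| = {above:.2e}")
```

What makes edge of stability surprising is that real networks hover near this threshold instead of diverging past it, which is the part the cited analyses set out to explain.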

What the Theory Would Have to Predict

A theory that cannot predict scaling exponents from first principles cannot tell you, before training, what your loss curve slope will be. A theory that cannot explain progressive sharpening in nonlinear networks cannot tell you whether your learning rate schedule is safe at scale. These are the two questions that compute-efficiency and training-stability work actually turns on.

The paper frames these as open directions within a converging program. That is a reasonable framing. It is also unfalsifiable in the short run: any gap can be labeled an open direction and the program survives indefinitely without resolving it.

The Counter-Positions

Three objections survive Section 4’s rebuttals.

Scale is all. HN commenter sweezyjeezy² put it directly: “without the stupid amount of data that is available now, the architecture would be kind of irrelevant.” If giga-datasets are the irreducible complexity, and the gains of the past several years came from the data distribution rather than from any property of the optimizer, then a theory of learning dynamics is explaining a second-order effect. The paper treats data theory as a complementary concern (objection 4 in Section 4). A skeptic would call it the primary one.

The alchemy position. The Rahimi/Recht tradition, going back to the 2017 NeurIPS talk, argues that deep learning has become engineering folklore: practitioners running grid searches over heuristics, getting away with it, and calling the results science. The paper’s implicit target is this critique. The stronger version of the counter-position is not that deep learning is mysterious. It’s that the empirical engineering discipline is itself the science, and that a theory layer above it adds explanatory overhead without predictive gain. As commenter psyklic observed², practitioners “have little patience for research when the engineering is moving so quickly.”

Toy models aren’t LLMs. Deep linear networks are solvable. Transformers trained on internet-scale text with many thousands of post-training engineering hours are not. The paper’s rebuttal is that tractable models reveal mechanisms that transfer. The burden of proof is on the authors to demonstrate transfer, not on skeptics to assume it.

What Changes for Practitioners Through 2027

The position paper’s most practical near-term effect may be to give theorists a vocabulary for defending half-built theory in grant reviews. “We don’t know why this works” has always been a liability in review contexts; “this is an open direction within a converging program” is a better sentence. Whether that sentence reflects genuine progress or organizational packaging depends on which strands mature.

The hyperparameter-theory strand is the most likely to produce actionable engineering artifacts before 2027: calibrated learning-rate scaling rules, more principled warm-up schedules. Edge-of-stability dynamics are already informing choices about full-batch versus mini-batch regimes in academic labs; practitioner uptake is slower.

Generalization bounds remain largely decorative. The paper’s own framing¹ treats generalization as an open strand, not a solved one. If a team is using a PAC-Bayes bound to argue their model won’t overfit production data, that bound is probably not tight enough to be load-bearing.
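A rough calculation shows why such a bound is often not load-bearing. The sketch below evaluates one commonly quoted McAllester-style form, expected error ≤ training error + sqrt((KL + ln(2√n/δ)) / (2n)), for a loss bounded in [0, 1]. The sample size, training error, and KL values are hypothetical, chosen only to show how quickly the bound exceeds 1 and becomes vacuous.

```python
import math

def pac_bayes_bound(train_error, kl_nats, n_samples, delta=0.05):
    """McAllester-style upper bound on expected error for a loss in [0, 1]."""
    complexity = math.sqrt(
        (kl_nats + math.log(2.0 * math.sqrt(n_samples) / delta)) / (2.0 * n_samples)
    )
    return train_error + complexity

# All numbers below are hypothetical.
n = 50_000
train_error = 0.02

small_kl_bound = pac_bayes_bound(train_error, kl_nats=5_000.0, n_samples=n)
large_kl_bound = pac_bayes_bound(train_error, kl_nats=1_000_000.0, n_samples=n)

print(f"KL = 5e3 nats: bound = {small_kl_bound:.3f}")
print(f"KL = 1e6 nats: bound = {large_kl_bound:.3f}")
```

Even the smaller KL value leaves the guarantee an order of magnitude above the measured training error, and the larger one gives a bound above 1, which says nothing about an error rate.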

Where the Bet Is Most Likely to Lose

The most comfortable failure mode for “learning mechanics” is that it succeeds as a label without producing a single result that changes how a practitioner decides to train a model. That is what most theoretical physics subdisciplines look like from the outside; it is not obviously a disaster for the theorists involved.

The less comfortable failure mode: Open Directions 1 and 7 remain open long enough that the naming bid becomes a historical curiosity rather than a program. The NTK trajectory is instructive here. Jacot co-authored the original NTK paper in 2018. The paper he now co-authors¹ describes NTK as “not the right [limit] to study” before adding the fine-tuning carveout. Named, scoped down, partially rehabilitated. That arc is documented in the same paper now proposing the next synthesis.

“Learning mechanics” starts with a broader claim and more strands. Whether broader scope makes it more robust or just harder to falsify is the question the nine open directions in the 41-page paper¹ will eventually answer.

Frequently Asked Questions

Does the paper treat mean-field and NTK as equally viable tractable limits, or does it implicitly rank them?

The paper groups both under “tractable limits,” but this article’s body focuses almost exclusively on NTK’s failures. Mean-field limits reportedly suffer from an “empty-init problem” that the paper does not emphasize, which suggests the authors treat NTK as the more discussed, and more narrowly rehabilitated, limit despite the symmetric framing.

How does the “learning mechanics” launch differ from how the NTK theory was introduced in 2018?

NTK began as a single 2018 technical result with a precise mathematical definition. “Learning mechanics” is a 41-page, 14-author position paper released under CC BY 4.0, backed by a companion site aggregating essays and a forthcoming visual guide. Where NTK was a theorem, this is a multimedia campaign.

What are the five skeptical objections cataloged in Section 4, and which one does the body least address?

The paper lists: decades of failure, toy models being too primitive, wrong level of resolution, the need for a theory of data, and AI understanding itself first. This article’s body discusses the first four implicitly but never engages the fifth, that automated systems might outpace human theorists, despite its direct threat to the field’s funding rationale.

What does the CC BY 4.0 license imply for practitioners compared to a standard arXiv preprint?

Most arXiv preprints use a more restrictive license. The authors chose CC BY 4.0, which permits commercial reuse and adaptation provided attribution is given. That lowers the barrier for practitioners to translate the framework into training documentation, internal blog posts, or educational material, an unusual step for a theoretical position paper that signals an ambition for brand penetration beyond citation counts.

How might the polarized Hacker News reception shape the framework’s near-term trajectory?

The thread included commenter mathisfun123’s charge that “the only people for whom this is an open question are the academics.” That level of practitioner hostility on a 350-point front-page thread means “learning mechanics” is now publicly contested terrain. If the promised companion-site deliverables stall, the backlash will be louder than for a typical preprint.