Deep Learning Theory Is Broken — And Maybe Unfixable · history

Version 3

2026-04-30 04:31 UTC · 110 items

Narrative

The thread has evolved meaningfully since the last synthesis cycle. Most significantly, Jamie Simon and Daniel Kunin, the UC Berkeley authors of the 'learning mechanics' manifesto that LawrenceC critically reviewed, have now entered the public discourse directly: an Imbue podcast episode from April 24, 2026 features them arguing their case[1], and a YouTube presentation titled 'There Will Be a Scientific Theory of Deep Learning' from April 2026 carries the same optimistic banner[2]. A quick paper review on the Alignment Forum has also surfaced[3], which would suggest the community is beginning to engage with Simon and Kunin's claims directly rather than solely through LawrenceC's critical interpretation; its title, however, matches that of LawrenceC's own review[30], so it may be the same post rather than an independent voice. The podcast and presentation are the first evidence of a second substantive voice emerging in the discourse, though extracted claims from these sources are not yet available.

The new pipeline cycle has surfaced two additional theoretical research clusters not previously prominent. The most substantive is algorithmic stability, the framework that bounds generalization by measuring how much a learning algorithm's output changes when individual training examples are perturbed[4]. A dense cluster of SGD-stability papers has appeared, spanning fine-grained analyses[5][6], data-dependent bounds[7], Markov-chain SGD[8], momentum under heavy-tailed noise[9], nonsmooth convex losses[10], and a 2026 arXiv preprint on generalization bounds for SGD[11]. Algorithmic stability has a structural advantage over uniform convergence: it is inherently algorithm-dependent, directly addressing the Nagarajan-Kolter critique that uniform convergence ignores gradient descent's inductive bias[12]. The second cluster is mean-field analysis of overparameterized networks: a NeurIPS 2025 poster on 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks'[13] and NeurIPS 2024 mean-field symmetry work[14][15] represent further theoretical advances in the overparameterized regime, and the broader mean-field literature[16][17][18] has also deepened, adding frameworks for ResNets and GNNs.
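To make the stability idea concrete: in the standard replace-one formulation, an algorithm is called β-uniformly stable if swapping any single training example changes its loss on any point by at most β, and the expected generalization gap is then at most β. The sketch below is only an illustration under stated assumptions, not a method from any of the cited papers. It uses a hypothetical 1-D linear model, synthetic data, and illustrative helper names (`sgd_linear`, `make_data`, `stability_gap`). It measures the loss change for one particular swap on a finite probe set, which can only under-estimate the true β.

```python
import random

def sgd_linear(data, steps, lr, seed):
    """Run SGD on squared loss for a 1-D linear model y ~ w * x."""
    rng = random.Random(seed)  # fixed seed: same sampling order on both datasets
    w = 0.0
    n = len(data)
    for _ in range(steps):
        x, y = data[rng.randrange(n)]
        grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
        w -= lr * grad
    return w

def loss(w, point):
    """Squared loss of the model w on a single (x, y) point."""
    x, y = point
    return (w * x - y) ** 2

def make_data(n, seed):
    """Synthetic data (an assumption for illustration): y = 1.5 * x + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 1.5 * x + rng.gauss(0.0, 0.1)))
    return data

def stability_gap(train, neighbor, probe, steps=2000, lr=0.05, seed=0):
    """Largest loss change on the probe points between models trained on
    two datasets, using the same SGD sampling order for both runs."""
    w_s = sgd_linear(train, steps, lr, seed)
    w_n = sgd_linear(neighbor, steps, lr, seed)
    return max(abs(loss(w_s, z) - loss(w_n, z)) for z in probe)

probe = make_data(200, seed=1)
for n in (20, 200, 2000):
    train = make_data(n, seed=0)
    neighbor = list(train)
    neighbor[0] = (1.0, -1.5)  # replace one example with a point off the trend
    print(n, stability_gap(train, neighbor, probe))
```

The point of the exercise is that the measured quantity depends on the algorithm (step size, number of steps, sampling order) as well as on the data, which is exactly the dependence that uniform convergence over a whole hypothesis class does not capture.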

A subtler complication comes from 'Does learning require memorization?', a question that inverts Zhang et al.'s framing. Rather than treating memorization as evidence that generalization bounds are vacuous, this line of work asks whether memorization of tail examples is causally necessary for learning from long-tailed distributions[19]. If correct, this would reframe Zhang et al.'s finding from 'theory is broken because networks memorize' to 'theory was measuring the wrong thing, because memorization is part of learning.' Meanwhile, the Nagarajan-Kolter paper has attracted fresh attention via a LessWrong crosspost[20] and multiple mirror links[12][21][22][23][24][25][26][27], suggesting that LawrenceC's series is continuing to drive organic rediscovery of the original source material.
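The step from 'networks can fit random labels' to 'capacity-based bounds are vacuous' can be made explicit with a textbook Rademacher-complexity bound for binary classifiers with outputs in {-1, +1} and 0-1 loss. The notation below (H, S, σ, n, δ) is introduced here for illustration and is not taken from the cited posts; it is a sketch of the standard argument, not a restatement of Zhang et al.'s own derivation.

```latex
% Empirical Rademacher complexity of a hypothesis class H on a sample S = (x_1, ..., x_n):
\hat{\mathfrak{R}}_S(H)
  = \mathbb{E}_{\sigma}\Big[\, \sup_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i) \Big],
\qquad \sigma_i \in \{-1, +1\} \ \text{i.i.d. uniform}.

% Textbook bound: with probability at least 1 - \delta over S, for all h in H,
L(h) \;\le\; \hat{L}_S(h) + \hat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\ln(2/\delta)}{2n}}.

% If H can fit every labeling of S, the supremum equals 1 for every \sigma, so
\hat{\mathfrak{R}}_S(H) = 1
\quad\Longrightarrow\quad
\hat{L}_S(h) + \hat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\ln(2/\delta)}{2n}} \;\ge\; 1 .
```

Since the 0-1 error never exceeds 1, a right-hand side of at least 1 says nothing: once the class can fit arbitrary labels on the sample, this kind of bound cannot distinguish a network that generalizes from one that does not.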

The structural situation has improved but remains lopsided. Simon and Kunin are now active public advocates for their theory, and the Alignment Forum community is beginning to review their paper directly. Whether Simon and Kunin address LawrenceC's specific criticism — that learning mechanics cannot explain what specific algorithms individual networks learn — remains unknown without extracted claims. The five candidate replacement frameworks (benign overfitting, double descent, PAC-Bayes, NTK, algorithmic stability) continue to generate independent technical progress without converging or being integrated, and none has yet clearly answered the core challenge LawrenceC identified.

Timeline

2016-01-01: Zhang et al. demonstrate that standard neural networks can memorize completely random labels on CIFAR-10 and ImageNet, invalidating data-independent generalization bounds. [29]
2019-01-01: Nagarajan and Kolter show empirically that spectral-norm bounds scale in the wrong direction, and prove in an overparameterized linear setting that uniform convergence cannot explain gradient descent's generalization. [28][23][12]
2022-01-01: A NeurIPS 2022 paper claims PAC-Bayes compression bounds can be made tight enough to actually explain generalization in neural networks, directly challenging the narrative that all known bounds are vacuous. [31][32][33]
2024-01-01: NeurIPS 2024 paper on symmetries in overparameterized neural networks using a mean-field view offers a new structural lens on why overparameterization does not prevent generalization. [14][15]
2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales. [34]
2025-01-01: NeurIPS 2025 poster on 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' presents new theoretical results separating generalization dynamics from overfitting dynamics. [13]
2026-04-01: YouTube video 'There Will Be a Scientific Theory of Deep Learning' (Apr 2026) is published, likely corresponding to Simon and Kunin's learning mechanics presentation. [2]
2026-04-01: Alignment Forum quick paper review of 'There Will Be a Scientific Theory of Deep Learning' appears, representing a second community voice engaging directly with the learning mechanics paper. [3]
2026-04-24: Jamie Simon and Daniel Kunin (UC Berkeley) appear on Imbue's podcast arguing that a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [1]
2026-04-25: LawrenceC publishes a critical review of Simon et al.'s 'learning mechanics' manifesto on the Alignment Forum, welcoming its ambition while doubting it will deliver a comprehensive or broadly useful theory. [30]
2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory,' providing detailed technical and historical context for why Zhang et al. 2016 was so devastating to the classical generalization-bound paradigm. [29]
2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory,' narrating Nagarajan and Kolter 2019 as the definitive proof that uniform convergence cannot explain neural network generalization; the post is crossposted to LessWrong and drives renewed interest in the original paper. [28][20]

Perspectives

LawrenceC (Alignment Forum / LessWrong)

Classical deep learning theory was irreparably broken by two landmark papers (Zhang et al. 2016; Nagarajan & Kolter 2019). The proposed replacement, learning mechanics, is a promising manifesto but has so far produced little practical fruit beyond hyperparameter scaling, explicitly does not aim to explain the specific algorithms learned by networks, and has not yet earned the title of a comprehensive theory of deep learning.

Evolution: Consistent across all three posts in the series; his critique is now driving organic rediscovery of the Nagarajan-Kolter paper via LessWrong crosspost and multiple mirrors.

[28][29][30][20]

Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)

A scientific theory of deep learning is achievable; their 'learning mechanics' framework, grounded in average-case training dynamics and aggregate statistics, is the right approach. Publicly promoted via Imbue podcast (April 24, 2026) and YouTube presentation.

Evolution: Newly surfaced as a direct voice in the discourse. Previously known only through LawrenceC's critical review; now publicly advocating their position independently.

[1][2][3]

PAC-Bayes compression bounds researchers (NeurIPS 2022)

PAC-Bayes bounds can be made sufficiently tight to actually explain generalization in neural networks — the vacuousness of prior bounds was not a fundamental limit of the framework but an artifact of loose construction.

Evolution: Consistent; no direct engagement with LawrenceC's critique yet apparent.

[31][32][33]
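To make concrete what 'tight enough' versus 'vacuous' means, the sketch below evaluates one standard McAllester-style PAC-Bayes bound: with probability at least 1 - δ, the test error is at most the training error plus sqrt((KL + ln(2·sqrt(n)/δ)) / (2n)). This is a textbook form used for illustration only, not the specific bound of the cited NeurIPS 2022 paper. The function names and all numbers are hypothetical, chosen only to show the two regimes.

```python
import math

def pac_bayes_bound(train_error, kl_nats, n, delta=0.05):
    """McAllester-style bound on the test error of a posterior.

    train_error: empirical 0-1 error of the posterior, in [0, 1]
    kl_nats:     KL(posterior || prior), measured in nats
    n:           number of i.i.d. training examples
    delta:       allowed failure probability of the bound
    """
    complexity = (kl_nats + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return train_error + math.sqrt(complexity)

def is_vacuous(bound):
    """A bound on a 0-1 error carries no information once it reaches 1."""
    return bound >= 1.0

# Hypothetical numbers, chosen only to illustrate the two regimes.
n = 50_000
loose = pac_bayes_bound(train_error=0.02, kl_nats=1_000_000.0, n=n)
tight = pac_bayes_bound(train_error=0.10, kl_nats=5_000.0, n=n)
print(f"large KL: {loose:.3f} vacuous={is_vacuous(loose)}")
print(f"small KL: {tight:.3f} vacuous={is_vacuous(tight)}")
```

In this form the KL term dominates: a posterior that fits the training set almost perfectly but sits far from the prior yields a bound above 1, while one with a much smaller KL and somewhat higher training error yields a non-trivial guarantee. That trade-off is what a claim about loose construction versus a fundamental limit turns on.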

CMU CSD PhD Blog

Classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales.

Evolution: Consistent; potentially a significant counterpoint to LawrenceC's universal pessimism but has not attracted direct engagement.

[34]

Algorithmic stability / SGD stability researchers

Generalization in deep learning can be explained through the stability properties of SGD — how much the algorithm's output changes under small data perturbations. This approach is inherently algorithm- and data-dependent, directly addressing the Nagarajan-Kolter critique that uniform convergence ignores gradient descent's inductive bias.

Evolution: Newly prominent as a fifth candidate framework; a substantial cluster of papers spanning fine-grained stability, Markov-chain SGD, momentum under heavy-tailed noise, and nonsmooth losses has now surfaced.

[5][7][6][8][9][10][11][4]

Tensions

Can learning mechanics, which focuses on average-case training dynamics and coarse aggregate statistics, ever constitute a comprehensive theory of deep learning — or is it structurally limited to explaining some aspects while leaving others (especially what specific algorithms individual networks learn) permanently outside its scope? [30][1][2][3]
Seven years after Nagarajan and Kolter precisely diagnosed the failure of uniform convergence, no satisfactory algorithm- and data-dependent generalization theory has emerged. Algorithmic stability is the newest candidate with the right structural properties, but it is unclear whether it can be made tight enough to explain observed generalization in deep networks. [28][5][7][6][11][4]
Zhang et al.'s observation that fitting random labels requires only 1.5–3.5x more training steps than fitting the true labels challenged classical theory, but a separate line of work asks whether memorization of tail examples is not just compatible with but causally necessary for learning from long-tailed distributions. If so, was the theory measuring the wrong thing all along? [29][19]
PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while a separate line of work argues tighter bounds always come at a price. Which claim holds, and does either constitute a satisfying theoretical explanation rather than a post-hoc certificate? [31][32][35][33]
If classical generalization theory is more predictive for foundation models than for conventional deep networks, is the theory-failure diagnosis architecture-specific — and does that change the urgency or direction of the theory-building program? [34]
Benign overfitting results show that interpolating classifiers can generalize under certain geometric conditions, but those conditions are typically verified only in simplified settings. Does benign overfitting theory scale to the overparameterized, multi-layer, non-linear regime of modern LLMs? [36][37][38][39]
The NeurIPS 2025 result on dynamical decoupling of generalization and overfitting in large two-layer networks suggests the two may be separable, with distinct dynamics. The result could either support or complicate the learning mechanics program, depending on its implications for deeper networks. [13]

Sources

[1] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
[2] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
[3] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
[4] Stability (learning theory) — reactive:deep-learning-theory-limits
[5] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
[6] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
[7] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
[8] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
[9] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
[10] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
[11] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
[12] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[13] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[14] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
[15] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
[16] [PDF] a Mean-field Framework for Over-parameterized Deep Neural ... — reactive:deep-learning-theory-limits
[17] [PDF] Overparameterization of Deep ResNet: Zero Loss and Mean-field ... — reactive:deep-learning-theory-limits
[18] Generalization Error of Graph Neural Networks in the Mean-field ... — reactive:deep-learning-theory-limits
[19] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
[20] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
[21] uniform-convergence-NeurIPS19/uniform-convergence.ipynb at master · locuslab/uniform-convergence-NeurIPS19 — reactive:deep-learning-theory-limits
[22] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[23] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[24] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[25] Uniform convergence may be unable to explain generalization in deep learning | papers_we_read — reactive:deep-learning-theory-limits
[26] Uniform convergence may be unable to explain ... — reactive:deep-learning-theory-limits
[27] Uniform convergence may be unable to explain ... — reactive:deep-learning-theory-limits
[28] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
[29] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
[30] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
[31] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
[32] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
[33] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
[34] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits
[35] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
[36] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
[37] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
[38] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
[39] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits