Deep Learning Theory Is Broken — And Maybe Unfixable · history
Version 4
2026-04-30 12:04 UTC · 130 items
Narrative
The thread continues to expand in two directions at once: deeper attention to the 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' paper, and a widening debate over whether memorization is causally necessary for generalization. The dynamical decoupling result (previously noted as a NeurIPS 2025 poster[1]) now appears in several additional places: an OpenReview listing[2], Semantic Scholar indexing[3], an arXiv HTML version[4], and a YouTube talk[5]. These are listings and mirrors rather than citations, but together they suggest that this paper by Montanari and Urbani is gaining academic traction beyond preprint circulation. Its core claim, that generalization and overfitting dynamics decouple in large two-layer networks, is visibly attracting attention from the broader theoretical ML community.
The memorization-and-generalization cluster has deepened in a potentially significant way. Feldman's 'Does Learning Require Memorization?' paper[6][7][8] has been joined by a distinct OpenReview submission titled 'Is Memorization Actually Necessary for Generalization?'[9], a title that reads as a direct challenge to Feldman's affirmative conclusion. No claims have been extracted from this item, so its actual position is unconfirmed, but its existence suggests that the causal memorization hypothesis is now contested enough to prompt formal responses. Other items in the cluster (a YouTube talk on Feldman's paper[10], plus Simons Berkeley slides[11], an Apple ML research page[12], and Chiyuan Zhang's project page[13], all titled after the follow-up work 'What Neural Networks Memorize and Why') suggest that this line of work is also receiving a fresh wave of discussion, likely driven by LawrenceC's series framing memorization as central to the theory-failure story. Daniel Kunin's Twitter activity[14] is noted, but no content was extracted.
The structural picture is unchanged but sharpening: the dynamical decoupling result and the memorization debate are the two poles attracting the most new attention. The dynamical decoupling paper makes a clear structural argument: in the large two-layer regime, the dynamics that govern test error improvement and the dynamics that govern training loss interpolation are separable. If that separation extends to deeper networks, it would constitute a partial answer to Nagarajan and Kolter's challenge by offering an algorithm-dependent account of generalization. The memorization debate pulls in a different direction: if memorization is not merely compatible with but necessary for generalization on long-tailed distributions, then the Zhang et al. 2016 result is not evidence of theory failure but evidence that theory was optimizing the wrong objective. Neither question has been resolved, but both debates are actively attracting formal responses.
The noise items in this cycle — a Facebook post on technology complexity[15] and a generic Medium tutorial on overfitting prevention[16] — contain no relevant theoretical content and likely reflect search contamination. The overall discourse remains concentrated on the Alignment Forum / arXiv / OpenReview axis, with Simon and Kunin's learning mechanics program still the affirmative pole and LawrenceC's critical series still the primary lens through which the discourse is organized.
Timeline
- 2016-01-01: Zhang et al. demonstrate that standard neural networks can fit completely random labels on CIFAR-10 and ImageNet, showing that generalization bounds based only on model capacity, independent of the data, cannot explain why the same networks generalize on real labels. [18]
- 2019-01-01: Nagarajan and Kolter show empirically that spectral-norm bounds grow with training set size, the opposite of what a useful bound should do, and prove in an overparameterized linear setting that uniform convergence cannot explain the generalization of gradient descent. [17][44][45]
- 2019-06-01: Feldman publishes 'Does Learning Require Memorization? A Short Tale about a Long Tail,' arguing memorization of tail examples is causally necessary for learning from long-tailed distributions — reframing Zhang et al.'s finding as evidence that memorization is part of learning, not evidence of theory failure. [6][7][8][11][13][12]
- 2022-01-01: A NeurIPS 2022 paper claims PAC-Bayes compression bounds can be made tight enough to actually explain generalization in neural networks, directly challenging the narrative that all known bounds are vacuous. [27][28][29]
- 2024-01-01: NeurIPS 2024 paper on symmetries in overparameterized neural networks using a mean-field view offers a new structural lens on why overparameterization does not prevent generalization. [46][47]
- 2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales. [30]
- 2025-01-01: NeurIPS 2025 poster 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' (Montanari and Urbani) presents results separating generalization dynamics from overfitting dynamics in the large two-layer regime; the paper subsequently gains an OpenReview listing, Semantic Scholar indexing, and a YouTube talk, suggesting growing academic uptake. [1][2][3][4][5][24][25]
- 2026-04-24: Jamie Simon and Daniel Kunin (UC Berkeley) appear on Imbue's podcast arguing that a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [21]
- 2026-04-25: LawrenceC publishes a critical review of Simon et al.'s 'learning mechanics' manifesto on the Alignment Forum, welcoming its ambition while doubting it will deliver a comprehensive or broadly useful theory. [19]
- 2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory,' providing detailed technical and historical context for why Zhang et al. 2016 was so devastating to the classical generalization-bound paradigm. [18]
- 2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory,' narrating Nagarajan and Kolter 2019 as the definitive proof that uniform convergence cannot explain neural network generalization; the post is crossposted to LessWrong and drives renewed interest in the original paper. [17][20]
- 2026-04-01: YouTube video 'There Will Be a Scientific Theory of Deep Learning' (Apr 2026) is published, likely corresponding to Simon and Kunin's learning mechanics presentation; the title gives only the month, so the day is approximate. [22]
- 2026-04-01: A further listing of the Alignment Forum quick paper review of 'There Will Be a Scientific Theory of Deep Learning' appears; its title matches LawrenceC's review [19], so it may be a mirror of that post rather than a second community voice, and the date is likely approximate. [23]
- 2026-04-30: OpenReview submission 'Is Memorization Actually Necessary for Generalization?' appears; its title suggests a formal challenge to Feldman's affirmative claim, though no claims have been extracted from it yet. [9]
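The 2016 entry in the timeline turns on one observation: a sufficiently overparameterized model can fit labels that carry no signal at all. The sketch below illustrates that with a deliberately simple stand-in, a minimum-norm linear model on Gaussian features, not the neural networks or image datasets Zhang et al. used. The function name, sizes, and seed are hypothetical choices made for this illustration.

```python
import numpy as np


def fit_random_labels(n_train=50, n_features=200, n_test=2000, seed=0):
    """Fit pure-noise labels with an overparameterized linear model.

    Illustrative stand-in only: Zhang et al. trained neural networks on
    CIFAR-10 and ImageNet; here we use a linear model with more features
    than training points so the effect is visible in a few lines.
    """
    rng = np.random.default_rng(seed)

    # Random inputs and labels drawn independently of the inputs.
    X_train = rng.standard_normal((n_train, n_features))
    y_train = rng.choice([-1.0, 1.0], size=n_train)

    # Minimum-norm interpolating solution: with n_features > n_train the
    # system X_train @ w = y_train has exact solutions, and the
    # pseudo-inverse picks the one with the smallest norm.
    w = np.linalg.pinv(X_train) @ y_train
    train_acc = float(np.mean(np.sign(X_train @ w) == y_train))

    # Fresh data with fresh random labels: nothing could have been learned.
    X_test = rng.standard_normal((n_test, n_features))
    y_test = rng.choice([-1.0, 1.0], size=n_test)
    test_acc = float(np.mean(np.sign(X_test @ w) == y_test))

    return train_acc, test_acc


if __name__ == "__main__":
    train_acc, test_acc = fit_random_labels()
    print(f"train accuracy: {train_acc:.3f}")
    print(f"test accuracy:  {test_acc:.3f}")
```

Training accuracy is perfect while test accuracy sits near chance. A bound that depends only on the model class would have to cover this case too, which is why such bounds cannot by themselves explain good performance on real labels.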
Perspectives
LawrenceC (Alignment Forum / LessWrong)
Classical deep learning theory was irreparably broken by two landmark papers (Zhang et al. 2016; Nagarajan & Kolter 2019). The proposed replacement, learning mechanics, is a promising manifesto but has so far produced little practical fruit beyond hyperparameter scaling, explicitly does not aim to explain the specific algorithms learned by networks, and has not yet earned the title of a comprehensive theory of deep learning.
Evolution: Consistent across all three posts in the series; his critique is now driving organic rediscovery of the Nagarajan-Kolter paper via LessWrong crosspost and multiple mirrors, and also driving renewed attention to Feldman's memorization paper.
Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)
A scientific theory of deep learning is achievable; their 'learning mechanics' framework, grounded in average-case training dynamics and aggregate statistics, is the right approach. Publicly promoted via Imbue podcast (April 24, 2026), YouTube presentation, and ongoing Twitter activity.
Evolution: Newly surfaced as a direct voice in the discourse. Previously known only through LawrenceC's critical review; now publicly advocating their position independently via multiple channels. Kunin is also active on Twitter, though content is not yet extracted.
Montanari and Urbani (dynamical decoupling)
In the large two-layer network regime, generalization dynamics and overfitting dynamics are separable — a result that, if it extends to deeper networks, would provide an algorithm-dependent structural account of why networks generalize despite interpolating training data.
Evolution: Previously a single NeurIPS 2025 poster citation. The paper now also has an OpenReview listing, Semantic Scholar indexing, and a YouTube talk, suggesting growing uptake by the theoretical ML community.
Vitaly Feldman (memorization is necessary)
Memorization of tail examples is causally necessary for learning from long-tailed distributions. This reframes Zhang et al.'s result: memorization is part of learning, not evidence that theory is broken.
Evolution: Previously cited as a single item. The paper and its follow-up work are now receiving a fresh wave of attention (YouTube talk, Simons Berkeley slides, Apple ML page, ACM DL), likely driven by LawrenceC's series. An OpenReview submission, 'Is Memorization Actually Necessary for Generalization?', has now appeared; judging by its title, it is the first direct challenge in this cluster, though its claims have not yet been extracted.
PAC-Bayes compression bounds researchers (NeurIPS 2022)
PAC-Bayes bounds can be made sufficiently tight to actually explain generalization in neural networks — the vacuousness of prior bounds was not a fundamental limit of the framework but an artifact of loose construction.
Evolution: Consistent; no direct engagement with LawrenceC's critique yet apparent.
CMU CSD PhD Blog
Classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales.
Evolution: Consistent; potentially a significant counterpoint to LawrenceC's universal pessimism but has not attracted direct engagement.
Algorithmic stability / SGD stability researchers
Generalization in deep learning can be explained through the stability properties of SGD — how much the algorithm's output changes under small data perturbations. This approach is inherently algorithm- and data-dependent, directly addressing the Nagarajan-Kolter critique that uniform convergence ignores gradient descent's inductive bias.
Evolution: Consistent; a substantial cluster of papers spanning fine-grained stability, Markov-chain SGD, momentum under heavy-tailed noise, and nonsmooth losses continues to develop independently.
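For reference, the stability notion described in this perspective is usually formalized as uniform stability. The block below is a sketch of the textbook definition, not the exact condition used in any of the cited papers; the symbols (A for the learning algorithm, S and S' for training sets, ℓ for the loss, β for the stability constant) are notation assumed here for illustration.

```latex
% Uniform stability (textbook form; notation assumed, not taken from the sources).
% S and S' are training sets of size n that differ in exactly one example.
\[
  \sup_{z}\; \bigl|\, \ell(A(S), z) - \ell(A(S'), z) \,\bigr| \;\le\; \beta
  \qquad \text{for all such pairs } S, S'.
\]
% Consequence: the expected generalization gap is bounded by \beta,
% where R is the population risk and \hat{R}_S the empirical risk on S.
\[
  \Bigl|\, \mathbb{E}_{S}\bigl[\, R(A(S)) - \hat{R}_S(A(S)) \,\bigr] \Bigr| \;\le\; \beta .
\]
```

For a randomized algorithm such as SGD, the definition is typically stated in expectation over the algorithm's internal randomness. The bound depends on the algorithm A itself rather than on the size of a hypothesis class, which is what makes the approach algorithm-dependent.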
Tensions
- Can learning mechanics, which focuses on average-case training dynamics and coarse aggregate statistics, ever constitute a comprehensive theory of deep learning — or is it structurally limited to explaining some aspects while leaving others (especially what specific algorithms individual networks learn) permanently outside its scope? [19][21][22][23][14]
- Is memorization causally necessary for generalization on long-tailed distributions, or is it merely correlated? Feldman's affirmative argument now appears to be contested by an OpenReview submission ('Is Memorization Actually Necessary for Generalization?'), whose claims have not yet been extracted. The outcome reshapes how Zhang et al.'s 2016 result should be interpreted: either memorization signals theory failure, or it signals that theory was measuring the wrong objective. [6][7][8][9][10][18]
- Seven years after Nagarajan and Kolter diagnosed the failure of uniform convergence, no satisfactory algorithm- and data-dependent generalization theory has emerged. The dynamical decoupling result for two-layer networks and the algorithmic stability cluster are the newest candidates with the right structural properties, but neither has yet been shown to extend to the deep, multi-layer regime of modern LLMs. [17][1][4][2][5][31][32][33][37][38]
- PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while a separate line of work argues tighter bounds always come at a price. Which claim holds, and does either constitute a satisfying theoretical explanation rather than a post-hoc certificate? [27][28][39][29]
- If classical generalization theory is more predictive for foundation models than for conventional deep networks, is the theory-failure diagnosis architecture-specific — and does that change the urgency or direction of the theory-building program? [30]
- The NeurIPS 2025 dynamical decoupling result establishes that generalization and overfitting are separable phenomena in large two-layer networks. Does this separation persist in deeper networks, and if so, does it support or complicate the learning mechanics program's focus on aggregate training dynamics? [1][4][24][2][3][25][5]
- Benign overfitting results show that interpolating classifiers can generalize under certain geometric conditions, but those conditions are typically verified only in simplified settings. Does benign overfitting theory scale to the overparameterized, multi-layer, non-linear regime of modern LLMs? [40][41][42][43]
Sources
- [1] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
- [2] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks | OpenReview — reactive:deep-learning-theory-limits
- [3] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [4] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [5] Generalization and overfitting in two-layer neural networks - YouTube — reactive:deep-learning-theory-limits
- [6] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
- [7] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
- [8] [1906.05271] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
- [9] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
- [10] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
- [11] [PDF] Chasing the Long Tail: What Neural Networks Memorize and Why — reactive:deep-learning-theory-limits
- [12] What Neural Networks Memorize and Why: Discovering the Long ... — reactive:deep-learning-theory-limits
- [13] What Neural Networks Memorize and Why - Chiyuan Zhang — reactive:deep-learning-theory-limits
- [14] Daniel Kunin — reactive:deep-learning-theory-limits
- [15] Technology promised simplicity. It delivered complexity. - Facebook — reactive:deep-learning-theory-limits
- [16] How to Prevent Overfitting and Enhance Generalization in Deep ... — reactive:deep-learning-theory-limits
- [17] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
- [18] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
- [19] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
- [20] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
- [21] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
- [22] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
- [23] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
- [24] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [25] [PDF] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [26] [PDF] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
- [27] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
- [28] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
- [29] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
- [30] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits
- [31] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
- [32] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
- [33] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
- [34] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
- [35] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
- [36] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
- [37] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
- [38] Stability (learning theory) — reactive:deep-learning-theory-limits
- [39] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
- [40] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
- [41] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
- [42] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
- [43] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
- [44] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
- [45] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
- [46] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
- [47] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits