Deep Learning Theory Is Broken — And Maybe Unfixable · history
Version 5
2026-04-30 20:11 UTC · 169 items
Narrative
The thread's most consequential factual correction this cycle: the Montanari and Urbani dynamical decoupling paper was not a NeurIPS 2025 poster, as previously recorded, but a NeurIPS 2025 Oral[1], a distinction placing it in roughly the top 1–2% of submissions. This upgrade materially strengthens the paper's standing as the most technically accomplished recent response to the Nagarajan-Kolter challenge. Consistent with this assessment, Urbani delivered an invited talk at the Simons Institute (Berkeley) in February 2025[2], and the paper now has archival copies on HAL Science[3] and ar5iv[4] alongside multiple arXiv versions[5], a footprint that fits a result the theoretical ML community is treating as a landmark. Amplification on Twitter/X by @StatMLPapers[6] indicates it has also reached the broader ML conversation beyond conference circuits.
The memorization debate has moved from a single contested paper to a genuine research cluster. Three distinct OpenReview submissions now address the question of whether memorization is necessary for generalization[7][8][9] — each carrying a different submission ID, suggesting independent groups are converging on the same challenge to Feldman's 2019 thesis. More pointedly, a new ICLR 2025 paper titled 'When Memorization Hurts Generalization'[10] argues not merely that memorization is unnecessary but that it can actively damage generalization performance — a position considerably more adversarial to Feldman than prior critiques. If this result holds under scrutiny, it reframes the Zhang et al. 2016 finding yet again: memorization may not be a neutral or beneficial byproduct of learning on long-tailed data but a liability under certain conditions. A companion ACM survey on memorization in deep learning[11] signals that the community is now treating memorization as a first-class object of study, not a curiosity.
Two new formal results sharpen the theoretical picture. Yang et al. 2021 establishes an exact gap between generalization error and the tightest possible uniform convergence bound in random feature models[12][13] — giving Nagarajan and Kolter's empirical observation a precise algebraic formulation: it is not merely that current bounds are loose, but that uniform convergence is provably incapable of capturing the correct quantity by a measurable margin. Separately, a NeurIPS 2025 paper on diffusion models argues that implicit dynamical regularization prevents memorization in that architecture[14][15] — a result thematically close to the dynamical decoupling line, suggesting that training dynamics, not just architecture or data geometry, may be the unifying lens for understanding when and why modern networks avoid memorization. ICML 2025 has also accepted a paper rethinking benign overfitting in two-layer networks[16], continuing to probe the boundary conditions of the interpolation-generalization puzzle.
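For reference, the two quantities being compared in the exact-gap result can be written out as below. This is a sketch in generic notation assumed here for illustration, not the paper's own: $\mathcal{F}$ is the hypothesis class, $R$ the population risk, $\hat R_n$ the empirical risk on $n$ samples, and $\hat f$ the trained predictor.

```latex
% Generic notation, assumed for illustration (not taken from Yang et al.).
\[
  \underbrace{R(\hat f) - \hat R_n(\hat f)}_{\text{generalization gap of the trained predictor}}
  \;\le\;
  \underbrace{\sup_{f \in \mathcal{F}} \bigl| R(f) - \hat R_n(f) \bigr|}_{\text{uniform convergence bound over } \mathcal{F}}
\]
```

The inequality holds whenever $\hat f \in \mathcal{F}$. The exact-gap claim, as summarized above, is that in random feature models the right-hand side stays strictly larger than the left even for the tightest choice of $\mathcal{F}$, so the shortfall belongs to the approach rather than to a loose analysis.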
The discourse is visibly widening beyond the core generalization-bounds debate. An ICLR 2025 poster traces language models' generalization capabilities back to specific pretraining data[17][18][19], and a separate OpenReview submission asks whether transformers specifically seek patterns for memorization[20] — extending the memorization-generalization tension into the LLM regime where LawrenceC's series has argued the theory failures are most consequential. An arXiv survey on theory and mechanism of large language models[21] represents a parallel effort to systematize what is and is not understood about LLM behavior. The overall picture is one of a debate that started with two devastating impossibility results (Zhang et al. 2016, Nagarajan-Kolter 2019) and has now bifurcated: one branch is building positive theory around training dynamics (dynamical decoupling, implicit regularization, benign overfitting), while the other is contesting the empirical foundations of the memorization-as-necessary claim that had offered the most compelling reinterpretation of the original negative results.
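The first of those two results, that an overparameterized model can fit labels carrying no signal at all, can be seen in miniature. The sketch below is a toy illustration under stated assumptions, not Zhang et al.'s experiment (which, per the timeline, used standard neural networks on CIFAR-10 and ImageNet): it fits a minimum-norm linear readout on fixed random ReLU features, and every name and size in it (`random_relu_features`, `fit_min_norm`, `width`, and so on) is hypothetical.

```python
import numpy as np

def random_relu_features(x, w):
    """Map inputs through a fixed random ReLU layer (toy stand-in for a network)."""
    return np.maximum(x @ w, 0.0)

def fit_min_norm(features, targets):
    """Minimum-norm least-squares fit of the output weights via the pseudoinverse."""
    return np.linalg.pinv(features) @ targets

def accuracy(features, coef, labels):
    """Fraction of points whose predicted sign matches the +/-1 label."""
    return float(np.mean(np.sign(features @ coef) == labels))

rng = np.random.default_rng(0)
n_train, n_test, dim, width = 100, 2000, 20, 2000  # width >> n_train: overparameterized

x_train = rng.standard_normal((n_train, dim))
x_test = rng.standard_normal((n_test, dim))
# Labels are coin flips, independent of the inputs: there is nothing to learn.
y_train = rng.choice([-1.0, 1.0], size=n_train)
y_test = rng.choice([-1.0, 1.0], size=n_test)

w = rng.standard_normal((dim, width)) / np.sqrt(dim)
phi_train = random_relu_features(x_train, w)
phi_test = random_relu_features(x_test, w)

coef = fit_min_norm(phi_train, y_train)
train_acc = accuracy(phi_train, coef, y_train)
test_acc = accuracy(phi_test, coef, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

With far more features than training points, the readout interpolates the random training labels while test accuracy stays near chance. Any bound that depends only on the capacity of the model class must therefore cover this case as well, which is why such bounds cannot by themselves explain good generalization on real labels.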
Timeline
- 2016-01-01: Zhang et al. demonstrate that standard neural networks can memorize completely random labels on CIFAR-10 and ImageNet, invalidating data-independent generalization bounds. [23]
- 2019-01-01: Nagarajan and Kolter show empirically that spectral-norm bounds scale in the wrong direction, growing with training-set size rather than shrinking, and prove in an overparameterized linear setting that uniform convergence cannot explain gradient descent generalization. [22][58][59][67][68]
- 2019-06-01: Feldman publishes 'Does Learning Require Memorization? A Short Tale about a Long Tail,' arguing memorization of tail examples is causally necessary for learning from long-tailed distributions. [38][42][69][43][44][45][46]
- 2021-01-01: Yang et al. establish an exact algebraic gap between generalization error and the tightest possible uniform convergence bound in random feature models, giving Nagarajan-Kolter a precise formal complement. [12][13]
- 2022-01-01: A NeurIPS 2022 paper claims PAC-Bayes compression bounds can be made tight enough to actually explain generalization in neural networks, directly challenging the narrative that all known bounds are vacuous. [47][48][49]
- 2024-01-01: A NeurIPS 2024 paper on symmetries in overparameterized neural networks, taking a mean-field view, offers a new structural lens on why overparameterization does not prevent generalization. [70][71]
- 2025-01-01: A CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales. [72]
- 2025-01-01: ICLR 2025 paper 'When Memorization Hurts Generalization' argues memorization can actively damage generalization performance, a stronger claim than merely saying memorization is unnecessary. [10]
- 2025-01-01: NeurIPS 2025 Oral 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' (Montanari and Urbani) presents results separating generalization dynamics from overfitting dynamics in the large two-layer regime; oral status confirmed, placing it in the top ~1-2% of NeurIPS 2025 submissions. [31][32][33][34][35][36][37][1][5][4][3][6]
- 2025-01-01: NeurIPS 2025 paper 'Why Diffusion Models Don't Memorize' attributes diffusion models' resistance to memorization to implicit dynamical regularization during training, connecting training dynamics to memorization suppression. [14][15]
- 2025-01-01: ICML 2025 poster 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits the conditions under which interpolating classifiers can generalize in the two-layer setting. [16]
- 2025-01-01: ICLR 2025 poster 'Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data' extends the memorization-generalization debate to LLMs, examining how specific pretraining data drives capability versus memorization. [17][18][19]
- 2025-02-19: Pierfrancesco Urbani (CNRS) delivers an invited talk at the Simons Institute (Berkeley) on 'Generalization and overfitting in two-layer neural networks,' extending the dynamical decoupling result's reach to a major theory venue. [2]
- 2026-04-24: Jamie Simon and Daniel Kunin (UC Berkeley) appear on Imbue's podcast arguing that a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [27]
- 2026-04-25: LawrenceC publishes a critical review of Simon et al.'s 'learning mechanics' manifesto on the Alignment Forum, welcoming its ambition while doubting it will deliver a comprehensive or broadly useful theory. [24]
- 2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory,' providing detailed technical and historical context for why Zhang et al. 2016 was so devastating to the classical generalization-bound paradigm. [23][26]
- 2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory,' narrating Nagarajan and Kolter 2019 as the definitive proof that uniform convergence cannot explain neural network generalization; the post is crossposted to LessWrong and drives renewed interest in the original paper. [22][25]
- 2026-04-30: At least three distinct OpenReview submissions titled 'Is Memorization Actually Necessary for Generalization?' appear, representing independent formal challenges to Feldman's affirmative claim. [39][7][8][9][40]
Perspectives
LawrenceC (Alignment Forum / LessWrong)
Classical deep learning theory was irreparably broken by two landmark papers (Zhang et al. 2016; Nagarajan & Kolter 2019). The proposed replacement, learning mechanics, is a promising manifesto but has so far produced little practical fruit beyond hyperparameter scaling, explicitly does not aim to explain the specific algorithms learned by networks, and has not yet earned the title of a comprehensive theory of deep learning.
Evolution: Consistent across all three posts in the series. His framing continues to organize the broader discourse; the Alignment Forum mirror of the first post confirms ongoing amplification.
Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)
A scientific theory of deep learning is achievable; their 'learning mechanics' framework, grounded in average-case training dynamics and aggregate statistics, is the right approach. Publicly promoted via Imbue podcast (April 24, 2026), YouTube presentation, and ongoing Twitter activity.
Evolution: Consistent; no direct public response to LawrenceC's critique yet apparent.
Montanari and Urbani (dynamical decoupling)
In the large two-layer network regime, generalization dynamics and overfitting dynamics are separable — a result that, if it extends to deeper networks, would provide an algorithm-dependent structural account of why networks generalize despite interpolating training data.
Evolution: Previously cited as a 'NeurIPS 2025 poster.' Corrected this cycle: the paper received NeurIPS 2025 Oral status, placing it in the top ~1-2% of submissions. Urbani also gave an invited talk at the Simons Institute in February 2025. The paper is now archived across HAL Science and ar5iv in addition to arXiv.
Vitaly Feldman (memorization is necessary)
Memorization of tail examples is causally necessary for learning from long-tailed distributions. This reframes Zhang et al.'s result: memorization is part of learning, not evidence that theory is broken.
Evolution: Previously cited as the dominant reinterpretation of Zhang et al. Now under sustained formal challenge: at least three OpenReview submissions contest the necessity claim, and a distinct ICLR 2025 paper argues memorization can actively hurt generalization, a position more adversarial than prior critiques. The debate is not resolved.
ICLR 2025 'When Memorization Hurts Generalization' authors
Memorization is not merely unnecessary for generalization but can actively damage it — the strongest anti-Feldman position yet to appear in a peer-reviewed venue.
Evolution: New voice this cycle. Represents a qualitative escalation from 'memorization is not required' to 'memorization is harmful,' potentially the most consequential empirical challenge to Feldman's thesis.
Yang et al. 2021 (exact gap, uniform convergence)
The failure of uniform convergence in random feature models is not an artifact of loose bounds but is provably exact — there is a measurable algebraic gap between any uniform convergence bound and the true generalization error, even in principle.
Evolution: New entry this cycle. This ICML 2021 result formalizes and strengthens Nagarajan-Kolter's 2019 empirical and formal argument by providing an exact characterization of the shortfall.
NeurIPS 2025 diffusion model memorization researchers
Implicit dynamical regularization during training prevents memorization in diffusion models — a training-dynamics explanation for a phenomenon that would otherwise require architectural or data-geometric accounts.
Evolution: New voice this cycle. Thematically bridges the dynamical decoupling program and the memorization debate by attributing non-memorization to dynamics rather than capacity or data distribution.
PAC-Bayes compression bounds researchers (NeurIPS 2022)
PAC-Bayes bounds can be made sufficiently tight to actually explain generalization in neural networks — the vacuousness of prior bounds was not a fundamental limit of the framework but an artifact of loose construction.
Evolution: Consistent; no direct engagement with Yang et al.'s exact gap result or LawrenceC's critique yet apparent.
Algorithmic stability / SGD stability researchers
Generalization in deep learning can be explained through the stability properties of SGD. This approach is inherently algorithm- and data-dependent, directly addressing the Nagarajan-Kolter critique that uniform convergence ignores gradient descent's inductive bias.
Evolution: Consistent; a substantial cluster of papers spanning fine-grained stability, Markov-chain SGD, momentum under heavy-tailed noise, and nonsmooth losses continues to develop independently.
ICLR 2025 LLM memorization researchers
Language models' generalization capabilities can be traced back to specific pretraining data, implying a causal data-memorization link that extends Feldman's long-tail thesis to the LLM regime.
Evolution: New voice this cycle. Extends the memorization-generalization debate into transformer-scale models, where the theory stakes are highest.
Tensions
- Can learning mechanics, which focuses on average-case training dynamics and coarse aggregate statistics, ever constitute a comprehensive theory of deep learning — or is it structurally limited to explaining some aspects while leaving others (especially what specific algorithms individual networks learn) permanently outside its scope? [24][27][28][29][30]
- Is memorization causally necessary, causally harmful, or merely correlated with generalization? Feldman's affirmative claim is now contested by at least three OpenReview submissions and directly contradicted by ICLR 2025's 'When Memorization Hurts Generalization' — the field has not converged on whether these results apply to different regimes, different definitions of memorization, or are genuinely contradictory. [23][38][39][7][8][10][9][40]
- Yang et al. 2021 establish an exact algebraic gap between generalization error and uniform convergence in random feature models. Does this exact gap result extend to the non-linear, multi-layer transformer regime, and if so, does it permanently close off the uniform convergence program for modern LLMs? [22][12][13][58][59]
- The NeurIPS 2025 dynamical decoupling result (now confirmed as an Oral) establishes that generalization and overfitting are separable phenomena in large two-layer networks. Does this separation persist in deeper networks and in transformer architectures, and if so, does it support or complicate the learning mechanics program's focus on aggregate training dynamics? [31][32][34][37][2][1][5][4][3]
- If implicit dynamical regularization explains why diffusion models don't memorize, does the same mechanism operate in transformers and other architectures? And if training dynamics suppress memorization in some architectures, why does memorization still appear to be necessary or harmful in others? One possibility is that the memorization debate is architecture- and regime-specific rather than universal. [14][15][20][17][16]
- PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while Yang et al.'s exact gap result says uniform convergence-style arguments are provably incapable of capturing the correct quantity. Are PAC-Bayes compression bounds a genuine escape hatch from the Nagarajan-Kolter impossibility, or do they fall within the class of arguments the exact gap result forecloses? [47][48][60][49][12][13]
- Benign overfitting results show that interpolating classifiers can generalize under certain geometric conditions. ICML 2025's 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits these conditions — do they hold broadly enough to constitute a useful theory, or do they require fine-tuned assumptions that fail in realistic multi-layer, non-linear settings? [61][62][63][64][16][65][66]
Sources
- [1] NeurIPS Oral Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
- [2] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
- [3] [PDF] Dynamical decoupling of generalization and overfitting in large two ... — reactive:deep-learning-theory-limits
- [4] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [5] A Dynamical Theory of Overfitting and Generalization in Large Two-Layer Networks — reactive:deep-learning-theory-limits
- [6] Dynamical Decoupling of Generalization and Overfitting in Large... — reactive:deep-learning-theory-limits
- [7] Is Memorization Actually Necessary for Generalization - OpenReview — reactive:deep-learning-theory-limits
- [8] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
- [9] [PDF] Is Memorization Actually Necessary for Generalization? — reactive:deep-learning-theory-limits
- [10] [PDF] WHEN MEMORIZATION HURTS GENERALIZATION — reactive:deep-learning-theory-limits
- [11] Memorization in Deep Learning: A Survey - ACM Digital Library — reactive:deep-learning-theory-limits
- [12] [PDF] Exact Gap between Generalization Error and Uniform Convergence ... — reactive:deep-learning-theory-limits
- [13] [2103.04554] Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models — reactive:deep-learning-theory-limits
- [14] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
- [15] NeurIPS Poster Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
- [16] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
- [17] ICLR Poster Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
- [18] Generalization v.s. Memorization: Tracing Language Models'... — reactive:deep-learning-theory-limits
- [19] [PDF] generalization v.s. memorization - arXiv — reactive:deep-learning-theory-limits
- [20] For Better or for Worse, Transformers Seek Patterns for Memorization | OpenReview — reactive:deep-learning-theory-limits
- [21] Theory and Mechanism of Large Language Models - arXiv — reactive:deep-learning-theory-limits
- [22] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
- [23] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
- [24] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
- [25] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
- [26] The paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
- [27] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
- [28] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
- [29] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
- [30] Daniel Kunin — reactive:deep-learning-theory-limits
- [31] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
- [32] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [33] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [34] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks | OpenReview — reactive:deep-learning-theory-limits
- [35] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [36] [PDF] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [37] Generalization and overfitting in two-layer neural networks - YouTube — reactive:deep-learning-theory-limits
- [38] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
- [39] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
- [40] [PDF] Is Memorization Actually Necessary for Generalization? — reactive:deep-learning-theory-limits
- [41] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
- [42] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
- [43] [PDF] Chasing the Long Tail: What Neural Networks Memorize and Why — reactive:deep-learning-theory-limits
- [44] What Neural Networks Memorize and Why - Chiyuan Zhang — reactive:deep-learning-theory-limits
- [45] What Neural Networks Memorize and Why: Discovering the Long ... — reactive:deep-learning-theory-limits
- [46] Does Learning Require Memorization? A Short Tale About A Long Tail — reactive:deep-learning-theory-limits
- [47] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
- [48] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
- [49] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
- [50] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
- [51] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
- [52] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
- [53] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
- [54] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
- [55] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
- [56] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
- [57] Stability (learning theory) — reactive:deep-learning-theory-limits
- [58] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
- [59] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
- [60] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
- [61] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
- [62] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
- [63] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
- [64] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
- [65] NeurIPS Benign Overfitting in Out-of-Distribution Generalization of Linear Models — reactive:deep-learning-theory-limits
- [66] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
- [67] Uniform convergence may be unable to explain generalization in ... — reactive:deep-learning-theory-limits
- [68] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
- [69] [1906.05271] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
- [70] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
- [71] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
- [72] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits