Deep Learning Theory Is Broken — And Maybe Unfixable · history

Version 6

2026-04-30 21:09 UTC · 169 items

Narrative

The thread's discourse structure remains stable through this cycle: LawrenceC's three-part Alignment Forum series (published April 25–27, 2026[1][2][3]) continues to organize the wider conversation, the Montanari-Urbani dynamical decoupling paper retains its standing as the most technically accomplished positive result (NeurIPS 2025 Oral[4]), and the memorization debate has not resolved. The new item batch introduces no factual corrections to prior synthesis but does add contextual confirmation of the field's breadth. The Communications of the ACM publication of Zhang et al.'s rethinking-generalization result[5] signals that the 2016 finding has achieved textbook permanence rather than remaining a contested empirical claim, and a Stanford Mathematics department events page for Urbani's seminar on generalization in two-layer networks[6] indicates the Montanari-Urbani result is circulating in pure-math venues, not just ML conferences. A Medium commentary titled 'Learning Mechanics and the Second Formation of Deep Learning Theory'[7] introduces a secondary voice more sympathetic to Simon and Kunin's program than LawrenceC is. It frames learning mechanics not as a failed or incomplete project but as a genuine paradigm transition, a 'second formation' analogous to statistical mechanics succeeding Newtonian mechanics in physics. This framing is more ambitious than Simon and Kunin's own podcast claims[8] and sits in direct tension with LawrenceC's skepticism about practical utility[3].

The memorization cluster has not moved since last cycle. At least three independent OpenReview submissions contest Feldman's necessity claim[9][10][11], the ICLR 2025 'When Memorization Hurts Generalization' paper[12] remains the most adversarial peer-reviewed challenge, and the ACM survey on memorization in deep learning[13] marks the topic's elevation to first-class research object. The Yang et al. exact algebraic gap result[14][15] continues to stand as the strongest formal complement to Nagarajan-Kolter's empirical and PAC argument[16][17]: not merely that current bounds are loose, but that uniform convergence is provably incapable of closing the gap in random feature models. Whether this extends to transformer-scale nonlinear architectures remains open. The NeurIPS 2025 diffusion memorization paper[18][19] and ICML 2025 benign overfitting revisitation[20] continue to populate the positive-theory branch without resolving the central tensions.

The overall shape of the debate is now clearly bifurcated. One branch — dynamical decoupling[21][22], implicit dynamical regularization[18], algorithmic stability[23][24][25][26], and benign overfitting[27][28][20] — is building positive theory around training dynamics and algorithm-dependent arguments, directly addressing Nagarajan-Kolter's constraint that any valid theory must be algorithm- and data-dependent. The other branch is contesting the empirical foundations of memorization-as-necessary, which had been the most compelling reinterpretation of Zhang et al.'s original negative result. LawrenceC's series has placed both branches in public view for a non-specialist audience, and the secondary commentary (Medium[7], LessWrong mirrors[29], Alignment Forum[30]) confirms the discourse is spreading beyond ML researchers to the broader rationalist and AI-safety communities where theory failures are seen as having direct consequences for alignment.

Timeline

2016-01-01: Zhang et al. demonstrate that standard neural networks can memorize completely random labels on CIFAR-10 and ImageNet, invalidating data-independent generalization bounds. [2]
2019-01-01: Nagarajan and Kolter show empirically that spectral-norm bounds grow with training-set size, the wrong direction for a generalization bound, and prove in an overparameterized linear setting that uniform convergence cannot explain gradient descent generalization. [1][16][17][69][70]
2019-06-01: Feldman publishes 'Does Learning Require Memorization? A Short Tale about a Long Tail,' arguing memorization of tail examples is causally necessary for learning from long-tailed distributions. [44][48][71][49][50][51][52]
2021-01-01: Yang et al. establish an exact algebraic gap between generalization error and the tightest possible uniform convergence bound in random feature models, giving Nagarajan-Kolter a precise formal complement. [14][15]
2021-01-01: Communications of the ACM publishes the canonical journal version of Zhang et al.'s rethinking-generalization result, cementing it as a textbook-permanent finding rather than a contested empirical claim. [5]
2022-01-01: A NeurIPS 2022 paper claims PAC-Bayes compression bounds can be made tight enough to actually explain generalization in neural networks, directly challenging the narrative that all known bounds are vacuous. [53][54][55]
2024-01-01: NeurIPS 2024 paper on symmetries in overparameterized neural networks using a mean-field view offers a new structural lens on why overparameterization does not prevent generalization. [72][73]
2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales. [74]
2025-01-01: Urbani delivers a seminar at the Stanford Mathematics department on generalization in two-layer neural networks, confirming the Montanari-Urbani result is circulating in pure-math venues beyond ML conferences. [6]
2025-01-01: ICLR 2025 paper 'When Memorization Hurts Generalization' argues memorization can actively damage generalization performance, a stronger claim than merely saying memorization is unnecessary. [12]
2025-01-01: NeurIPS 2025 Oral 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' (Montanari and Urbani) presents results separating generalization dynamics from overfitting dynamics in the large two-layer regime; oral status confirmed, placing it in the top ~1-2% of NeurIPS 2025 submissions. [34][35][21][22][36][37][38][4][40][41][42][43]
2025-01-01: NeurIPS 2025 paper 'Why Diffusion Models Don't Memorize' attributes diffusion models' resistance to memorization to implicit dynamical regularization during training, connecting training dynamics to memorization suppression. [18][19]
2025-01-01: ICML 2025 poster 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits the conditions under which interpolating classifiers can generalize in the two-layer setting. [20]
2025-01-01: ICLR 2025 poster 'Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data' extends the memorization-generalization debate to LLMs, examining how specific pretraining data drives capability versus memorization. [60][61][62]
2025-02-19: Pierfrancesco Urbani (CNRS) delivers an invited talk at the Simons Institute (Berkeley) on 'Generalization and overfitting in two-layer neural networks,' extending the dynamical decoupling result's reach to a major theory venue. [39]
2026-04-24: Jamie Simon and Daniel Kunin (UC Berkeley) appear on Imbue's podcast arguing that a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [8]
2026-04-25: LawrenceC publishes a critical review of Simon et al.'s 'learning mechanics' manifesto on the Alignment Forum, welcoming its ambition while doubting it will deliver a comprehensive or broadly useful theory. [3]
2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory,' providing detailed technical and historical context for why Zhang et al. 2016 was so devastating to the classical generalization-bound paradigm. [2][30]
2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory,' narrating Nagarajan and Kolter 2019 as the definitive proof that uniform convergence cannot explain neural network generalization; the post is crossposted to LessWrong and drives renewed interest in the original paper. [1][29]
2026-04-30: At least three distinct OpenReview submissions titled 'Is Memorization Actually Necessary for Generalization?' appear, representing independent formal challenges to Feldman's affirmative claim. [45][9][10][11][46]

Perspectives

LawrenceC (Alignment Forum / LessWrong)

Classical deep learning theory was irreparably broken by two landmark papers (Zhang et al. 2016; Nagarajan & Kolter 2019). The proposed replacement, learning mechanics, is a promising manifesto but has so far produced little practical fruit beyond hyperparameter scaling, explicitly does not aim to explain the specific algorithms learned by networks, and has not yet earned the title of a comprehensive theory of deep learning.

Evolution: Consistent across all three posts in the series. His framing continues to organize the broader discourse; both the Alignment Forum and LessWrong mirrors confirm ongoing amplification into the AI-safety community.

[1][2][3][29][30]

Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)

A scientific theory of deep learning is achievable; their 'learning mechanics' framework, grounded in average-case training dynamics and aggregate statistics, is the right approach. They have promoted it publicly via the Imbue podcast (April 24, 2026), a YouTube presentation, and ongoing Twitter activity.

Evolution: Consistent; no direct public response to LawrenceC's critique yet apparent.

[8][31][32][33]

Independent Medium commentary ('second formation' framing)

Learning mechanics represents a genuine paradigm transition — a 'second formation' of deep learning theory analogous to statistical mechanics succeeding classical mechanics — not merely an incremental research program.

Evolution: New voice this cycle. More optimistic than Simon and Kunin's own public claims and directly contradicts LawrenceC's skepticism about practical utility. Represents independent amplification of the learning mechanics framing in a non-specialist venue.

[7]

Montanari and Urbani (dynamical decoupling)

In the large two-layer network regime, generalization dynamics and overfitting dynamics are separable — a result that, if it extends to deeper networks, would provide an algorithm-dependent structural account of why networks generalize despite interpolating training data.

Evolution: Previously cited as a 'NeurIPS 2025 poster.' Corrected in prior cycle: the paper received NeurIPS 2025 Oral status. This cycle adds a Stanford Mathematics department seminar as a further venue, confirming cross-disciplinary reach into pure mathematics.

[34][35][21][22][36][37][38][39][4][40][41][42][43][6]

Vitaly Feldman (memorization is necessary)

Memorization of tail examples is causally necessary for learning from long-tailed distributions. This reframes Zhang et al.'s result: memorization is part of learning, not evidence that theory is broken.

Evolution: Previously cited as the dominant reinterpretation of Zhang et al. Now under sustained formal challenge: at least three OpenReview submissions contest the necessity claim, and ICLR 2025's 'When Memorization Hurts Generalization' argues memorization can actively hurt performance. The debate has not resolved.

[44][45][9][10][12][11][46][47][48][49][50][51][52]

ICLR 2025 'When Memorization Hurts Generalization' authors

Memorization is not merely unnecessary for generalization but can actively damage it — the strongest anti-Feldman position yet to appear in a peer-reviewed venue.

Evolution: Consistent from prior cycle. Represents a qualitative escalation from 'memorization is not required' to 'memorization is harmful.'

[12]

Yang et al. 2021 (exact gap, uniform convergence)

The failure of uniform convergence in random feature models is not an artifact of loose bounds but is provably exact — there is a measurable algebraic gap between any uniform convergence bound and the true generalization error, even in principle.

Evolution: Consistent from prior cycle. Formalizes and strengthens Nagarajan-Kolter's 2019 empirical and PAC argument by providing an exact characterization of the shortfall.

[14][15]

NeurIPS 2025 diffusion model memorization researchers

Implicit dynamical regularization during training prevents memorization in diffusion models — a training-dynamics explanation for a phenomenon that would otherwise require architectural or data-geometric accounts.

Evolution: Consistent from prior cycle. Thematically bridges the dynamical decoupling program and the memorization debate.

[18][19]

PAC-Bayes compression bounds researchers (NeurIPS 2022)

PAC-Bayes bounds can be made sufficiently tight to actually explain generalization in neural networks — the vacuousness of prior bounds was not a fundamental limit of the framework but an artifact of loose construction.

Evolution: Consistent; no direct engagement with Yang et al.'s exact gap result yet apparent.

[53][54][55]

Algorithmic stability / SGD stability researchers

Generalization in deep learning can be explained through the stability properties of SGD. This approach is inherently algorithm- and data-dependent, directly addressing the Nagarajan-Kolter critique that uniform convergence ignores gradient descent's inductive bias.

Evolution: Consistent; a substantial cluster of papers spanning fine-grained stability, Markov-chain SGD, momentum under heavy-tailed noise, and nonsmooth losses continues to develop independently.

[23][56][24][25][26][57][58][59]

ICLR 2025 LLM memorization researchers

Language models' generalization capabilities can be traced back to specific pretraining data, implying a causal data-memorization link that extends Feldman's long-tail thesis to the LLM regime.

Evolution: Consistent from prior cycle. Extends the memorization-generalization debate into transformer-scale models.

[60][61][62][63]

Tensions

Can learning mechanics, which focuses on average-case training dynamics and coarse aggregate statistics, ever constitute a comprehensive theory of deep learning — or is it structurally limited to explaining some aspects while leaving others (especially what specific algorithms individual networks learn) permanently outside its scope? The new 'second formation' framing from independent commentary sharpens this: is the analogy to statistical mechanics apt, or does it overclaim? [3][8][31][32][33][7]
Is memorization causally necessary, causally harmful, or merely correlated with generalization? Feldman's affirmative claim is now contested by at least three OpenReview submissions and directly contradicted by ICLR 2025's 'When Memorization Hurts Generalization' — the field has not converged on whether these results apply to different regimes, different definitions of memorization, or are genuinely contradictory. [2][44][45][9][10][12][11][46]
Yang et al. 2021 establish an exact algebraic gap between generalization error and uniform convergence in random feature models. Does this exact gap result extend to the non-linear, multi-layer transformer regime, and if so, does it permanently close off the uniform convergence program for modern LLMs? [1][14][15][16][17]
The NeurIPS 2025 dynamical decoupling result (confirmed Oral, presented at both Simons Institute and Stanford Mathematics) establishes that generalization and overfitting are separable phenomena in large two-layer networks. Does this separation persist in deeper networks and in transformer architectures, and if so, does it support or complicate the learning mechanics program's focus on aggregate training dynamics? [34][35][22][38][39][4][40][41][42][6]
If implicit dynamical regularization explains why diffusion models don't memorize, does the same mechanism operate in transformers and other architectures? And if training dynamics suppress memorization in some architectures, why does memorization still appear necessary or harmful in others? One possibility is that the memorization debate is architecture- and regime-specific rather than universal. [18][19][63][60][20]
PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while Yang et al.'s exact gap result says uniform convergence-style arguments are provably incapable of capturing the correct quantity. Are PAC-Bayes compression bounds a genuine escape hatch from the Nagarajan-Kolter impossibility, or do they fall within the class of arguments the exact gap result forecloses? [53][54][64][55][14][15]
Benign overfitting results show that interpolating classifiers can generalize under certain geometric conditions. ICML 2025's 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits these conditions — do they hold broadly enough to constitute a useful theory, or do they require fine-tuned assumptions that fail in realistic multi-layer, non-linear settings? [27][65][28][66][20][67][68]

Sources

[1] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
[2] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
[3] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
[4] NeurIPS Oral Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[5] Understanding Deep Learning (Still) Requires Rethinking ... — reactive:deep-learning-theory-limits
[6] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
[7] Learning Mechanics and the Second Formation of Deep Learning ... — reactive:deep-learning-theory-limits
[8] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
[9] Is Memorization Actually Necessary for Generalization - OpenReview — reactive:deep-learning-theory-limits
[10] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
[11] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
[12] [PDF] WHEN MEMORIZATION HURTS GENERALIZATION — reactive:deep-learning-theory-limits
[13] Memorization in Deep Learning: A Survey - ACM Digital Library — reactive:deep-learning-theory-limits
[14] [PDF] Exact Gap between Generalization Error and Uniform Convergence ... — reactive:deep-learning-theory-limits
[15] [2103.04554] Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models — reactive:deep-learning-theory-limits
[16] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[17] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[18] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
[19] NeurIPS Poster Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
[20] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
[21] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[22] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks | OpenReview — reactive:deep-learning-theory-limits
[23] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
[24] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
[25] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
[26] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
[27] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
[28] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
[29] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
[30] The paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
[31] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
[32] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
[33] Daniel Kunin — reactive:deep-learning-theory-limits
[34] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[35] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[36] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[37] [PDF] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[38] Generalization and overfitting in two-layer neural networks - YouTube — reactive:deep-learning-theory-limits
[39] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
[40] A Dynamical Theory of Overfitting and Generalization in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[41] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[42] [PDF] Dynamical decoupling of generalization and overfitting in large two ... — reactive:deep-learning-theory-limits
[43] Dynamical Decoupling of Generalization and Overfitting in Large... — reactive:deep-learning-theory-limits
[44] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
[45] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
[46] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
[47] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
[48] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
[49] [PDF] Chasing the Long Tail: What Neural Networks Memorize and Why — reactive:deep-learning-theory-limits
[50] What Neural Networks Memorize and Why - Chiyuan Zhang — reactive:deep-learning-theory-limits
[51] What Neural Networks Memorize and Why: Discovering the Long ... — reactive:deep-learning-theory-limits
[52] Does Learning Require Memorization? A Short Tale About A Long Tail — reactive:deep-learning-theory-limits
[53] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
[54] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
[55] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
[56] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
[57] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
[58] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
[59] Stability (learning theory) — reactive:deep-learning-theory-limits
[60] ICLR Poster Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
[61] Generalization v.s. Memorization: Tracing Language Models'... — reactive:deep-learning-theory-limits
[62] [PDF] generalization v.s. memorization - arXiv — reactive:deep-learning-theory-limits
[63] For Better or for Worse, Transformers Seek Patterns for Memorization | OpenReview — reactive:deep-learning-theory-limits
[64] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
[65] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
[66] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
[67] NeurIPS Benign Overfitting in Out-of-Distribution Generalization of Linear Models — reactive:deep-learning-theory-limits
[68] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
[69] Uniform convergence may be unable to explain generalization in ... — reactive:deep-learning-theory-limits
[70] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[71] [1906.05271] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
[72] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
[73] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
[74] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits