Deep Learning Theory Is Broken — And Maybe Unfixable · history

Version 8

2026-05-01 13:04 UTC · 243 items

Narrative

The discourse this cycle crosses a threshold: the April 2026 preprint "There Will Be a Scientific Theory of Deep Learning" has moved from mere indexing to active community engagement. A LessWrong "Quick Paper Review" now appears on the same platform where LawrenceC's foundational critique posts live[1], and a Reddit r/MachineLearning thread documents the broader ML research community's reception[2]. A YouTube presentation associated with the preprint[3] and multiple aggregation sites confirm wide circulation within weeks of its late-April 2026 submission[4][5][6]. The LessWrong hosting is the structurally significant development: it places the formal academic optimistic rebuttal directly in LawrenceC's home venue, creating a confrontation between the two framings within the same community platform and closing a circuit that had been open since the paper's submission.

In parallel, mechanistic interpretability has matured from institutional recognition to self-organizing research infrastructure. Beyond MIT's 2026 Breakthrough designation and the ACM survey tracked in the previous cycle, a dedicated ICML 2026 workshop[7] and a GitHub gist "Open problems in mechanistic interpretability: 2026 status report"[8] document construction of community apparatus — program committees, curated open problem taxonomies — that statistical generalization theory never developed at a comparable stage. MIT News coverage of research on improving AI models' ability to explain their predictions[9] signals science-communication investment in interpretability that the generalization theory program lacks entirely. The contrast is now structurally visible and widening: mechanistic interpretability has a tier-1 conference workshop, a public open problems list, institutional press coverage, and growing educational infrastructure, while the generalization theory program operates through technical preprints and conference papers without equivalent organizational scaffolding.

Two new angles complicate the memorization sub-debate. "Training Data Pruning Improves Memorization of Facts" (arxiv 2604.08519, April 2026)[10] presents a counterintuitive finding: selective reduction of training data can improve how well models memorize individual facts. This implies data redundancy may suppress rather than reinforce fact memorization — a challenge to the frequency-driven account implicit in much of the Feldman necessity debate and in discussions of why long-tail examples require memorization. The OpenReview paper "What is the role of memorization in Continual Learning?"[11] extends the debate to multi-task temporal settings, where the question is not memorize-vs-generalize but memorize-without-forgetting-across-tasks — a failure mode with distinct structure from the single-task Feldman formulation. A NeurIPS 2025 poster on understanding the evolution of the Neural Tangent Kernel[12] also enters the synthesis for the first time: by tracking how the NTK changes during training rather than treating it as a fixed linearization, this work engages the "lazy training vs. feature learning" distinction that learning mechanics treats as the core diagnostic for whether classical theory can capture deep network behavior.

The thread's overall shape is clarifying. Founding pessimism — Zhang et al.'s result[13] amplified by Nagarajan-Kolter[14] — now faces a formal academic rebuttal[15] that has reached the communities that originally broadcast the pessimism, with the LessWrong review[1] completing a circuit back to the source. Mechanistic interpretability has enough infrastructure to constitute a genuine parallel paradigm, not merely a competing research interest. The memorization debate, having accumulated grokking temporal dynamics, compositional mechanisms, differential privacy implications, continual learning extensions, and training data pruning counterintuitions, is approaching a point where "memorization" may name a family of distinct phenomena — with different mechanisms, different implications for theory, and different empirical profiles — rather than a single unified claim about the relationship between learning and generalization.

Timeline

2016-01-01: Zhang et al. demonstrate that standard neural networks can memorize completely random labels on CIFAR-10 and ImageNet, showing that data-independent generalization bounds cannot explain why the same networks generalize on real labels. [13]
2019-01-01: Nagarajan and Kolter show empirically that spectral-norm bounds scale in the wrong direction, growing with training set size, and prove in an overparameterized linear setting that uniform convergence is insufficient to explain gradient descent generalization. [14][97][98][107][108]
2019-06-01: Feldman publishes 'Does Learning Require Memorization? A Short Tale about a Long Tail,' arguing memorization of tail examples is causally necessary for learning from long-tailed distributions. [42][50][109][51][52][53][54]
2020-07-01: Negrea et al. publish 'In Defense of Uniform Convergence' at ICML 2020, arguing that uniform convergence can be partially recovered through derandomization applied to interpolating classifiers — a counterpoint to Nagarajan-Kolter previously untracked in this synthesis. [61][62][63]
2021-01-01: Yang et al. establish an exact algebraic gap between generalization error and the tightest possible uniform convergence bound in random feature models, giving Nagarajan-Kolter a precise formal complement. [64][65]
2021-01-01: Communications of the ACM publishes the canonical journal version of Zhang et al.'s rethinking-generalization result, cementing it as a textbook-permanent finding rather than a contested empirical claim. [110]
2022-01-01: A NeurIPS 2022 paper claims PAC-Bayes compression bounds can be made tight enough to actually explain generalization in neural networks, directly challenging the narrative that all known bounds are vacuous. [72][73][74]
2022-01-01: NeurIPS 2022 paper 'Towards Understanding Grokking: An Effective Theory of Representation Learning' provides an effective-theory account of delayed generalization, framing grokking as a consequence of representation learning dynamics. [94]
2023-07-01: Wu et al. publish 'The Implicit Regularization of Dynamical Stability in SGD' at ICML 2023, showing that SGD's dynamical stability provides an implicit regularizer that suppresses memorization — bridging the algorithmic stability and implicit regularization research strands. [66][67]
2024-01-01: NeurIPS 2024 paper on symmetries in overparameterized neural networks using a mean-field view offers a new structural lens on why overparameterization does not prevent generalization. [111][112]
2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales. [113]
2025-01-01: Urbani delivers a seminar at the Stanford Mathematics department on generalization in two-layer neural networks, confirming the Montanari-Urbani result is circulating in pure-math venues beyond ML conferences. [41]
2025-01-01: ICLR 2025 paper 'When Memorization Hurts Generalization' argues memorization can actively damage generalization performance — a stronger claim than merely saying memorization is unnecessary. [46][58][59][60]
2025-01-01: NeurIPS 2025 Oral 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' (Montanari and Urbani) presents results separating generalization dynamics from overfitting dynamics in the large two-layer regime; oral status confirmed, placing it in the top ~1-2% of NeurIPS 2025 submissions. [28][29][30][31][32][33][34][36][37][38][39][40]
2025-01-01: NeurIPS 2025 paper 'Why Diffusion Models Don't Memorize' attributes diffusion models' resistance to memorization to implicit dynamical regularization during training, connecting training dynamics to memorization suppression. [68][69][70][71]
2025-01-01: ICML 2025 poster 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits the conditions under which interpolating classifiers can generalize in the two-layer setting. [99]
2025-01-01: ICLR 2025 poster 'Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data' extends the memorization-generalization debate to LLMs, examining how specific pretraining data drives capability versus memorization. [89][90][91][93]
2025-01-01: 'Necessary Memorization in Overparameterized Learning under Long-Tailed Mixture Models: Theory and Privacy Implications' provides formal theoretical support for Feldman's necessity claim in a specific mathematical setting, and introduces differential privacy leakage as a downstream consequence of memorization. [55]
2025-01-01: NeurIPS 2025 poster 'Understanding the Evolution of the Neural Tangent Kernel' tracks how the NTK changes during training rather than treating it as a fixed linearization, engaging the 'lazy training vs. feature learning' distinction central to the learning mechanics critique of classical theory. [12]
2025-02-19: Pierfrancesco Urbani (CNRS) delivers an invited talk at the Simons Institute (Berkeley) on 'Generalization and overfitting in two-layer neural networks,' extending the dynamical decoupling result's reach to a major theory venue. [35]
2025-10-01: 'Memorizing Long-tail Data Can Help Generalization Through Composition' (arxiv 2510.16322) proposes that the mechanism by which tail memorization aids generalization is compositional rather than coverage-based, offering a mechanistic refinement of Feldman's thesis. [56][57]
2026-01-01: MIT names mechanistic interpretability as its 2026 Breakthrough of the Year, institutionally recognizing circuit-level understanding as a viable alternative program to statistical generalization theory for understanding deep learning. [83][87][88][114]
2026-01-01: A dedicated ICML 2026 workshop on mechanistic interpretability is announced, and an 'Open problems in mechanistic interpretability: 2026 status report' is published as a GitHub gist — signaling the field's transition from institutional recognition to self-organizing research infrastructure with program committees and curated open problem taxonomies. [7][8]
2026-03-01: MIT News covers research on improving AI models' ability to explain their predictions, marking science-communication investment in interpretability research that the statistical generalization theory program has not received. [9]
2026-04-01: 'Grokking as Dimensional Phase Transition in Neural Networks' (arxiv 2604.04655) introduces a new theoretical framework for delayed generalization, framing the memorization-to-generalization transition as a phase transition in the dimensionality of learned representations. [95]
2026-04-01: 'Training Data Pruning Improves Memorization of Facts' (arxiv 2604.08519) presents a counterintuitive April 2026 finding: selective reduction of training data can improve how well models memorize individual facts, suggesting data redundancy may suppress rather than reinforce fact memorization. [10]
2026-04-24: Jamie Simon and Daniel Kunin (UC Berkeley) appear on Imbue's podcast arguing that a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [21]
2026-04-25: LawrenceC publishes a critical review of Simon et al.'s 'learning mechanics' manifesto on the Alignment Forum, welcoming its ambition while doubting it will deliver a comprehensive or broadly useful theory. [16]
2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory,' providing detailed technical and historical context for why Zhang et al. 2016 was so devastating to the classical generalization-bound paradigm. [13][18]
2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory,' narrating Nagarajan and Kolter 2019 as the definitive proof that uniform convergence cannot explain neural network generalization; the post is crossposted to LessWrong and drives renewed interest in the original paper. [14][17][19]
2026-04-28: 'There Will Be a Scientific Theory of Deep Learning' (arxiv 2604.21691) appears as a late-April 2026 preprint offering the first formal academic-paper-level optimistic response to the thread's founding pessimism, its title directly inverting LawrenceC's framing. [15][25]
2026-04-30: At least three distinct OpenReview submissions titled 'Is Memorization Actually Necessary for Generalization?' appear, representing independent formal challenges to Feldman's affirmative claim. [43][44][45][47][48]
2026-05-01: A LessWrong 'Quick Paper Review' of 'There Will Be a Scientific Theory of Deep Learning' appears on the same platform as LawrenceC's original critique posts, and a Reddit r/MachineLearning thread opens discussion of the preprint — marking its transition from indexed to actively engaged within the communities that originally amplified the pessimistic framing. [1][2]

Perspectives

LawrenceC (Alignment Forum / LessWrong) — confirmed as Lawrence Chan

Classical deep learning theory was irreparably broken by two landmark papers (Zhang et al. 2016; Nagarajan & Kolter 2019). The proposed replacement, learning mechanics, is a promising manifesto but has so far produced little practical fruit beyond hyperparameter scaling (μP), explicitly does not aim to explain the specific algorithms learned by networks, and has not yet earned the title of a comprehensive theory of deep learning.

Evolution: Consistent across all three posts. This cycle: the preprint 'There Will Be a Scientific Theory of Deep Learning' has now received a LessWrong review in LawrenceC's home venue, creating a direct confrontation between his framing and the academic optimistic response within the same community platform.

[14][13][16][17][18][19][20]

Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)

A scientific theory of deep learning is achievable; their 'learning mechanics' framework, grounded in average-case training dynamics and aggregate statistics, is the right approach. Publicly promoted via Imbue podcast (April 24, 2026), YouTube presentation, and ongoing Twitter activity.

Evolution: Consistent; no direct public response to LawrenceC's critique yet apparent. The NeurIPS 2025 NTK evolution poster is consistent with the feature-learning vs. lazy-training axis that learning mechanics treats as central, suggesting adjacent theoretical work is converging on the same diagnostic.

[21][22][23][24][12]

'There Will Be a Scientific Theory of Deep Learning' authors (April 2026 preprint)

A scientific theory of deep learning is achievable — the title directly inverts the pessimistic framing of LawrenceC's series and the thread's organizing premise. The formal academic-paper format marks this as the first non-blog-post optimistic response in this cycle.

Evolution: Previously had no indexed commentary. This cycle: a LessWrong 'Quick Paper Review' and a Reddit r/MachineLearning thread confirm active community engagement; a YouTube presentation circulates alongside the paper. The preprint has moved from indexed to actively discussed within the communities it was implicitly addressing.

[15][4][25][5][26][3][6][1][2]

Independent Medium commentary ('second formation' framing)

Learning mechanics represents a genuine paradigm transition — a 'second formation' of deep learning theory analogous to statistical mechanics succeeding classical mechanics — not merely an incremental research program.

Evolution: Consistent from prior cycle. More optimistic than Simon and Kunin's own public claims and directly contradicts LawrenceC's skepticism about practical utility.

[27]

Montanari and Urbani (dynamical decoupling)

In the large two-layer network regime, generalization dynamics and overfitting dynamics are separable — a result that, if it extends to deeper networks, would provide an algorithm-dependent structural account of why networks generalize despite interpolating training data.

Evolution: Consistent from prior cycles. NeurIPS 2025 Oral status and cross-disciplinary reach (Simons Institute, Stanford Mathematics) confirmed in prior cycle; no new developments this cycle.

[28][29][30][31][32][33][34][35][36][37][38][39][40][41]

Vitaly Feldman (memorization is necessary)

Memorization of tail examples is causally necessary for learning from long-tailed distributions. This reframes Zhang et al.'s result: memorization is part of learning, not evidence that theory is broken.

Evolution: Previously under challenge from three OpenReview submissions and ICLR 2025's 'When Memorization Hurts Generalization,' while receiving partial formal corroboration under specific distributional assumptions and a compositional mechanism account. This cycle: training data pruning may improve memorization of individual facts, which complicates the frequency-driven account, and the continual learning extension raises the question of whether memorization necessity holds across task boundaries. Both add new sub-cases without resolving the core claim.

[42][43][44][45][46][47][48][49][50][51][52][53][54][55][56][57][10][11]

ICLR 2025 'When Memorization Hurts Generalization' authors

Memorization is not merely unnecessary for generalization but can actively damage it — the strongest anti-Feldman position yet to appear in a peer-reviewed venue.

Evolution: Consistent from prior cycle.

[46][58][59][60]

Negrea et al. (defense of uniform convergence via derandomization)

Uniform convergence arguments can be partially recovered through derandomization techniques applied to interpolating classifiers, suggesting the Nagarajan-Kolter impossibility is not a blanket closure of the uniform convergence program.

Evolution: First tracked in prior cycle. This cycle: ICML 2020 proceedings URL and arxiv abstract confirmed via additional indexing. Whether this result engages or sidesteps Yang et al.'s exact gap result remains unresolved.

[61][62][63]

Yang et al. 2021 (exact gap, uniform convergence)

The failure of uniform convergence in random feature models is not an artifact of loose bounds but is provably exact — there is a measurable algebraic gap between any uniform convergence bound and the true generalization error, even in principle.

Evolution: Consistent from prior cycle. The newly confirmed Negrea et al. defense of uniform convergence via derandomization creates an unresolved question about whether the exact gap forecloses derandomized arguments.

[64][65]
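
For orientation, the two quantities this perspective compares can be written in standard learning-theory notation. The symbols below are assumptions introduced here for illustration and are not taken from Yang et al.'s paper.

```latex
% Assumed notation: S is a training sample of size n, h_S the predictor
% learned from S, R the population risk, \hat{R}_S the empirical risk on S,
% and \mathcal{H} the hypothesis class over which the supremum is taken.
\[
  \underbrace{R(h_S) - \hat{R}_S(h_S)}_{\text{generalization gap}}
  \;\le\;
  \underbrace{\sup_{h \in \mathcal{H}} \bigl| R(h) - \hat{R}_S(h) \bigr|}_{\text{uniform convergence bound}}
\]
```

Read this way, an "exact gap" claim is a statement about how far the right-hand side stays above the left-hand side, rather than about any particular bound being loosely constructed.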

Wu et al. ICML 2023 (implicit regularization of dynamical stability, SGD)

SGD's dynamical stability provides an implicit regularizer that suppresses memorization and promotes generalization — a mechanistic account connecting training-dynamics arguments to the diffusion models result and the dynamical decoupling program.

Evolution: Consistent from prior cycle.

[66][67]

NeurIPS 2025 diffusion model memorization researchers

Implicit dynamical regularization during training prevents memorization in diffusion models — a training-dynamics explanation for a phenomenon that would otherwise require architectural or data-geometric accounts.

Evolution: Consistent from prior cycle.

[68][69][70][71]

PAC-Bayes compression bounds researchers (NeurIPS 2022)

PAC-Bayes bounds can be made sufficiently tight to actually explain generalization in neural networks — the vacuousness of prior bounds was not a fundamental limit of the framework but an artifact of loose construction.

Evolution: Consistent; no direct engagement with Yang et al.'s exact gap result or Negrea et al.'s derandomization argument yet apparent.

[72][73][74]
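
To make "vacuous" concrete, here is a minimal sketch that evaluates a McAllester-style PAC-Bayes bound in Maurer's form. This particular bound is an assumption chosen for illustration; the NeurIPS 2022 paper may use a different one, and the numbers below are hypothetical rather than results from any cited work.

```python
import math

def pac_bayes_bound(emp_risk, kl, n, delta=0.05):
    """McAllester-style PAC-Bayes bound (Maurer's form) on expected 0-1 risk.

    emp_risk: empirical risk of the posterior Q on n samples, in [0, 1]
    kl:       KL(Q || P) in nats, where the prior P is fixed before seeing data
    n:        number of training samples
    delta:    the bound holds with probability at least 1 - delta
    """
    complexity = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return emp_risk + math.sqrt(complexity)

def is_vacuous(bound):
    # A bound on a 0-1 risk carries no information once it reaches 1.
    return bound >= 1.0

# Hypothetical numbers, chosen only to show the two regimes.
n = 50_000
compressed = pac_bayes_bound(emp_risk=0.02, kl=5_000.0, n=n)
uncompressed = pac_bayes_bound(emp_risk=0.02, kl=1_000_000.0, n=n)
print(f"compressed:   {compressed:.3f}  vacuous={is_vacuous(compressed)}")
print(f"uncompressed: {uncompressed:.3f}  vacuous={is_vacuous(uncompressed)}")
```

The only lever in this sketch is the KL term: a posterior that can be described cheaply relative to the prior keeps the bound below 1, which is the sense in which compression is claimed to rescue the framework.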

Algorithmic stability / SGD stability researchers

Generalization in deep learning can be explained through the stability properties of SGD. This approach is inherently algorithm- and data-dependent, directly addressing the Nagarajan-Kolter critique that uniform convergence ignores gradient descent's inductive bias.

Evolution: Consistent; Wu et al. ICML 2023 bridges this cluster to the implicit regularization and diffusion memorization strands, providing a shared mechanistic account.

[75][76][77][78][79][80][81][82][66]

Mechanistic interpretability community (MIT 2026 Breakthrough)

Understanding deep learning through circuit-level mechanistic analysis of what algorithms individual networks implement is a viable — and now institutionally recognized — alternative to statistical generalization theory.

Evolution: Previously anchored by MIT's Breakthrough designation and an ACM survey. This cycle: a dedicated ICML 2026 workshop and a 2026 open problems status report document the transition from recognition to self-organizing research infrastructure. MIT News coverage of AI explanation research signals science-communication investment that the generalization theory program lacks.

[83][84][85][86][87][88][7][8][9]

ICLR 2025 LLM memorization researchers

Language models' generalization capabilities can be traced back to specific pretraining data, implying a causal data-memorization link that extends Feldman's long-tail thesis to the LLM regime.

Evolution: Consistent from prior cycle.

[89][90][91][92][93]

NTK evolution researchers (NeurIPS 2025)

The Neural Tangent Kernel is not static during training. Understanding how it evolves connects the classical NTK linearization program to the feature learning regime that learning mechanics treats as the operative one in practice.

Evolution: New voice this cycle, entering via a NeurIPS 2025 poster. The NTK program had been absent from prior synthesis rounds despite being a major theoretical framework. That it enters through work on NTK evolution rather than the static NTK suggests the program is responding to the feature-learning critique rather than defending the original lazy-training framing.

[12]
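
As a rough illustration of what "tracking how the NTK changes" can mean, the sketch below computes the empirical NTK of a small two-layer tanh network before and after gradient descent and reports the relative change. The architecture, scaling, data, and hyperparameters are all assumptions made for this example and are not taken from the NeurIPS 2025 poster.

```python
import numpy as np

def forward(a, W, X):
    # f(x) = a . tanh(W x) / sqrt(m), with m hidden units (NTK-style scaling).
    return np.tanh(X @ W.T) @ a / np.sqrt(a.shape[0])

def jacobian(a, W, X):
    """Per-example gradient of f with respect to (a, W), flattened to (n, m + m*d)."""
    m = a.shape[0]
    H = np.tanh(X @ W.T)                                   # (n, m) activations
    d_a = H / np.sqrt(m)                                   # df/da
    d_W = (a * (1.0 - H**2))[:, :, None] * X[:, None, :]   # (n, m, d), df/dW
    d_W = d_W / np.sqrt(m)
    return np.concatenate([d_a, d_W.reshape(len(X), -1)], axis=1)

def ntk_drift(width, X, y, steps=200, lr=0.2, seed=0):
    """Train with full-batch gradient descent on squared loss.

    Returns the empirical NTK at initialization, the NTK after training,
    and the relative Frobenius-norm change between them.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    W = rng.standard_normal((width, X.shape[1]))
    J0 = jacobian(a, W, X)
    K0 = J0 @ J0.T                                         # empirical NTK at init
    for _ in range(steps):
        J = jacobian(a, W, X)
        grad = J.T @ (forward(a, W, X) - y) / len(X)       # gradient of mean squared loss / 2
        a = a - lr * grad[:width]
        W = W - lr * grad[width:].reshape(W.shape)
    J1 = jacobian(a, W, X)
    K1 = J1 @ J1.T                                         # empirical NTK after training
    return K0, K1, np.linalg.norm(K1 - K0) / np.linalg.norm(K0)

# Toy data: unit-norm inputs and a smooth target (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sin(3.0 * X[:, 0])

for width in (16, 256, 2048):
    _, _, drift = ntk_drift(width, X, y)
    print(f"width={width:5d}  relative NTK change={drift:.4f}")
```

In the lazy-training picture one would expect the reported change to shrink as width grows, while a kernel that moves substantially is one signature of feature learning. This toy setup does not guarantee that ordering for every seed or step size.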

Tensions

Can learning mechanics, which focuses on average-case training dynamics and coarse aggregate statistics, ever constitute a comprehensive theory of deep learning, or is it structurally limited to explaining some aspects while leaving others permanently outside its scope? The April 2026 preprint 'There Will Be a Scientific Theory of Deep Learning' argues the question is not closed and now has active community engagement in the LessWrong venue where LawrenceC's pessimism originated. A live sub-question within this broader tension is whether the NTK evolution program, which tracks kernel dynamics during training rather than treating the kernel as fixed, provides supporting evidence for learning mechanics' feature-learning framing or constitutes a separate line of theoretical development. [16][21][22][23][24][27][15][1][2][12]
Is memorization causally necessary, causally harmful, or merely correlated with generalization — and does the answer change across training epochs, architectures, and task settings? Grokking research shows networks can memorize first and generalize later. Feldman's necessity claim is being formalized under long-tailed mixture models and a compositional mechanism account, while 'When Memorization Hurts Generalization' claims the opposite. A counterintuitive April 2026 finding that training data pruning can improve memorization of individual facts suggests data redundancy may suppress rather than reinforce fact memorization. The continual learning extension raises whether necessity holds across task boundaries. These positions may not be contradictory if memorization's effects depend on training epoch, regime, and task structure. [42][43][44][45][46][47][48][58][59][60][55][56][57][94][95][96][10][11]
Yang et al. 2021 establish an exact algebraic gap between generalization error and uniform convergence in random feature models. Negrea et al.'s ICML 2020 'In Defense of Uniform Convergence' argues the framework can be salvaged through derandomization applied to interpolating classifiers. Do derandomized uniform convergence arguments fall within the class foreclosed by the exact gap result, or do they constitute a genuine escape hatch that keeps the classical program viable? [14][64][65][97][98][61][62][63]
The NeurIPS 2025 dynamical decoupling result establishes that generalization and overfitting are separable phenomena in large two-layer networks. Does this separation persist in deeper networks and in transformer architectures, and if so, does it support or complicate the learning mechanics program's focus on aggregate training dynamics? [28][29][31][34][35][36][37][38][39][41]
If implicit dynamical regularization explains why diffusion models don't memorize, and Wu et al.'s SGD dynamical stability result provides a parallel mechanism, is there a unified dynamical account of memorization suppression across architectures? Or does the memorization debate remain architecture-regime-specific, with separate mechanisms governing transformers, diffusion models, and two-layer networks? [68][69][70][71][66][92][89][99]
PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while Yang et al.'s exact gap result says uniform convergence-style arguments are provably incapable of capturing the correct quantity. Are PAC-Bayes compression bounds a genuine escape hatch from the Nagarajan-Kolter impossibility, or do they fall within the class of arguments the exact gap result forecloses? [72][73][100][74][64][65]
Benign overfitting results show that interpolating classifiers can generalize under certain geometric conditions. ICML 2025's 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits these conditions — do they hold broadly enough to constitute a useful theory, or do they require fine-tuned assumptions that fail in realistic multi-layer, non-linear settings? [101][102][103][104][99][105][106]
Is mechanistic interpretability a practical alternative to statistical generalization theory for understanding deep learning, or does its circuit-level focus simply answer a different question? MIT's 2026 Breakthrough designation, a dedicated ICML 2026 workshop, and a public open problems status report have given mechanistic interpretability institutional infrastructure that statistical generalization theory lacks — but institutional success is not the same as theoretical displacement, and the two programs have not yet engaged each other directly. [83][84][85][86][16][21][7][8][87]

Sources

[1] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — LessWrong — reactive:deep-learning-theory-limits
[2] There Will Be a Scientific Theory of Deep Learning [R] - Reddit — reactive:deep-learning-theory-limits
[3] There Will Be a Scientific Theory of Deep Learning - YouTube — reactive:deep-learning-theory-limits
[4] There Will Be a Scientific Theory of Deep Learning | Cool Papers — reactive:deep-learning-theory-limits
[5] There Will Be a Scientific Theory of Deep Learning | alphaXiv — reactive:deep-learning-theory-limits
[6] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
[7] Mechanistic Interpretability Workshop at ICML 2026 — reactive:deep-learning-theory-limits
[8] Open problems in mechanistic interpretability: 2026 status report - Gist — reactive:deep-learning-theory-limits
[9] Improving AI models' ability to explain their predictions | MIT News — reactive:deep-learning-theory-limits
[10] [PDF] Training Data Pruning Improves Memorization of Facts - arXiv — reactive:deep-learning-theory-limits
[11] What is the role of memorization in Continual Learning? | OpenReview — reactive:deep-learning-theory-limits
[12] Understanding the Evolution of the Neural Tangent Kernel at the ... — reactive:deep-learning-theory-limits
[13] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
[14] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
[15] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
[16] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
[17] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
[18] The paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
[19] The other paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
[20] Lawrence Chan — reactive:deep-learning-theory-limits
[21] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
[22] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
[23] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
[24] Daniel Kunin — reactive:deep-learning-theory-limits
[25] [2604.21691] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
[26] There Will Be a Scientific Theory of Deep Learning | Takara TLDR — reactive:deep-learning-theory-limits
[27] Learning Mechanics and the Second Formation of Deep Learning ... — reactive:deep-learning-theory-limits
[28] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[29] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[30] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[31] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks | OpenReview — reactive:deep-learning-theory-limits
[32] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[33] [PDF] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[34] Generalization and overfitting in two-layer neural networks - YouTube — reactive:deep-learning-theory-limits
[35] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
[36] NeurIPS Oral Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[37] A Dynamical Theory of Overfitting and Generalization in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[38] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[39] [PDF] Dynamical decoupling of generalization and overfitting in large two ... — reactive:deep-learning-theory-limits
[40] Dynamical Decoupling of Generalization and Overfitting in Large... — reactive:deep-learning-theory-limits
[41] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
[42] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
[43] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
[44] Is Memorization Actually Necessary for Generalization - OpenReview — reactive:deep-learning-theory-limits
[45] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
[46] [PDF] WHEN MEMORIZATION HURTS GENERALIZATION — reactive:deep-learning-theory-limits
[47] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
[48] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
[49] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
[50] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
[51] [PDF] Chasing the Long Tail: What Neural Networks Memorize and Why — reactive:deep-learning-theory-limits
[52] What Neural Networks Memorize and Why - Chiyuan Zhang — reactive:deep-learning-theory-limits
[53] What Neural Networks Memorize and Why: Discovering the Long ... — reactive:deep-learning-theory-limits
[54] Does Learning Require Memorization? A Short Tale About A Long Tail — reactive:deep-learning-theory-limits
[55] (PDF) Necessary Memorization in Overparameterized Learning ... — reactive:deep-learning-theory-limits
[56] Memorizing Long-tail Data Can Help Generalization Through ... — reactive:deep-learning-theory-limits
[57] Memorizing Long-tail Data Can Help Generalization Through Composition — reactive:deep-learning-theory-limits
[58] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
[59] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
[60] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
[61] [PDF] In Defense of Uniform Convergence: Generalization via ... — reactive:deep-learning-theory-limits
[62] [1912.04265] In Defense of Uniform Convergence - arXiv — reactive:deep-learning-theory-limits
[63] In Defense of Uniform Convergence: Generalization via Derandomization with an Application to Interpolating Predictors — reactive:deep-learning-theory-limits
[64] [PDF] Exact Gap between Generalization Error and Uniform Convergence ... — reactive:deep-learning-theory-limits
[65] [2103.04554] Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models — reactive:deep-learning-theory-limits
[66] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
[67] [Quick Review] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
[68] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
[69] NeurIPS Poster Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
[70] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
[71] Why diffusion models in generative AI don’t memorize: The role of implicit dynamical regularization in training | LPENS — reactive:deep-learning-theory-limits
[72] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
[73] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
[74] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
[75] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
[76] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
[77] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
[78] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
[79] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
[80] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
[81] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
[82] Stability (learning theory) — reactive:deep-learning-theory-limits
[83] Mechanistic Interpretability Named MIT's 2026 Breakthrough for ... — reactive:deep-learning-theory-limits
[84] Bridging the Black Box: A Survey on Mechanistic Interpretability in AI — reactive:deep-learning-theory-limits
[85] Understanding Mechanistic Interpretability in AI Models - IntuitionLabs — reactive:deep-learning-theory-limits
[86] AI Safety, Alignment, and Interpretability in 2026 | Zylos Research — reactive:deep-learning-theory-limits
[87] Mechanistic interpretability: 10 Breakthrough Technologies 2026 | MIT Technology Review — reactive:deep-learning-theory-limits
[88] MIT Technology Review's Post - Mechanistic interpretability - LinkedIn — reactive:deep-learning-theory-limits
[89] ICLR Poster Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
[90] Generalization v.s. Memorization: Tracing Language Models'... — reactive:deep-learning-theory-limits
[91] [PDF] Generalization v.s. Memorization - arXiv — reactive:deep-learning-theory-limits
[92] For Better or for Worse, Transformers Seek Patterns for Memorization | OpenReview — reactive:deep-learning-theory-limits
[93] [PDF] Generalization v.s. Memorization: Tracing Language Models ... — reactive:deep-learning-theory-limits
[94] [PDF] Towards Understanding Grokking: An Effective Theory of ... — reactive:deep-learning-theory-limits
[95] Grokking as Dimensional Phase Transition in Neural Networks — reactive:deep-learning-theory-limits
[96] The complexity dynamics of grokking - ScienceDirect — reactive:deep-learning-theory-limits
[97] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[98] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[99] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
[100] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
[101] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
[102] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
[103] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
[104] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
[105] NeurIPS Benign Overfitting in Out-of-Distribution Generalization of Linear Models — reactive:deep-learning-theory-limits
[106] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
[107] Uniform convergence may be unable to explain generalization in ... — reactive:deep-learning-theory-limits
[108] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[109] [1906.05271] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
[110] Understanding Deep Learning (Still) Requires Rethinking ... — reactive:deep-learning-theory-limits
[111] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
[112] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
[113] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits
[114] Mechanistic interpretability: 10 Breakthrough Technologies 2026 | MIT Technology Review — reactive:deep-learning-theory-limits