Deep Learning Theory Is Broken — And Maybe Unfixable · history
Version 7
2026-05-01 04:59 UTC · 215 items
Narrative
The discourse enters a new phase this cycle with three clusters previously absent from this thread: grokking (delayed generalization after memorization), mechanistic interpretability, and the maximal update parameterization (μP) practitioner community. The grokking cluster, which includes an April 2026 preprint framing grokking as a dimensional phase transition in neural networks[1], a NeurIPS 2022 effective-theory account of the phenomenon[2], and a ScienceDirect paper on its complexity dynamics[3], introduces a temporal dimension the memorization debate had largely elided. If networks routinely memorize training data first and only later, with continued training, develop generalizable representations, then the binary framing of "memorization vs. generalization" misses an essential third state: the transition from one to the other. This complicates both Feldman's static necessity claim and the ICLR 2025 argument that memorization hurts generalization[4][5][6]. Both propositions may hold at different training epochs and in different regimes, which suggests the disagreement is less about the phenomenon itself than about which phase of training is being analyzed.
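To make the three-state reading concrete, the following minimal Python sketch labels each logged checkpoint of a training run as pre-fit, memorizing, or generalizing, based on its train and held-out accuracy. The function name `label_phases`, the 0.95 threshold, and the toy accuracy curves are illustrative assumptions and are not taken from any of the cited papers.

```python
from typing import List


def label_phases(train_acc: List[float], test_acc: List[float],
                 threshold: float = 0.95) -> List[str]:
    """Label each checkpoint by comparing accuracies to a hypothetical threshold."""
    if len(train_acc) != len(test_acc):
        raise ValueError("accuracy curves must have the same length")
    labels = []
    for tr, te in zip(train_acc, test_acc):
        if tr < threshold:
            labels.append("pre-fit")        # training set not yet fit
        elif te < threshold:
            labels.append("memorizing")     # fits train, fails held-out
        else:
            labels.append("generalizing")   # fits both
    return labels


# Toy curves shaped like a grokking run (made-up numbers).
train = [0.40, 0.99, 1.00, 1.00, 1.00]
test = [0.10, 0.12, 0.15, 0.60, 0.98]
phases = label_phases(train, test)
print(phases)
```

Under this labeling, a claim about "memorization" made at the second checkpoint and one made at the last checkpoint describe different phases of the same run, which is the sense in which the two positions need not conflict.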
Two previously missing perspectives enter the synthesis this cycle. First, Negrea et al.'s ICML 2020 paper "In Defense of Uniform Convergence"[7], surfaced via reactive search but not previously tracked, argues that uniform convergence arguments can be partially salvaged through derandomization techniques applied to interpolating classifiers, offering a counterpoint to the Nagarajan-Kolter impossibility that was absent from prior synthesis. Whether derandomization constitutes a genuine escape hatch or merely a special-case construction that does not engage Yang et al.'s exact algebraic gap[8][9] remains unresolved. Second, an April 2026 arxiv preprint titled "There Will Be a Scientific Theory of Deep Learning"[10] gives the optimistic position a formal, paper-length statement, its title directly opposing the pessimism of LawrenceC's framing[11][12]. The title matches both the paper reviewed in LawrenceC's April 25 post[12] and the Simon-Kunin podcast episode[34], so the preprint is most likely the learning mechanics manifesto itself rather than a response to LawrenceC's series; its arxiv number (2604.21691) indicates an April 2026 submission, and beyond that review it has not yet attracted indexed commentary. Within the memorization sub-debate, the picture narrows and mechanizes: "Necessary Memorization in Overparameterized Learning under Long-Tailed Mixture Models"[13] gives formal theoretical support for Feldman's claim in a specific mathematical setting and connects memorization to differential privacy implications, a downstream consequence not previously tracked, while "Memorizing Long-tail Data Can Help Generalization Through Composition"[14][15] proposes that the operative mechanism is compositional rather than mere coverage, a refinement that both rehabilitates and constrains Feldman's thesis.
The μP cluster materializes in practitioner-facing venues[16][17][18], extending to grouped-query attention architectures relevant to production LLMs[19]. LawrenceC had named hyperparameter transfer (μP's key benefit) as the primary practical output of the learning mechanics program, treating this as evidence of narrow theoretical reach[12]. The practitioner adoption now documented — EleutherAI's practitioner's guide, individual engineers' write-ups — confirms real-world impact but also shows μP being consumed as an engineering recipe rather than a theoretical lens, consistent with LawrenceC's critique rather than refuting it. Wu et al.'s ICML 2023 "The Implicit Regularization of Dynamical Stability in SGD"[20] bridges the algorithmic stability and implicit regularization strands: SGD's dynamical stability provides an implicit regularizer that suppresses memorization, connecting the dynamical decoupling program[21][22] to the diffusion models result[23][24] through a shared mechanistic account of why training dynamics prevent memorization without architectural constraints.
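Hyperparameter transfer under μP is usually consumed as a width-scaling rule. The sketch below shows one commonly described form for Adam-style optimizers: the hidden-layer learning rate tuned at a small base width is divided by the width multiplier, while embedding-like parameters keep the base rate. The function and parameter names are hypothetical, and real recipes such as the practitioner guides cited above also adjust initialization and output scaling, which this sketch omits.

```python
def mup_learning_rates(base_lr: float, base_width: int, width: int) -> dict:
    """Return per-group learning rates under a simplified muP-style rule (Adam)."""
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    width_mult = width / base_width          # how much wider than the tuned proxy
    return {
        "embedding": base_lr,                # vector-like params: rate unchanged
        "hidden": base_lr / width_mult,      # matrix-like params: rate shrinks with width
    }


# Tune at width 256, then reuse the same base_lr at width 4096.
rates = mup_learning_rates(base_lr=1e-2, base_width=256, width=4096)
print(rates)
```

That the rule fits in a few lines is consistent with the observation above: it can be applied as a recipe without engaging the theory behind it.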
The mechanistic interpretability cluster — anchored by MIT's 2026 Breakthrough designation[25], an ACM survey on circuit-level interpretability[26], and growing educational infrastructure[27] — introduces a quiet background reframing of the full debate. The failure of statistical generalization theory matters less if circuit-level mechanistic understanding is succeeding as an alternative program. This is not a direct engagement with the generalization debate but is consistent with the AI safety community's amplification of LawrenceC's posts[28][29], and suggests the audience for this thread is beginning to identify a practical alternative paradigm: rather than asking when there will be a statistical theory of why networks generalize, asking what specific algorithms individual networks implement. Lawrence Chan's personal website[30] confirms his academic identity, adding biographical grounding to his role as the thread's central organizing voice and his position bridging the ML theory and AI safety communities.
Timeline
- 2016-01-01: Zhang et al. demonstrate that standard neural networks can memorize completely random labels on CIFAR-10 and ImageNet, invalidating data-independent generalization bounds. [11]
- 2019-01-01: Nagarajan and Kolter show empirically that norm-based bounds (including spectral-norm bounds) can grow with training set size instead of shrinking, and prove in an overparameterized linear setting that uniform convergence cannot explain the generalization of gradient descent. [31][83][84][93][94]
- 2019-06-01: Feldman publishes 'Does Learning Require Memorization? A Short Tale about a Long Tail,' arguing memorization of tail examples is causally necessary for learning from long-tailed distributions. [51][59][95][60][61][62][63]
- 2020-07-01: Negrea et al. publish 'In Defense of Uniform Convergence' at ICML 2020, arguing that uniform convergence can be partially recovered through derandomization applied to interpolating classifiers — a counterpoint to Nagarajan-Kolter previously untracked in this synthesis. [7]
- 2021-01-01: Yang et al. establish an exact algebraic gap between generalization error and the tightest possible uniform convergence bound in random feature models, giving Nagarajan-Kolter a precise formal complement. [8][9]
- 2021-01-01: Communications of the ACM publishes the canonical journal version of Zhang et al.'s rethinking-generalization result, cementing it as a textbook-permanent finding rather than a contested empirical claim. [96]
- 2022-01-01: A NeurIPS 2022 paper claims PAC-Bayes compression bounds can be made tight enough to actually explain generalization in neural networks, directly challenging the narrative that all known bounds are vacuous. [67][68][69]
- 2022-01-01: NeurIPS 2022 paper 'Towards Understanding Grokking: An Effective Theory of Representation Learning' provides an effective-theory account of delayed generalization, framing grokking as a consequence of representation learning dynamics. [2]
- 2023-07-01: Wu et al. publish 'The Implicit Regularization of Dynamical Stability in SGD' at ICML 2023, showing that SGD's dynamical stability provides an implicit regularizer that suppresses memorization — bridging the algorithmic stability and implicit regularization research strands. [20][64]
- 2024-01-01: NeurIPS 2024 paper on symmetries in overparameterized neural networks using a mean-field view offers a new structural lens on why overparameterization does not prevent generalization. [97][98]
- 2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales. [99]
- 2025-01-01: Urbani delivers a seminar at the Stanford Mathematics department on generalization in two-layer neural networks, confirming the Montanari-Urbani result is circulating in pure-math venues beyond ML conferences. [50]
- 2025-01-01: ICLR 2025 paper 'When Memorization Hurts Generalization' argues memorization can actively damage generalization performance, a stronger claim than merely saying memorization is unnecessary. [55][4][5][6]
- 2025-01-01: NeurIPS 2025 Oral 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' (Montanari and Urbani) presents results separating generalization dynamics from overfitting dynamics in the large two-layer regime; oral status confirmed, placing it in the top ~1-2% of NeurIPS 2025 submissions. [39][40][21][22][41][42][43][45][46][47][48][49]
- 2025-01-01: NeurIPS 2025 paper 'Why Diffusion Models Don't Memorize' attributes diffusion models' resistance to memorization to implicit dynamical regularization during training, connecting training dynamics to memorization suppression. [65][66][23][24]
- 2025-01-01: ICML 2025 poster 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits the conditions under which interpolating classifiers can generalize in the two-layer setting. [85]
- 2025-01-01: ICLR 2025 poster 'Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data' extends the memorization-generalization debate to LLMs, examining how specific pretraining data drives capability versus memorization. [78][79][80][82]
- 2025-01-01: 'Necessary Memorization in Overparameterized Learning under Long-Tailed Mixture Models: Theory and Privacy Implications' provides formal theoretical support for Feldman's necessity claim in a specific mathematical setting, and introduces differential privacy leakage as a downstream consequence of memorization. [13]
- 2025-02-19: Pierfrancesco Urbani (CNRS) delivers an invited talk at the Simons Institute (Berkeley) on 'Generalization and overfitting in two-layer neural networks,' extending the dynamical decoupling result's reach to a major theory venue. [44]
- 2025-10-01: 'Memorizing Long-tail Data Can Help Generalization Through Composition' (arxiv 2510.16322) proposes that the mechanism by which tail memorization aids generalization is compositional rather than coverage-based, offering a mechanistic refinement of Feldman's thesis. [14][15]
- 2026-01-01: Mechanistic interpretability is named an MIT 2026 Breakthrough, an institutional recognition of circuit-level understanding as a research program, which this synthesis reads as a possible alternative to statistical generalization theory for understanding deep learning. [25]
- 2026-04-01: 'Grokking as Dimensional Phase Transition in Neural Networks' (arxiv 2604.04655) introduces a new theoretical framework for delayed generalization, framing the memorization-to-generalization transition as a phase transition in the dimensionality of learned representations. [1]
- 2026-04-24: Jamie Simon and Daniel Kunin (UC Berkeley) appear on Imbue's podcast arguing that a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [34]
- 2026-04-25: LawrenceC publishes a critical review of Simon et al.'s 'learning mechanics' manifesto on the Alignment Forum, welcoming its ambition while doubting it will deliver a comprehensive or broadly useful theory. [12]
- 2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory,' providing detailed technical and historical context for why Zhang et al. 2016 was so devastating to the classical generalization-bound paradigm. [11][33]
- 2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory,' narrating Nagarajan and Kolter 2019 as the definitive proof that uniform convergence cannot explain neural network generalization; the post is crossposted to LessWrong and drives renewed interest in the original paper. [31][32][28]
- 2026-04-28: 'There Will Be a Scientific Theory of Deep Learning' (arxiv 2604.21691) is indexed as an April 2026 preprint making the formal, paper-length optimistic case against the thread's founding pessimism. Its title matches the paper LawrenceC reviewed on 2026-04-25 [12] and the Imbue podcast episode [34], so it is most likely the learning mechanics manifesto itself and predates this indexing date. [10]
- 2026-04-30: At least three distinct OpenReview submissions titled 'Is Memorization Actually Necessary for Generalization?' appear, representing independent formal challenges to Feldman's affirmative claim. [52][53][54][56][57]
Perspectives
LawrenceC (Alignment Forum / LessWrong) — confirmed as Lawrence Chan
Classical deep learning theory was irreparably broken by two landmark papers (Zhang et al. 2016; Nagarajan & Kolter 2019). The proposed replacement, learning mechanics, is a promising manifesto but has so far produced little practical fruit beyond hyperparameter scaling (μP), explicitly does not aim to explain the specific algorithms learned by networks, and has not yet earned the title of a comprehensive theory of deep learning.
Evolution: Consistent across all three posts. Lawrence Chan's personal website confirms his academic identity and bridge role between ML theory and AI safety communities. The μP practitioner cluster now documents real-world adoption of the theory's key practical output — the tool LawrenceC named as its primary deliverable — confirming the framing without resolving whether this output is sufficient to constitute a 'theory.'
Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)
A scientific theory of deep learning is achievable; their 'learning mechanics' framework, grounded in average-case training dynamics and aggregate statistics, is the right approach. Publicly promoted via Imbue podcast (April 24, 2026), YouTube presentation, and ongoing Twitter activity.
Evolution: Consistent; no direct public response to LawrenceC's critique yet apparent.
'There Will Be a Scientific Theory of Deep Learning' authors (April 2026 preprint)
A scientific theory of deep learning is achievable, a title that directly opposes the pessimistic framing of LawrenceC's series and the thread's organizing premise. The title matches the paper reviewed in LawrenceC's April 25 post and the Simon-Kunin podcast episode, so this entry most likely overlaps with the learning mechanics perspective above rather than representing an independent group.
Evolution: Newly tracked as a preprint this cycle. Apart from LawrenceC's review, no indexed commentary is yet apparent; its reception will determine whether it enters the discourse as a serious theoretical contribution or a thesis statement.
Independent Medium commentary ('second formation' framing)
Learning mechanics represents a genuine paradigm transition — a 'second formation' of deep learning theory analogous to statistical mechanics succeeding classical mechanics — not merely an incremental research program.
Evolution: Consistent from prior cycle. More optimistic than Simon and Kunin's own public claims and directly contradicts LawrenceC's skepticism about practical utility.
Montanari and Urbani (dynamical decoupling)
In the large two-layer network regime, generalization dynamics and overfitting dynamics are separable — a result that, if it extends to deeper networks, would provide an algorithm-dependent structural account of why networks generalize despite interpolating training data.
Evolution: Consistent from prior cycle. NeurIPS 2025 Oral status and cross-disciplinary reach (Simons Institute, Stanford Mathematics) confirmed in prior cycle; no new developments this cycle.
Vitaly Feldman (memorization is necessary)
Memorization of tail examples is causally necessary for learning from long-tailed distributions. This reframes Zhang et al.'s result: memorization is part of learning, not evidence that theory is broken.
Evolution: Previously under challenge from three OpenReview submissions and ICLR 2025's 'When Memorization Hurts Generalization.' This cycle adds formal theoretical support in a specific setting ('Necessary Memorization in Overparameterized Learning under Long-Tailed Mixture Models') and a compositional mechanism account ('Memorizing Long-tail Data Can Help Generalization Through Composition'). The formal corroboration partially rehabilitates Feldman's core claim while narrowing it to specific distributional assumptions. The debate remains unresolved.
ICLR 2025 'When Memorization Hurts Generalization' authors
Memorization is not merely unnecessary for generalization but can actively damage it — the strongest anti-Feldman position yet to appear in a peer-reviewed venue.
Evolution: Consistent from prior cycle. Additional URLs (ICLR proceedings, OpenReview, arxiv HTML) confirmed this cycle, indicating the paper is now widely indexed.
Negrea et al. (defense of uniform convergence via derandomization)
Uniform convergence arguments can be partially recovered through derandomization techniques applied to interpolating classifiers, suggesting the Nagarajan-Kolter impossibility is not a blanket closure of the uniform convergence program.
Evolution: New voice this cycle; ICML 2020 paper surfaced via reactive search but not previously tracked in this synthesis. Whether this result engages or sidesteps Yang et al.'s exact gap result is unclear from available metadata.
Yang et al. 2021 (exact gap, uniform convergence)
The failure of uniform convergence in random feature models is not an artifact of loose bounds but is provably exact — there is a measurable algebraic gap between any uniform convergence bound and the true generalization error, even in principle.
Evolution: Consistent from prior cycle. The newly surfaced Negrea et al. defense of uniform convergence via derandomization creates an unresolved question about whether the exact gap forecloses derandomized arguments.
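For readers less familiar with the terms, the standard relationship at stake can be written as follows. The notation is a conventional choice assumed here and is not drawn from the cited papers: L_D is population risk, \hat L_S is empirical risk on a sample S, H is the hypothesis class, and \hat h in H is the learned predictor.

```latex
\underbrace{L_{\mathcal D}(\hat h) - \hat L_S(\hat h)}_{\text{generalization gap of the learned predictor}}
\;\le\;
\underbrace{\sup_{h \in \mathcal H} \bigl| L_{\mathcal D}(h) - \hat L_S(h) \bigr|}_{\text{uniform convergence bound over } \mathcal H}
```

The left side can be small while the right side stays large. An exact-gap result quantifies that difference instead of attributing it to a loose analysis.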
Wu et al. ICML 2023 (implicit regularization of dynamical stability, SGD)
SGD's dynamical stability provides an implicit regularizer that suppresses memorization and promotes generalization — a mechanistic account connecting training-dynamics arguments to the diffusion models result and the dynamical decoupling program.
Evolution: New voice this cycle; ICML 2023 paper surfaced via reactive search. Bridges the algorithmic stability cluster and the implicit regularization strand, suggesting a unified dynamical account of memorization suppression.
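The intuition behind dynamical stability is easiest to see in the full-batch case. That simplification is an assumption of this sketch, not the SGD setting the paper analyzes. On a quadratic loss with curvature `sharpness`, gradient descent with step size `lr` contracts only when `lr < 2 / sharpness`, so sufficiently sharp minima cannot be held at a given learning rate.

```python
def gd_on_quadratic(sharpness: float, lr: float, steps: int = 50, x0: float = 1.0) -> float:
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return |x| at the end."""
    x = x0
    for _ in range(steps):
        grad = sharpness * x      # f'(x)
        x = x - lr * grad         # each step multiplies x by (1 - lr * sharpness)
    return abs(x)


lr = 0.1                          # stability threshold on sharpness is 2 / lr = 20
flat = gd_on_quadratic(sharpness=5.0, lr=lr)    # |1 - 0.5| < 1: converges
sharp = gd_on_quadratic(sharpness=25.0, lr=lr)  # |1 - 2.5| > 1: diverges
print(flat, sharp)
```

The optimizer's own dynamics rule out some solutions here without any explicit penalty term, which is what makes the regularization "implicit".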
NeurIPS 2025 diffusion model memorization researchers
Implicit dynamical regularization during training prevents memorization in diffusion models — a training-dynamics explanation for a phenomenon that would otherwise require architectural or data-geometric accounts.
Evolution: Consistent from prior cycle. Additional URLs (OpenReview direct link, LPENS institutional page) confirmed this cycle, marking institutional dissemination beyond the conference venue.
PAC-Bayes compression bounds researchers (NeurIPS 2022)
PAC-Bayes bounds can be made sufficiently tight to actually explain generalization in neural networks — the vacuousness of prior bounds was not a fundamental limit of the framework but an artifact of loose construction.
Evolution: Consistent; no direct engagement with Yang et al.'s exact gap result or Negrea et al.'s derandomization argument yet apparent.
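What "vacuous" versus "tight" means here can be illustrated with a McAllester-style PAC-Bayes bound in its common square-root form. The sketch below uses a generic textbook form, not the compression bound from the cited paper, and the input numbers are made up for illustration.

```python
import math


def pac_bayes_bound(emp_risk: float, kl: float, n: int, delta: float = 0.05) -> float:
    """McAllester-style bound: emp_risk + sqrt((KL + ln(2*sqrt(n)/delta)) / (2n))."""
    if n <= 0 or not (0.0 < delta < 1.0) or kl < 0.0:
        raise ValueError("need n > 0, 0 < delta < 1, kl >= 0")
    complexity = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return emp_risk + math.sqrt(complexity)


n = 50_000
loose = pac_bayes_bound(emp_risk=0.01, kl=1e6, n=n)   # huge KL term: bound exceeds 1
tight = pac_bayes_bound(emp_risk=0.01, kl=5e3, n=n)   # small KL term: bound below 1
print(round(loose, 3), round(tight, 3))
```

A bound on 0-1 error that exceeds 1 carries no information, which is the sense of "vacuous". Shrinking the KL term, for example through a more compressible description of the model, is what moves the bound into the informative range.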
Algorithmic stability / SGD stability researchers
Generalization in deep learning can be explained through the stability properties of SGD. This approach is inherently algorithm- and data-dependent, directly addressing the Nagarajan-Kolter critique that uniform convergence ignores gradient descent's inductive bias.
Evolution: Consistent; Wu et al. ICML 2023 now bridges this cluster to the implicit regularization and diffusion memorization strands, providing a shared mechanistic account.
Mechanistic interpretability community (MIT 2026 Breakthrough)
Understanding deep learning through circuit-level mechanistic analysis of what algorithms individual networks implement is a viable — and now institutionally recognized — alternative to statistical generalization theory.
Evolution: New voice this cycle, anchored by MIT's 2026 Breakthrough designation and an ACM survey. Does not directly engage the generalization debate but implicitly reframes it: the question shifts from 'why do networks generalize statistically?' to 'what do individual networks compute mechanistically?'
ICLR 2025 LLM memorization researchers
Language models' generalization capabilities can be traced back to specific pretraining data, implying a causal data-memorization link that extends Feldman's long-tail thesis to the LLM regime.
Evolution: Consistent from prior cycle. ICLR proceedings PDF confirmed this cycle.
Tensions
- Can learning mechanics, which focuses on average-case training dynamics and coarse aggregate statistics, ever constitute a comprehensive theory of deep learning — or is it structurally limited to explaining some aspects while leaving others permanently outside its scope? The April 2026 preprint 'There Will Be a Scientific Theory of Deep Learning' argues the question is not closed, while LawrenceC's skepticism and the 'second formation' framing remain in direct tension. [12][34][35][36][37][38][10]
- Is memorization causally necessary, causally harmful, or merely correlated with generalization — and does the answer change across training epochs? Grokking research shows networks can memorize first and generalize later, adding a temporal dimension to a debate previously conducted as if memorization were a fixed property. Feldman's necessity claim is being formalized under long-tailed mixture models and a compositional mechanism account, while 'When Memorization Hurts Generalization' claims the opposite. These positions may not be contradictory if memorization's effects depend on training epoch and regime. [51][52][53][54][55][56][57][4][5][6][13][14][15][2][1][3]
- Yang et al. 2021 establish an exact algebraic gap between generalization error and uniform convergence in random feature models. Negrea et al.'s ICML 2020 'In Defense of Uniform Convergence' argues the framework can be salvaged through derandomization applied to interpolating classifiers. Do derandomized uniform convergence arguments fall within the class foreclosed by the exact gap result, or do they constitute a genuine escape hatch that keeps the classical program viable? [31][8][9][83][84][7]
- The NeurIPS 2025 dynamical decoupling result establishes that generalization and overfitting are separable phenomena in large two-layer networks. Does this separation persist in deeper networks and in transformer architectures, and if so, does it support or complicate the learning mechanics program's focus on aggregate training dynamics? [39][40][22][43][44][45][46][47][48][50]
- If implicit dynamical regularization explains why diffusion models don't memorize, and Wu et al.'s SGD dynamical stability result provides a parallel mechanism, is there a unified dynamical account of memorization suppression across architectures? Or does the memorization debate remain architecture-regime-specific, with separate mechanisms governing transformers, diffusion models, and two-layer networks? [65][66][23][24][20][81][78][85]
- PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while Yang et al.'s exact gap result says uniform convergence-style arguments are provably incapable of capturing the correct quantity. Are PAC-Bayes compression bounds a genuine escape hatch from the Nagarajan-Kolter impossibility, or do they fall within the class of arguments the exact gap result forecloses? [67][68][86][69][8][9]
- Benign overfitting results show that interpolating classifiers can generalize under certain geometric conditions. ICML 2025's 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits these conditions — do they hold broadly enough to constitute a useful theory, or do they require fine-tuned assumptions that fail in realistic multi-layer, non-linear settings? [87][88][89][90][85][91][92]
- Is mechanistic interpretability a practical alternative to statistical generalization theory for understanding deep learning, or does its circuit-level focus on what algorithms individual networks implement simply answer a different question than statistical learning theory was trying to answer? MIT's 2026 Breakthrough designation for mechanistic interpretability implicitly reframes the theory-failure narrative without directly engaging it. [25][26][27][29][12][34]
Sources
- [1] Grokking as Dimensional Phase Transition in Neural Networks — reactive:deep-learning-theory-limits
- [2] [PDF] Towards Understanding Grokking: An Effective Theory of ... — reactive:deep-learning-theory-limits
- [3] The complexity dynamics of grokking - ScienceDirect — reactive:deep-learning-theory-limits
- [4] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
- [5] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
- [6] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
- [7] [PDF] In Defense of Uniform Convergence: Generalization via ... — reactive:deep-learning-theory-limits
- [8] [PDF] Exact Gap between Generalization Error and Uniform Convergence ... — reactive:deep-learning-theory-limits
- [9] [2103.04554] Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models — reactive:deep-learning-theory-limits
- [10] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
- [11] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
- [12] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
- [13] (PDF) Necessary Memorization in Overparameterized Learning ... — reactive:deep-learning-theory-limits
- [14] Memorizing Long-tail Data Can Help Generalization Through ... — reactive:deep-learning-theory-limits
- [15] Memorizing Long-tail Data Can Help Generalization Through Composition — reactive:deep-learning-theory-limits
- [16] Maximal Update Parameterization (μP) — reactive:deep-learning-theory-limits
- [17] The Practitioner's Guide to the Maximal Update Parameterization | EleutherAI Blog — reactive:deep-learning-theory-limits
- [18] Maximal Update Parameterization (muP or μP) - Sam Stevens — reactive:deep-learning-theory-limits
- [19] GQA-μP: The Maximal Parameterization Update for Grouped ... — reactive:deep-learning-theory-limits
- [20] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
- [21] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [22] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks | OpenReview — reactive:deep-learning-theory-limits
- [23] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
- [24] Why diffusion models in generative AI don’t memorize: The role of implicit dynamical regularization in training | LPENS — reactive:deep-learning-theory-limits
- [25] Mechanistic Interpretability Named MIT's 2026 Breakthrough for ... — reactive:deep-learning-theory-limits
- [26] Bridging the Black Box: A Survey on Mechanistic Interpretability in AI — reactive:deep-learning-theory-limits
- [27] Understanding Mechanistic Interpretability in AI Models - IntuitionLabs — reactive:deep-learning-theory-limits
- [28] The other paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
- [29] AI Safety, Alignment, and Interpretability in 2026 | Zylos Research — reactive:deep-learning-theory-limits
- [30] Lawrence Chan — reactive:deep-learning-theory-limits
- [31] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
- [32] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
- [33] The paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
- [34] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
- [35] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
- [36] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
- [37] Daniel Kunin — reactive:deep-learning-theory-limits
- [38] Learning Mechanics and the Second Formation of Deep Learning ... — reactive:deep-learning-theory-limits
- [39] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
- [40] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [41] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [42] [PDF] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [43] Generalization and overfitting in two-layer neural networks - YouTube — reactive:deep-learning-theory-limits
- [44] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
- [45] NeurIPS Oral Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
- [46] A Dynamical Theory of Overfitting and Generalization in Large Two-Layer Networks — reactive:deep-learning-theory-limits
- [47] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
- [48] [PDF] Dynamical decoupling of generalization and overfitting in large two ... — reactive:deep-learning-theory-limits
- [49] Dynamical Decoupling of Generalization and Overfitting in Large... — reactive:deep-learning-theory-limits
- [50] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
- [51] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
- [52] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
- [53] Is Memorization Actually Necessary for Generalization - OpenReview — reactive:deep-learning-theory-limits
- [54] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
- [55] [PDF] WHEN MEMORIZATION HURTS GENERALIZATION — reactive:deep-learning-theory-limits
- [56] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
- [57] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
- [58] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
- [59] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
- [60] [PDF] Chasing the Long Tail: What Neural Networks Memorize and Why — reactive:deep-learning-theory-limits
- [61] What Neural Networks Memorize and Why - Chiyuan Zhang — reactive:deep-learning-theory-limits
- [62] What Neural Networks Memorize and Why: Discovering the Long ... — reactive:deep-learning-theory-limits
- [63] Does Learning Require Memorization? A Short Tale About A Long Tail — reactive:deep-learning-theory-limits
- [64] [Quick Review] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
- [65] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
- [66] NeurIPS Poster Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
- [67] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
- [68] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
- [69] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
- [70] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
- [71] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
- [72] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
- [73] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
- [74] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
- [75] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
- [76] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
- [77] Stability (learning theory) — reactive:deep-learning-theory-limits
- [78] ICLR Poster Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
- [79] Generalization v.s. Memorization: Tracing Language Models'... — reactive:deep-learning-theory-limits
- [80] [PDF] generalization v.s. memorization - arXiv — reactive:deep-learning-theory-limits
- [81] For Better or for Worse, Transformers Seek Patterns for Memorization | OpenReview — reactive:deep-learning-theory-limits
- [82] [PDF] generalization v.s. memorization: tracing language models ... — reactive:deep-learning-theory-limits
- [83] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
- [84] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
- [85] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
- [86] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
- [87] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
- [88] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
- [89] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
- [90] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
- [91] NeurIPS Benign Overfitting in Out-of-Distribution Generalization of Linear Models — reactive:deep-learning-theory-limits
- [92] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
- [93] Uniform convergence may be unable to explain generalization in ... — reactive:deep-learning-theory-limits
- [94] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
- [95] [1906.05271] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
- [96] Understanding Deep Learning (Still) Requires Rethinking ... — reactive:deep-learning-theory-limits
- [97] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
- [98] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
- [99] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits