Deep Learning Theory Is Broken — And Maybe Unfixable

closed · v11 · 2026-05-03 · 361 items · history

What's new in v11

Three genuinely new developments distinguish this pass: 'Generalization at the Edge of Stability' (2604.19740) completes the edge-of-stability (EoS) cluster's transition from describing neural tangent kernel (NTK) dynamics to addressing generalization outcomes; the 'Separation of timescales' result (Harvard Math, December 2025) provides the first formal mechanism candidate to unify the dynamical decoupling, grokking, and LLM memorization dynamics results; and Marc Mézard's NeurIPS 2025 Award makes NeurIPS 2025 the conference at which three distinct recognitions (career award, Oral, Best Paper) converged on the statistical-physics approach. Additional developments (the ICML 2026 Mechanistic Interpretability workshop entering reviewer recruitment, μP gaining industry practitioner documentation at Cerebras and Microsoft, and 'Reason to Rote' extending the memorization debate to reasoning) extend rather than reframe existing strands.

Narrative

Three significant developments distinguish this cycle. 'Generalization at the Edge of Stability' (arxiv 2604.19740)[1][2], announced via YouTube talk[3] and X/Twitter[4], directly examines whether generalization behavior changes in the EoS training regime, filling the gap left by the NTK EoS evolution paper[5][6][7], which described NTK dynamics in EoS without addressing generalization outcomes. The EoS cluster now spans four distinct contributions: the NTK EoS evolution preprint, a NeurIPS 2025 convergence-rates poster[8], an ICML 2025 sharpness-dynamics poster[9], and the new generalization preprint. Marc Mézard, whose cavity and replica methods underpin the Montanari-Urbani dynamical decoupling program, received a NeurIPS 2025 Award[10], making NeurIPS 2025 the conference at which three distinct recognitions (career award, Oral, Best Paper) converged on the statistical-physics approach to neural network theory, a clustering without parallel in the classical generalization bounds program.

'Separation of timescales controls feature learning and overfitting in large neural networks,' covered by TechXplore[11] and delivered at the Harvard Mathematics department[12], argues that feature learning precedes overfitting in large overparameterized networks — providing the first formal mechanism candidate to unify the Montanari-Urbani decoupling result, grokking's sequential memorize-then-generalize structure, and Tirumala et al.'s LLM training dynamics finding that memorization peaks early while generalization continues[13]. The Harvard Math venue placement directly parallels the Stanford Mathematics[14] and Simons Institute[15] appearances of the Montanari-Urbani result. 'Reason to Rote: Rethinking Memorization in Reasoning' (EMNLP 2025)[16] extends the memorization debate to reasoning tasks, asking whether apparent reasoning capabilities rely on memorized patterns — a new front connecting the Feldman necessity debate to reasoning performance research.

The practice cluster has consolidated. μP has acquired industry-scale practitioner documentation at Cerebras[17] and Microsoft[18][19], partially addressing Lawrence Chan's critique that learning mechanics had produced little beyond a useful heuristic. The ICML 2026 Mechanistic Interpretability workshop entered active reviewer recruitment[20], confirming the transition from announced to operational. Andriy Burkov amplified the learning mechanics preprint to a large practitioner audience[21]. A Springer paper combining pruning with differential privacy for language models[22] bridges the engineering memorization mitigation cluster to the formal memorization-privacy nexus of item 3189. The XAI transition cluster has grown with a Nature npj paper arguing XAI needs formalization[23] and a Springer paper advocating inherently interpretable models[24], though these remain institutionally disconnected from the mechanistic interpretability and statistical generalization theory programs.

Timeline

2016-01-01: Zhang et al. demonstrate neural networks can memorize random labels, invalidating data-independent generalization bounds. [26]
2019-01-01: Nagarajan and Kolter prove that uniform convergence is insufficient to explain the generalization of gradient descent in an overparameterized linear setting. [25][171][172][183][184]
2019-06-01: Feldman publishes 'Does Learning Require Memorization?' arguing memorization of tail examples is causally necessary for learning from long-tailed distributions. [69][74][185][75][76][77][78][86][186]
2020-07-01: Negrea et al. 'In Defense of Uniform Convergence' at ICML 2020 argues uniform convergence can be partially recovered through derandomization applied to interpolating classifiers. [90][91][92]
2020-11-01: 'Disentangling Feature and Lazy Training in Deep Neural Networks' provides a formal framework for distinguishing NTK (lazy) regime from feature learning regime. [150]
2021-01-01: Yang et al. establish an exact algebraic gap between generalization error and the tightest possible uniform convergence bound in random feature models. [93][94]
2021-01-01: Communications of the ACM publishes the canonical journal version of Zhang et al.'s rethinking-generalization result. [187]
2022-01-01: NeurIPS 2022 paper claims PAC-Bayes compression bounds can be tight enough to explain generalization in neural networks. [111][112][113]
2022-01-01: NeurIPS 2022 'Towards Understanding Grokking: An Effective Theory of Representation Learning' frames grokking as a consequence of representation learning dynamics. [168]
2022-01-01: Tirumala et al. 'Memorization Without Overfitting' (NeurIPS 2022) finds memorization in LLMs peaks early then declines while generalization continues — temporal separation echoing grokking's sequential structure. [13]
2023-07-01: Wu et al. ICML 2023 show SGD's dynamical stability provides an implicit regularizer suppressing memorization, bridging algorithmic stability and implicit regularization. [95][96]
2023-11-01: 'Mind the Spikes: Benign Overfitting of Kernels and Neural Networks' (NeurIPS 2023) connects NTK spectrum spikes to benign overfitting conditions. [181]
2024-01-01: NeurIPS 2024 paper on symmetries in overparameterized neural networks via mean-field view offers a structural lens on why overparameterization does not prevent generalization. [188][189]
2024-07-01: u-μP (arxiv 2407.17465) refines the μP framework with unit scaling, representing continued active development of learning mechanics' main practical artifact. [167]
2024-10-01: 'Mitigating Memorization in Language Models' proposes engineering techniques to reduce LLM memorization, treating it as a practical harm rather than a theoretically necessary property. [158][159]
2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks. [190]
2025-01-01: Urbani delivers a Stanford Mathematics department seminar on generalization in two-layer neural networks, confirming the Montanari-Urbani result circulates in pure-math venues. [14]
2025-01-01: ICLR 2025 'When Memorization Hurts Generalization' argues memorization can actively damage generalization — the strongest anti-Feldman position yet in a peer-reviewed venue. [67][87][88][89]
2025-01-01: NeurIPS 2025 Oral 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' (Montanari and Urbani) separates generalization and overfitting dynamics formally; oral status confirmed. [50][51][52][53][54][55][56][57][58][59][60][61]
2025-01-01: NeurIPS 2025 Best Paper 'Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training' confirmed via official NeurIPS blog — most institutionally recognized result in the training-dynamics cluster. [98][99][100][101][102][103][104][105][106][107][108][109][110][97]
2025-01-01: ICML 2025 poster 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits conditions under which interpolating classifiers generalize. [173]
2025-01-01: ICLR 2025 poster 'Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data' examines how specific pretraining data drives capability versus memorization in LLMs. [143][144][145][147][142]
2025-01-01: 'Necessary Memorization in Overparameterized Learning under Long-Tailed Mixture Models' provides formal theoretical support for Feldman's necessity claim and introduces differential privacy leakage as a consequence. [63]
2025-01-01: NeurIPS 2025 poster 'Understanding the Evolution of the Neural Tangent Kernel' tracks NTK changes during training, engaging the lazy-training versus feature-learning distinction. [37]
2025-01-01: 'Reactivation: Empirical NTK Dynamics Under Task Shifts' (ETH Zurich) tracks NTK dynamics as tasks change, connecting NTK evolution to continual learning. [151][191]
2025-01-01: 'Generative Modeling of Weights: Generalization or Memorization?' opens a new subdomain asking whether generative models of weight distributions generalize to unseen architectures or merely memorize training-set configurations. [170][192][193][194][195][196][197]
2025-01-01: Workshop on Mechanistic Interpretability confirmed in NeurIPS 2025 virtual infrastructure, marking the field's presence at NeurIPS alongside the ICML 2026 workshop. [138]
2025-01-01: NeurIPS 2025 poster 'A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression' connects NTK alignment dynamics to phase transitions, bridging NTK evolution and grokking/phase-transition research. [148]
2025-01-01: 'Actionable Interpretability' workshop at ICML 2025 adds a second distinct interpretability-focused workshop track. [141]
2025-01-01: ICML 2025 poster 'Adaptive kernel predictors from feature-learning infinite limits of neural networks' suggests NTK infinite-width limits can accommodate feature learning, potentially bridging lazy-training and feature-learning regimes. [149]
2025-02-01: 'Pruning as a Defense: Reducing Memorization in Large Language Models' demonstrates model pruning as a practical defense against LLM memorization. [156][157]
2025-02-19: Urbani delivers invited talk at Simons Institute (Berkeley) on generalization and overfitting in two-layer networks. [15]
2025-07-01: 'Grokking Beyond the Euclidean Norm of Model Parameters' (ICML 2025) challenges L2-norm-based accounts of the grokking phase transition. [155]
2025-07-01: 'Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability' (arxiv 2507.12837) extends NTK dynamics analysis to the EoS regime; OpenReview submission confirmed. [5][6][7]
2025-07-01: ICML 2025 poster 'Understanding Sharpness Dynamics in NN Training with a Minimalist Example' examines how dataset difficulty, depth, and stochasticity shape sharpness dynamics. [9]
2025-10-01: 'Memorizing Long-tail Data Can Help Generalization Through Composition' proposes a compositional rather than coverage-based mechanism for how tail memorization aids generalization. [64][79][198]
2025-11-01: 'Reason to Rote: Rethinking Memorization in Reasoning' (EMNLP 2025) extends the memorization debate to reasoning tasks, asking whether apparent reasoning capabilities rely on memorized patterns rather than compositional generalization. [16]
2025-12-01: Marc Mézard receives a NeurIPS 2025 Award for foundational contributions to statistical mechanics of disordered systems — making NeurIPS 2025 the conference at which three distinct recognitions (career award, Oral, Best Paper) converged on the statistical-physics approach to neural network theory. [10]
2025-12-01: NeurIPS 2025 poster 'Convergence Rates for Gradient Descent on the Edge of Stability for Overparametrised Least Squares' provides theoretical convergence guarantees for GD in the EoS regime. [8]
2025-12-01: 'Separation of timescales controls feature learning and overfitting in large neural networks' — TechXplore coverage and Harvard Math department seminar — argues feature learning precedes overfitting in large overparameterized networks, the first formal mechanism candidate to unify dynamical decoupling, grokking, and LLM memorization dynamics. [11][12]
2026-01-01: MIT Technology Review names mechanistic interpretability one of its 10 Breakthrough Technologies of 2026, institutionally recognizing circuit-level understanding as a viable alternative to statistical generalization theory. [122][126][127][199]
2026-01-01: ICML 2026 mechanistic interpretability workshop announced by Neel Nanda with confirmed infrastructure across Buttondown, OpenReview, ICML Blog, and a GitHub 'Open problems' status report. [128][129][131][132][133][134][200][201][135][202][203][136][137]
2026-02-01: 'Beyond Explainable AI (XAI): An Overdue Paradigm Shift' (arxiv 2602.24176) argues the XAI paradigm has exhausted itself and a post-XAI framework is needed, introducing an applied-accountability pressure distinct from statistical theory and circuit-level interpretability. [164][161][162]
2026-03-01: MIT News covers research on improving AI models' ability to explain their predictions, extending science-communication investment in interpretability that the statistical generalization bounds program has not received. [130]
2026-04-01: 'Grokking as Dimensional Phase Transition in Neural Networks' (arxiv 2604.04655) frames the memorization-to-generalization transition as a phase transition in learned representation dimensionality. [169]
2026-04-01: 'Training Data Pruning Improves Memorization of Facts' (arxiv 2604.08519) shows selective data reduction can improve fact memorization, suggesting data redundancy suppresses rather than reinforces memorization. [80][182][204][205][206]
2026-04-01: 'Generalization at the Edge of Stability' (arxiv 2604.19740) directly examines generalization behavior in the EoS training regime — completing the EoS cluster's transition from dynamics description to generalization outcomes. [3][4][1][2]
2026-04-01: ICML 2026 Mechanistic Interpretability workshop enters active reviewer recruitment, confirming transition to operationally active status. [20]
2026-04-24: Jamie Simon and Daniel Kunin appear on Imbue's podcast arguing a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [33][38]
2026-04-25: LawrenceC (Lawrence Chan) publishes a critical review of the learning mechanics manifesto on the Alignment Forum, welcoming its ambition but doubting it will deliver a comprehensive theory. [27][32]
2026-04-25: Andriy Burkov (author of 'The Hundred-Page Machine Learning Book') amplifies the learning mechanics preprint on LinkedIn, extending its reach to a large practitioner community. [21]
2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory' explaining the historical significance of Zhang et al. 2016. [26][29]
2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory' on Nagarajan and Kolter 2019; crossposted to LessWrong. [25][28][30]
2026-04-28: 'There Will Be a Scientific Theory of Deep Learning' (arxiv 2604.21691) appears as the first formal academic-paper-level optimistic response, its title directly inverting LawrenceC's framing. ResearchGate indexing confirmed. [42][44][39]
2026-04-30: At least three distinct OpenReview submissions titled 'Is Memorization Actually Necessary for Generalization?' appear, representing independent formal challenges to Feldman's affirmative claim. [70][65][66][71][72]
2026-05-01: LessWrong 'Quick Paper Review' and Reddit r/MachineLearning thread engage the 'There Will Be a Scientific Theory' preprint in the communities where LawrenceC's pessimism originated. [40][41]

Perspectives

LawrenceC (Alignment Forum / LessWrong) — confirmed as Lawrence Chan

Classical deep learning theory was irreparably broken by Zhang et al. 2016 and Nagarajan-Kolter 2019. The learning mechanics replacement is a promising manifesto but has produced little practical fruit beyond μP and does not aim to explain specific learned algorithms.

Evolution: Consistent across all posts. The μP industry adoption at Cerebras and Microsoft[17][18][19] partially addresses his 'only μP' critique, but the question remains whether industry adoption of a heuristic validates the theory that motivated it.

[25][26][27][28][29][30][31][32]

Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)

A scientific theory of deep learning is achievable via average-case training dynamics and aggregate statistics. Promoted via Imbue podcast, YouTube, and Twitter. Now amplified to practitioners via Andriy Burkov's LinkedIn post.

Evolution: Consistent; no direct public response to LawrenceC's critique. Burkov amplification[21] extends audience to practitioners. μP practitioner guides[17][18] provide indirect external validation.

[33][34][35][36][37][38][21]

'There Will Be a Scientific Theory of Deep Learning' authors (April 2026 preprint)

A scientific theory of deep learning is achievable — the title directly inverts LawrenceC's pessimistic framing. Formally indexed via ResearchGate and actively engaged on LessWrong and Reddit.

Evolution: Consistent from prior cycle; ResearchGate[39] and community engagement[40][41] confirmed. No formal critical response has emerged.

[42][43][44][45][46][47][48][40][41][39]

Independent Medium commentary ('second formation' framing)

Learning mechanics represents a genuine paradigm transition — a 'second formation' analogous to statistical mechanics succeeding classical mechanics — not merely an incremental research program.

Evolution: Consistent from prior cycle. More optimistic than Simon and Kunin's own public claims.

[49]

Montanari and Urbani (dynamical decoupling)

In the large two-layer network regime, generalization and overfitting dynamics are formally separable — a result that, if it extends to deeper networks, provides an algorithm-dependent structural account of why networks generalize despite interpolating training data.

Evolution: Consistent. NeurIPS 2025 Oral confirmed. The separation-of-timescales result[11][12] provides an independent formal mechanism that converges on the same phenomenon from a different approach, strengthening the decoupling program.

[50][51][52][53][54][55][56][15][57][58][59][60][61][14]

Vitaly Feldman (memorization is necessary)

Memorization of tail examples is causally necessary for learning from long-tailed distributions — reframing Zhang et al.'s result as a feature, not a bug.

Evolution: Feldman's homepage is now actively surveyed[62]. Under pressure from multiple directions: formalization under long-tailed mixture models[63], compositional mechanism refinement[64], three OpenReview challenges[65][66], 'When Memorization Hurts' opposition[67], applied LLM mitigation work treating memorization as undesirable[68], and now 'Reason to Rote'[16] extending the debate to reasoning.

[69][70][65][66][67][71][72][73][74][75][76][77][78][63][64][79][80][81][82][83][84][85][86][68][62][16]

ICLR 2025 'When Memorization Hurts Generalization' authors

Memorization is not merely unnecessary for generalization but can actively damage it — the strongest anti-Feldman position in a peer-reviewed venue.

Evolution: Consistent from prior cycle.

[67][87][88][89]

Negrea et al. (defense of uniform convergence via derandomization)

Uniform convergence can be partially recovered through derandomization applied to interpolating classifiers, suggesting Nagarajan-Kolter is not a blanket closure.

Evolution: Consistent. Whether their argument falls within the class foreclosed by Yang et al.'s exact gap remains unresolved.

[90][91][92]

Yang et al. 2021 (exact gap, uniform convergence)

The failure of uniform convergence in random feature models can be characterized exactly: an algebraic gap separates even the tightest possible uniform convergence bound from the true generalization error, so no refinement of the bound can close it.

Evolution: Consistent from prior cycle.

[93][94]

Wu et al. ICML 2023 (implicit regularization of dynamical stability, SGD)

SGD's dynamical stability provides an implicit regularizer suppressing memorization — a mechanistic account connecting training-dynamics arguments to the diffusion models result and dynamical decoupling program.

Evolution: Consistent. Now sits as the lowest tier of a three-tier training-dynamics cluster (ICML 2023 < NeurIPS 2025 Oral < NeurIPS 2025 Best Paper) that collectively holds more top-venue recognition than the generalization bounds program received in 2023-2025.

[95][96]

NeurIPS 2025 diffusion model memorization researchers (Best Paper)

Implicit dynamical regularization during training prevents memorization in diffusion models — a training-dynamics explanation that extends across the architectural boundary from SGD stability to score-based generative models.

Evolution: Consistent. Confirmed as NeurIPS 2025 Best Paper via official blog[97]. The EoS generalization paper[1] raises whether dynamical regularization accounts hold in the high-sharpness EoS regime.

[98][99][100][101][102][103][104][105][106][107][108][109][110][97]

Marc Mézard / Statistical physics foundational researchers

The cavity method and replica framework developed for spin glasses provide the exact analytical tools needed to study generalization and overfitting in neural networks at finite and infinite width.

Evolution: New recognition this cycle: Mézard's NeurIPS 2025 Award[10] confirms the ML community's formal acknowledgment of this foundational lineage — the third distinct NeurIPS 2025 recognition for the statistical-physics approach alongside the Montanari-Urbani Oral and diffusion model Best Paper.

[10]

Separation-of-timescales researchers (Harvard Math / TechXplore, December 2025)

Feature learning and overfitting operate on different timescales in large overparameterized networks, with feature learning preceding overfitting — a formal mechanism for dynamical decoupling that may generalize across architectures.

Evolution: New voice this cycle. Provides the first mechanism candidate to unify the Montanari-Urbani decoupling result, grokking's sequential structure, and Tirumala et al.'s LLM memorization dynamics. Harvard Math venue placement parallels the Montanari-Urbani Simons Institute and Stanford Math appearances.

[11][12]

PAC-Bayes compression bounds researchers (NeurIPS 2022)

PAC-Bayes bounds can be tight enough to actually explain generalization — the vacuousness of prior bounds was an artifact of loose construction, not a fundamental limit.

Evolution: Consistent; no direct engagement with Yang et al.'s exact gap result yet apparent.

[111][112][113]

Algorithmic stability / SGD stability researchers

Generalization in deep learning can be explained through the stability properties of SGD — an algorithm- and data-dependent approach that directly addresses the Nagarajan-Kolter critique.

Evolution: Consistent; Wu et al. ICML 2023 bridges this cluster to the implicit regularization and diffusion memorization strands.

[114][115][116][117][118][119][120][121][95]

Mechanistic interpretability community (MIT 2026 Breakthrough, Neel Nanda, multi-conference)

Circuit-level mechanistic analysis of what algorithms individual networks implement is a viable and now institutionally recognized alternative to statistical generalization theory.

Evolution: Previously anchored by MIT Breakthrough designation and ICML 2026 workshop announcement. This cycle: ICML 2026 workshop moves into active reviewer recruitment[20], confirming operational transition. The field now has self-sustaining multi-conference workshop infrastructure across at least three consecutive conference slots (ICML 2025, NeurIPS 2025, ICML 2026).

[122][123][124][125][126][127][128][129][130][131][132][133][134][135][136][137][138][139][140][141][20]

ICLR 2025 LLM memorization researchers

Language models' generalization capabilities can be traced back to specific pretraining data, implying a causal data-memorization link that extends Feldman's long-tail thesis to the LLM regime.

Evolution: Consistent. arxiv page indexed[142].

[143][144][145][146][147][142]

NTK evolution researchers

The NTK is not static during training; understanding its evolution connects the classical NTK linearization program to the feature-learning regime that learning mechanics treats as operative in practice.

Evolution: Previously multi-directional (EoS[5], phase transitions[148], feature-learning infinite limits[149]). This cycle: OpenReview submission confirmed for the EoS evolution paper[6][7]; 'Generalization at the Edge of Stability' (2604.19740)[1][2] extends the cluster from dynamics description to generalization outcomes, completing the program's turn toward the core generalization question.

[37][150][151][152][153][154][149][5][148][3][6][4][7][1][2][8][9]

ICML 2025 grokking norm-critique researchers

The memorization-to-generalization transition in grokking is not fully explained by Euclidean weight norm dynamics — existing norm-based accounts are incomplete.

Evolution: Consistent. The NTK alignment/phase-transition paper[148] now provides a bridge between the NTK and phase-transition programs, supplying an alternative that the norm-critique paper, in challenging norm-based accounts, did not itself offer.

[155]

Applied LLM memorization mitigation researchers

Memorization in deployed LLMs is an 'undesirable' engineering problem to suppress through pruning, training interventions, and data curation — implicitly treating it as harmful, opposing Feldman's necessity thesis at the applied level.

Evolution: Consistent. This cycle: a Springer paper combining pruning with differential privacy[22] creates a new bridge between this engineering cluster and the formal memorization-privacy nexus of item 3189. The 'Reason to Rote' paper[16] at EMNLP extends the cluster's framing questions to reasoning capabilities.

[156][13][157][158][159][68][160][22]

Beyond XAI paradigm-shift researchers (arxiv 2602.24176) and XAI formalization cluster

The XAI paradigm has exhausted itself and a post-XAI or formalized framework is needed — arguing from the applied accountability side that the current interpretability landscape requires structural renewal.

Evolution: Previously a single preprint. This cycle: ResearchGate indexing confirmed[161][162]; a Nature npj AI paper independently argues XAI needs formalization[23]; a Springer paper advocates inherently interpretable models as a contrasting paradigm[24]; a practitioner Medium post surveys the state of interpretability in 2026[163]. The cluster is growing but remains institutionally disconnected from circuit-level mechanistic interpretability and statistical generalization theory.

[164][161][162][165][163][23][24][166]

'Reason to Rote' / Reasoning-memorization researchers (EMNLP 2025)

Apparent reasoning capabilities in language models may rely on memorized reasoning patterns rather than compositional generalization — extending the Feldman necessity debate into the reasoning performance domain.

Evolution: New voice this cycle. Creates a direct link between the formal memorization-necessity debate and the applied reasoning literature that prior work had not examined.

[16]

Tensions

Can learning mechanics constitute a comprehensive theory of deep learning, or is it structurally limited to explaining some aspects while leaving others outside its scope? The April 2026 preprint 'There Will Be a Scientific Theory of Deep Learning' argues the question is not closed. The NeurIPS 2025 Best Paper and the separation-of-timescales result provide indirect empirical warrant for the dynamics-first framing, while μP's industry adoption at Cerebras and Microsoft partially addresses the 'no practical output' critique — but whether industry adoption of a heuristic validates the underlying theory remains contested. [27][33][34][35][36][49][42][40][41][37][150][151][167][5][148][21][17][18][19]
Is memorization causally necessary, causally harmful, or merely correlated with generalization — and does the answer change across epochs, architectures, task settings, and domains? Feldman's necessity claim is being formalized under long-tailed mixture models with a compositional mechanism, while 'When Memorization Hurts' claims the opposite, applied LLM mitigation work frames memorization as 'undesirable,' and now 'Reason to Rote' asks whether reasoning itself is memorization. These positions may not be contradictory if memorization's effects depend on training epoch, regime, task structure, and capability type — but no synthesis has been attempted. [69][70][65][66][67][71][72][87][88][89][63][64][79][168][169][80][81][170][156][13][157][158][159][68][16]
Yang et al. 2021 establish an exact algebraic gap between generalization error and uniform convergence in random feature models. Negrea et al.'s ICML 2020 'In Defense of Uniform Convergence' argues the framework can be salvaged through derandomization. Do derandomized uniform convergence arguments fall within the class foreclosed by the exact gap result, or do they constitute a genuine escape hatch? [25][93][94][171][172][90][91][92]
The NeurIPS 2025 dynamical decoupling result establishes generalization and overfitting are separable in large two-layer networks. The separation-of-timescales result provides a formal mechanism — feature learning precedes overfitting on a faster timescale. Does this timescale mechanism persist in deeper networks and transformer architectures, and if so, does it constitute the unified dynamical account that the grokking, LLM memorization dynamics, and decoupling results have each described from different angles without synthesis? [50][51][53][56][15][57][58][59][60][14][13][11][12]
If implicit dynamical regularization explains why diffusion models don't memorize (NeurIPS 2025 Best Paper) and Wu et al.'s SGD dynamical stability provides a parallel mechanism, is there a unified dynamical account of memorization suppression across architectures? The 'Generalization at the Edge of Stability' paper[1][2] now raises the question of whether the same dynamical regularization account holds in the EoS regime, where the sharpness (the largest eigenvalue of the loss Hessian) hovers near the stability threshold 2/η set by the learning rate η, a condition that may change how dynamical regularization operates. [98][99][100][101][106][107][95][146][143][173][151][5][3][1][2][8]
PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while Yang et al.'s exact gap result says uniform convergence-style arguments are provably incapable of capturing the correct quantity. Are PAC-Bayes compression bounds a genuine escape hatch from the Nagarajan-Kolter impossibility, or do they fall within the class the exact gap result forecloses? [111][112][174][113][93][94]
Benign overfitting results show interpolating classifiers can generalize under certain geometric conditions. ICML 2025's 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits these conditions — do they hold broadly enough to constitute a useful theory, or do they require fine-tuned assumptions that fail in realistic multi-layer, non-linear settings? [175][176][177][178][173][179][180][181]
Is mechanistic interpretability a practical alternative to statistical generalization theory for understanding deep learning, or does circuit-level analysis simply answer a different question? The field now has multi-conference workshop infrastructure and a named advocate spanning safety, alignment, and academic ML — but institutional success is not theoretical displacement. The 'Beyond XAI' paradigm-shift argument[164] and the XAI formalization cluster[23][24] introduce additional pressure from the applied accountability side, further fragmenting the interpretability landscape without bridging to statistical theory. [122][123][124][125][27][33][128][129][126][133][135][136][137][138][141][164][23][24][20]
The grokking sub-debate now has at least four competing accounts — norm-based, dimensional phase transition, effective theory of representation learning, and compositional mechanism — that have not been formally reconciled. The NTK alignment/phase-transition paper[148] provides an external bridge between NTK dynamics and phase transitions that the grokking literature has not yet incorporated. The separation-of-timescales result[11][12] offers a fifth candidate mechanism. Is grokking a single phenomenon admitting a unified theory, or a family of distinct delayed-generalization transitions sharing surface phenomenology? [155][169][168][64][79][148][11][12]
The NTK evolution program now spans edge of stability[5][6], phase transitions[148], adaptive kernel predictors from feature-learning limits[149], empirical task-shift dynamics[151], and — new this cycle — a direct generalization question in the EoS regime[1][2] alongside convergence-rates and sharpness-dynamics posters[8][9]. Does this cluster converge toward the feature-learning regime that learning mechanics treats as operative in practice, or does it remain a set of disconnected extensions of the lazy-training regime? [37][150][151][154][149][5][148][3][6][1][2][8][9]
Pruning reduces LLM memorization as an engineering defense[156], training data pruning counterintuitively improves fact memorization[80], and a Springer paper combining pruning with differential privacy[22] bridges the engineering mitigation cluster to the formal memorization-privacy nexus introduced in item 3189. Do these pruning-based results share a common theoretical account, or do they represent distinct mechanisms that happen to involve data or weight reduction? [156][157][80][182][22][63]

Status: active and growing

Sources

[1] Generalization at the Edge of Stability — reactive:deep-learning-theory-limits
[2] [2604.19740] Generalization at the Edge of Stability - arXiv — reactive:deep-learning-theory-limits
[3] Generalization at the Edge of Stability (Apr 2026) - YouTube — reactive:deep-learning-theory-limits
[4] Generalization at the Edge of Stability — reactive:deep-learning-theory-limits
[5] [2507.12837] Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability — reactive:deep-learning-theory-limits
[6] Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability | OpenReview — reactive:deep-learning-theory-limits
[7] [PDF] Understanding the Evolution of the Neural Tangent Kernel at ... — reactive:deep-learning-theory-limits
[8] NeurIPS Poster Convergence Rates for Gradient Descent on the Edge of Stability for Overparametrised Least Squares — reactive:deep-learning-theory-limits
[9] ICML Poster Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More — reactive:deep-learning-theory-limits
[10] Marc Mézard Receives NeurIPS 2025 Award | Bocconi University — reactive:deep-learning-theory-limits
[11] Overparameterized neural networks: Feature learning precedes ... — reactive:deep-learning-theory-limits
[12] Separation of timescales controls feature learning and overfitting in large neural networks - Harvard Math — reactive:deep-learning-theory-limits
[13] [PDF] Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models | Semantic Scholar — reactive:deep-learning-theory-limits
[14] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
[15] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
[16] [PDF] Reason to Rote: Rethinking Memorization in Reasoning — reactive:deep-learning-theory-limits
[17] The Practitioner’s Guide to the Maximal Update Parameterization - Cerebras — reactive:deep-learning-theory-limits
[18] microsoft/mup: maximal update parametrization (µP) - GitHub — reactive:deep-learning-theory-limits
[19] µTransfer: A technique for hyperparameter tuning of enormous ... — reactive:deep-learning-theory-limits
[20] Mech Interp Workshop Reviewer Expression of Interest (ICML 2026) — reactive:deep-learning-theory-limits
[21] Andriy Burkov's Post - LinkedIn — reactive:deep-learning-theory-limits
[22] Accelerating Language Models with Pruning and Differentially Private ... — reactive:deep-learning-theory-limits
[23] Explainable AI needs formalization | npj Artificial Intelligence — reactive:deep-learning-theory-limits
[24] Inherently Interpretable Machine Learning: A Contrasting Paradigm to Post-hoc Explainable AI | Business & Information Systems Engineering | Springer Nature Link — reactive:deep-learning-theory-limits
[25] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
[26] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
[27] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
[28] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
[29] The paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
[30] The other paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
[31] Lawrence Chan — reactive:deep-learning-theory-limits
[32] Lawrence Chan - Google Scholar — reactive:deep-learning-theory-limits
[33] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
[34] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
[35] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
[36] Daniel Kunin — reactive:deep-learning-theory-limits
[37] Understanding the Evolution of the Neural Tangent Kernel at the ... — reactive:deep-learning-theory-limits
[38] There Will Be a Scientific Theory of Deep Learning - Imbue — reactive:deep-learning-theory-limits
[39] (PDF) There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
[40] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — LessWrong — reactive:deep-learning-theory-limits
[41] There Will Be a Scientific Theory of Deep Learning [R] - Reddit — reactive:deep-learning-theory-limits
[42] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
[43] There Will Be a Scientific Theory of Deep Learning | Cool Papers — reactive:deep-learning-theory-limits
[44] [2604.21691] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
[45] There Will Be a Scientific Theory of Deep Learning | alphaXiv — reactive:deep-learning-theory-limits
[46] There Will Be a Scientific Theory of Deep Learning | Takara TLDR — reactive:deep-learning-theory-limits
[47] There Will Be a Scientific Theory of Deep Learning - YouTube — reactive:deep-learning-theory-limits
[48] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
[49] Learning Mechanics and the Second Formation of Deep Learning ... — reactive:deep-learning-theory-limits
[50] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[51] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[52] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[53] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks | OpenReview — reactive:deep-learning-theory-limits
[54] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[55] [PDF] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[56] Generalization and overfitting in two-layer neural networks - YouTube — reactive:deep-learning-theory-limits
[57] NeurIPS Oral Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[58] A Dynamical Theory of Overfitting and Generalization in Large Two-Layer Networks — reactive:deep-learning-theory-limits
[59] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
[60] [PDF] Dynamical decoupling of generalization and overfitting in large two ... — reactive:deep-learning-theory-limits
[61] Dynamical Decoupling of Generalization and Overfitting in Large... — reactive:deep-learning-theory-limits
[62] Vitaly Feldman's personal homepage — reactive:deep-learning-theory-limits
[63] (PDF) Necessary Memorization in Overparameterized Learning ... — reactive:deep-learning-theory-limits
[64] Memorizing Long-tail Data Can Help Generalization Through ... — reactive:deep-learning-theory-limits
[65] Is Memorization Actually Necessary for Generalization - OpenReview — reactive:deep-learning-theory-limits
[66] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
[67] [PDF] When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
[68] Undesirable Memorization in Large Language Models: A Survey — reactive:deep-learning-theory-limits
[69] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
[70] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
[71] [PDF] Is Memorization Actually Necessary for Generalization? — reactive:deep-learning-theory-limits
[72] [PDF] Is Memorization Actually Necessary for Generalization? — reactive:deep-learning-theory-limits
[73] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
[74] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
[75] [PDF] Chasing the Long Tail: What Neural Networks Memorize and Why — reactive:deep-learning-theory-limits
[76] What Neural Networks Memorize and Why - Chiyuan Zhang — reactive:deep-learning-theory-limits
[77] What Neural Networks Memorize and Why: Discovering the Long ... — reactive:deep-learning-theory-limits
[78] Does Learning Require Memorization? A Short Tale About A Long Tail — reactive:deep-learning-theory-limits
[79] Memorizing Long-tail Data Can Help Generalization Through Composition — reactive:deep-learning-theory-limits
[80] [PDF] Training Data Pruning Improves Memorization of Facts - arXiv — reactive:deep-learning-theory-limits
[81] What is the role of memorization in Continual Learning? | OpenReview — reactive:deep-learning-theory-limits
[82] Vitaly Feldman - Google Scholar — reactive:deep-learning-theory-limits
[83] What Neural Networks Memorize and Why (Vitaly Feldman) - NeurIPS — reactive:deep-learning-theory-limits
[84] Vitaly Feldman's personal homepage — reactive:deep-learning-theory-limits
[85] Vitaly Feldman - Google Scholar — reactive:deep-learning-theory-limits
[86] [PDF] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
[87] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
[88] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
[89] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
[90] [PDF] In Defense of Uniform Convergence: Generalization via ... — reactive:deep-learning-theory-limits
[91] [1912.04265] In Defense of Uniform Convergence - arXiv — reactive:deep-learning-theory-limits
[92] In Defense of Uniform Convergence: Generalization via Derandomization with an Application to Interpolating Predictors — reactive:deep-learning-theory-limits
[93] [PDF] Exact Gap between Generalization Error and Uniform Convergence ... — reactive:deep-learning-theory-limits
[94] [2103.04554] Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models — reactive:deep-learning-theory-limits
[95] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
[96] [Quick Review] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
[97] Announcing the NeurIPS 2025 Best Paper Awards – NeurIPS Blog — reactive:deep-learning-theory-limits
[98] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
[99] NeurIPS Poster Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
[100] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
[101] Why diffusion models in generative AI don’t memorize: The role of implicit dynamical regularization in training | LPENS — reactive:deep-learning-theory-limits
[102] The Role of Implicit Dynamical Regularization in Training - YouTube — reactive:deep-learning-theory-limits
[103] [NeurIPS 2025] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
[104] Why Diffusion Models Generalize Instead of Just Memorizing - Medium — reactive:deep-learning-theory-limits
[105] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
[106] [2505.17638] Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
[107] Best paper NeurIPS 2025: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in… | Charles H. Martin, PhD — reactive:deep-learning-theory-limits
[108] (PDF) Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
[109] [PDF] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
[110] [PDF] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
[111] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
[112] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
[113] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
[114] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
[115] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
[116] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
[117] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
[118] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
[119] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
[120] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
[121] Stability (learning theory) — reactive:deep-learning-theory-limits
[122] Mechanistic Interpretability Named MIT's 2026 Breakthrough for ... — reactive:deep-learning-theory-limits
[123] Bridging the Black Box: A Survey on Mechanistic Interpretability in AI — reactive:deep-learning-theory-limits
[124] Understanding Mechanistic Interpretability in AI Models - IntuitionLabs — reactive:deep-learning-theory-limits
[125] AI Safety, Alignment, and Interpretability in 2026 | Zylos Research — reactive:deep-learning-theory-limits
[126] Mechanistic interpretability: 10 Breakthrough Technologies 2026 | MIT Technology Review — reactive:deep-learning-theory-limits
[127] MIT Technology Review's Post - Mechanistic interpretability - LinkedIn — reactive:deep-learning-theory-limits
[128] Mechanistic Interpretability Workshop at ICML 2026 — reactive:deep-learning-theory-limits
[129] Open problems in mechanistic interpretability: 2026 status report - Gist — reactive:deep-learning-theory-limits
[130] Improving AI models' ability to explain their predictions | MIT News — reactive:deep-learning-theory-limits
[131] Mech Interp Workshop @ ICML 2026 - Buttondown — reactive:deep-learning-theory-limits
[132] Schedule — reactive:deep-learning-theory-limits
[133] Neel Nanda's Post - LinkedIn — reactive:deep-learning-theory-limits
[134] ICML 2026 Workshop Mech Interp - OpenReview — reactive:deep-learning-theory-limits
[135] Announcing the ICML 2026 Workshops and Affinity Workshops – ICML Blog — reactive:deep-learning-theory-limits
[136] Call for Papers | Mechanistic Interpretability Workshop at ICML 2026 — reactive:deep-learning-theory-limits
[137] Announcing the ICML 2026 Mechanistic Interpretability Workshop ... — reactive:deep-learning-theory-limits
[138] Workshop on Mechanistic Interpretability - NeurIPS 2026 — reactive:deep-learning-theory-limits
[139] An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers — Neel Nanda — reactive:deep-learning-theory-limits
[140] Mechanistic Interpretability — Neel Nanda — reactive:deep-learning-theory-limits
[141] Actionable Interpretability — reactive:deep-learning-theory-limits
[142] [2407.14985] Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
[143] ICLR Poster Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
[144] Generalization v.s. Memorization: Tracing Language Models'... — reactive:deep-learning-theory-limits
[145] [PDF] Generalization v.s. Memorization - arXiv — reactive:deep-learning-theory-limits
[146] For Better or for Worse, Transformers Seek Patterns for Memorization | OpenReview — reactive:deep-learning-theory-limits
[147] [PDF] Generalization v.s. Memorization: Tracing Language Models ... — reactive:deep-learning-theory-limits
[148] NeurIPS Poster A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression — reactive:deep-learning-theory-limits
[149] ICML Poster Adaptive kernel predictors from feature-learning infinite limits of neural networks — reactive:deep-learning-theory-limits
[150] Disentangling feature and lazy training in deep neural networks — reactive:deep-learning-theory-limits
[151] Reactivation: Empirical NTK Dynamics Under Task Shifts — reactive:deep-learning-theory-limits
[152] [PDF] Neural Tangent Kernel - Washington — reactive:deep-learning-theory-limits
[153] [1811.04918] Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers — reactive:deep-learning-theory-limits
[154] Internal dynamics of neural networks through the NTK lens — reactive:deep-learning-theory-limits
[155] Grokking Beyond the Euclidean Norm of Model Parameters — reactive:deep-learning-theory-limits
[156] Pruning as a Defense: Reducing Memorization in Large Language Models — reactive:deep-learning-theory-limits
[157] [2502.15796] Pruning as a Defense: Reducing Memorization in Large Language Models — reactive:deep-learning-theory-limits
[158] Mitigating Memorization in Language Models — reactive:deep-learning-theory-limits
[159] [PDF] Mitigating Memorization in Language Models — reactive:deep-learning-theory-limits
[160] Generalization potential of large language models | Neural Computing and Applications | Springer Nature Link — reactive:deep-learning-theory-limits
[161] (PDF) Beyond Explainable AI (XAI): An Overdue Paradigm Shift and ... — reactive:deep-learning-theory-limits
[162] [2602.24176] Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions — reactive:deep-learning-theory-limits
[163] Inside the AI Black Box, for Real This Time — The 2026 State of AI ... — reactive:deep-learning-theory-limits
[164] Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions — reactive:deep-learning-theory-limits
[165] A comprehensive survey on explainable artificial intelligence and ... — reactive:deep-learning-theory-limits
[166] From Explainability to Control: The 2026 Executive View of AI Interpretability and Explainability — reactive:deep-learning-theory-limits
[167] u-μP: The Unit-Scaled Maximal Update Parametrization - arXiv — reactive:deep-learning-theory-limits
[168] [PDF] Towards Understanding Grokking: An Effective Theory of ... — reactive:deep-learning-theory-limits
[169] Grokking as Dimensional Phase Transition in Neural Networks — reactive:deep-learning-theory-limits
[170] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
[171] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[172] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[173] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
[174] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
[175] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
[176] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
[177] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
[178] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
[179] NeurIPS Benign Overfitting in Out-of-Distribution Generalization of Linear Models — reactive:deep-learning-theory-limits
[180] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
[181] Mind the spikes: Benign overfitting of kernels and neural ... — reactive:deep-learning-theory-limits
[182] Prune Training Data to Maximize LLM Factual Memorization | Changecast — reactive:deep-learning-theory-limits
[183] Uniform convergence may be unable to explain generalization in ... — reactive:deep-learning-theory-limits
[184] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
[185] [1906.05271] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
[186] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
[187] Understanding Deep Learning (Still) Requires Rethinking ... — reactive:deep-learning-theory-limits
[188] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
[189] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
[190] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits
[191] Reactivation: Empirical NTK Dynamics Under Task Shifts — reactive:deep-learning-theory-limits
[192] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
[193] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
[194] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
[195] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
[196] [2506.07998] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
[197] Generative Modeling of Neural Network Weights - Emergent Mind — reactive:deep-learning-theory-limits
[198] [PDF] Memorizing Long-tail Data Can Help Generalization Through ... - arXiv — reactive:deep-learning-theory-limits
[199] Mechanistic interpretability: 10 Breakthrough Technologies 2026 | MIT Technology Review — reactive:deep-learning-theory-limits
[200] Workshop on Mechanistic Interpretability - ICML 2026 — reactive:deep-learning-theory-limits
[201] ICML 2026 Schedule — reactive:deep-learning-theory-limits
[202] 2026 Conference — reactive:deep-learning-theory-limits
[203] ICML 2026 Workshops — reactive:deep-learning-theory-limits
[204] Training Data Pruning Improves Memorization of Facts — reactive:deep-learning-theory-limits
[205] [PDF] Training Data Pruning Improves Memorization of Facts — reactive:deep-learning-theory-limits
[206] [2604.08519] Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts — reactive:deep-learning-theory-limits