The Information Machine

Deep Learning Theory Is Broken — And Maybe Unfixable

closed · v11 · 2026-05-03 · 361 items · history

Narrative

Three significant developments distinguish this cycle. 'Generalization at the Edge of Stability' (arxiv 2604.19740)[1][2], announced via YouTube talk[3] and X/Twitter[4], directly examines whether generalization behavior changes in the edge-of-stability (EoS) training regime. It fills the gap left by the NTK EoS evolution paper[5][6][7], which described NTK dynamics in EoS without addressing generalization outcomes. The EoS cluster now spans four distinct contributions: NTK EoS evolution, a NeurIPS 2025 convergence-rates poster[8], an ICML 2025 sharpness-dynamics poster[9], and the new generalization paper. Marc Mézard, whose cavity and replica methods underpin the Montanari-Urbani dynamical decoupling program, received a NeurIPS 2025 Award[10]. NeurIPS 2025 thereby became the conference at which three distinct recognitions (career award, Oral, Best Paper) converged on the statistical-physics approach to neural network theory, a clustering without parallel in the classical generalization bounds program.

'Separation of timescales controls feature learning and overfitting in large neural networks,' covered by TechXplore[11] and delivered at the Harvard Mathematics department[12], argues that feature learning precedes overfitting in large overparameterized networks — providing the first formal mechanism candidate to unify the Montanari-Urbani decoupling result, grokking's sequential memorize-then-generalize structure, and Tirumala et al.'s LLM training dynamics finding that memorization peaks early while generalization continues[13]. The Harvard Math venue placement directly parallels the Stanford Mathematics[14] and Simons Institute[15] appearances of the Montanari-Urbani result. 'Reason to Rote: Rethinking Memorization in Reasoning' (EMNLP 2025)[16] extends the memorization debate to reasoning tasks, asking whether apparent reasoning capabilities rely on memorized patterns — a new front connecting the Feldman necessity debate to reasoning performance research.

The practice cluster has consolidated. μP has acquired industry-scale practitioner documentation at Cerebras[17] and Microsoft[18][19], partially addressing Lawrence Chan's critique that learning mechanics had produced little beyond a useful heuristic. The ICML 2026 Mechanistic Interpretability workshop entered active reviewer recruitment[20], confirming its transition from announced to operational. Andriy Burkov amplified the learning mechanics preprint to a large practitioner audience[21]. A Springer paper combining pruning with differential privacy for language models[22] bridges the engineering memorization mitigation cluster to the formal memorization-privacy nexus of item 3189. The XAI transition cluster has grown with a Nature npj paper arguing XAI needs formalization[23] and a Springer paper advocating inherently interpretable models[24], though these remain institutionally disconnected from the mechanistic interpretability and statistical generalization theory programs.

Timeline

  • 2016-01-01: Zhang et al. demonstrate neural networks can memorize random labels, invalidating data-independent generalization bounds. [26]
  • 2019-01-01: Nagarajan and Kolter prove that uniform convergence is insufficient to explain gradient descent generalization in an overparameterized linear setting. [25][171][172][183][184]
  • 2019-06-01: Feldman publishes 'Does Learning Require Memorization?' arguing memorization of tail examples is causally necessary for learning from long-tailed distributions. [69][74][185][75][76][77][78][86][186]
  • 2020-07-01: Negrea et al. 'In Defense of Uniform Convergence' at ICML 2020 argues uniform convergence can be partially recovered through derandomization applied to interpolating classifiers. [90][91][92]
  • 2020-11-01: 'Disentangling Feature and Lazy Training in Deep Neural Networks' provides a formal framework for distinguishing NTK (lazy) regime from feature learning regime. [150]
  • 2021-01-01: Yang et al. establish an exact algebraic gap between generalization error and the tightest possible uniform convergence bound in random feature models. [93][94]
  • 2021-01-01: ACM Communications publishes the canonical journal version of Zhang et al.'s rethinking-generalization result. [187]
  • 2022-01-01: NeurIPS 2022 paper claims PAC-Bayes compression bounds can be tight enough to explain generalization in neural networks. [111][112][113]
  • 2022-01-01: NeurIPS 2022 'Towards Understanding Grokking: An Effective Theory of Representation Learning' frames grokking as a consequence of representation learning dynamics. [168]
  • 2022-01-01: Tirumala et al. 'Memorization Without Overfitting' (NeurIPS 2022) finds memorization in LLMs peaks early then declines while generalization continues — temporal separation echoing grokking's sequential structure. [13]
  • 2023-07-01: Wu et al. ICML 2023 show SGD's dynamical stability provides an implicit regularizer suppressing memorization, bridging algorithmic stability and implicit regularization. [95][96]
  • 2023-11-01: 'Mind the Spikes: Benign Overfitting of Kernels and Neural Networks' (NeurIPS 2023) connects NTK spectrum spikes to benign overfitting conditions. [181]
  • 2024-01-01: NeurIPS 2024 paper on symmetries in overparameterized neural networks via mean-field view offers a structural lens on why overparameterization does not prevent generalization. [188][189]
  • 2024-07-01: u-μP (arxiv 2407.17465) refines the μP framework with unit scaling, representing continued active development of learning mechanics' main practical artifact. [167]
  • 2024-10-01: 'Mitigating Memorization in Language Models' proposes engineering techniques to reduce LLM memorization, treating it as a practical harm rather than a theoretically necessary property. [158][159]
  • 2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks. [190]
  • 2025-01-01: Urbani delivers a Stanford Mathematics department seminar on generalization in two-layer neural networks, confirming the Montanari-Urbani result circulates in pure-math venues. [14]
  • 2025-01-01: ICLR 2025 'When Memorization Hurts Generalization' argues memorization can actively damage generalization — the strongest anti-Feldman position yet in a peer-reviewed venue. [67][87][88][89]
  • 2025-01-01: NeurIPS 2025 Oral 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' (Montanari and Urbani) separates generalization and overfitting dynamics formally; oral status confirmed. [50][51][52][53][54][55][56][57][58][59][60][61]
  • 2025-01-01: NeurIPS 2025 Best Paper 'Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training' confirmed via official NeurIPS blog — most institutionally recognized result in the training-dynamics cluster. [98][99][100][101][102][103][104][105][106][107][108][109][110][97]
  • 2025-01-01: ICML 2025 poster 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits conditions under which interpolating classifiers generalize. [173]
  • 2025-01-01: ICLR 2025 poster 'Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data' examines how specific pretraining data drives capability versus memorization in LLMs. [143][144][145][147][142]
  • 2025-01-01: 'Necessary Memorization in Overparameterized Learning under Long-Tailed Mixture Models' provides formal theoretical support for Feldman's necessity claim and introduces differential privacy leakage as a consequence. [63]
  • 2025-01-01: NeurIPS 2025 poster 'Understanding the Evolution of the Neural Tangent Kernel' tracks NTK changes during training, engaging the lazy-training versus feature-learning distinction. [37]
  • 2025-01-01: 'Reactivation: Empirical NTK Dynamics Under Task Shifts' (ETH Zurich) tracks NTK dynamics as tasks change, connecting NTK evolution to continual learning. [151][191]
  • 2025-01-01: 'Generative Modeling of Weights: Generalization or Memorization?' opens a new subdomain asking whether generative models of weight distributions generalize to unseen architectures or merely memorize training-set configurations. [170][192][193][194][195][196][197]
  • 2025-01-01: Workshop on Mechanistic Interpretability confirmed in NeurIPS 2025 virtual infrastructure, marking the field's presence at NeurIPS alongside the ICML 2026 workshop. [138]
  • 2025-01-01: NeurIPS 2025 poster 'A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression' connects NTK alignment dynamics to phase transitions, bridging NTK evolution and grokking/phase-transition research. [148]
  • 2025-01-01: 'Actionable Interpretability' workshop at ICML 2025 adds a second distinct interpretability-focused workshop track. [141]
  • 2025-01-01: ICML 2025 poster 'Adaptive kernel predictors from feature-learning infinite limits of neural networks' suggests NTK infinite-width limits can accommodate feature learning, potentially bridging lazy-training and feature-learning regimes. [149]
  • 2025-02-01: 'Pruning as a Defense: Reducing Memorization in Large Language Models' demonstrates model pruning as a practical defense against LLM memorization. [156][157]
  • 2025-02-19: Urbani delivers invited talk at Simons Institute (Berkeley) on generalization and overfitting in two-layer networks. [15]
  • 2025-07-01: 'Grokking Beyond the Euclidean Norm of Model Parameters' (ICML 2025) challenges L2-norm-based accounts of the grokking phase transition. [155]
  • 2025-07-01: 'Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability' (arxiv 2507.12837) extends NTK dynamics analysis to the EoS regime; OpenReview submission confirmed. [5][6][7]
  • 2025-07-01: ICML 2025 poster 'Understanding Sharpness Dynamics in NN Training with a Minimalist Example' examines how dataset difficulty, depth, and stochasticity shape sharpness dynamics. [9]
  • 2025-10-01: 'Memorizing Long-tail Data Can Help Generalization Through Composition' proposes a compositional rather than coverage-based mechanism for how tail memorization aids generalization. [64][79][198]
  • 2025-11-01: 'Reason to Rote: Rethinking Memorization in Reasoning' (EMNLP 2025) extends the memorization debate to reasoning tasks, asking whether apparent reasoning capabilities rely on memorized patterns rather than compositional generalization. [16]
  • 2025-12-01: Marc Mézard receives a NeurIPS 2025 Award for foundational contributions to statistical mechanics of disordered systems — making NeurIPS 2025 the conference at which three distinct recognitions (career award, Oral, Best Paper) converged on the statistical-physics approach to neural network theory. [10]
  • 2025-12-01: NeurIPS 2025 poster 'Convergence Rates for Gradient Descent on the Edge of Stability for Overparametrised Least Squares' provides theoretical convergence guarantees for GD in the EoS regime. [8]
  • 2025-12-01: 'Separation of timescales controls feature learning and overfitting in large neural networks' — TechXplore coverage and Harvard Math department seminar — argues feature learning precedes overfitting in large overparameterized networks, the first formal mechanism candidate to unify dynamical decoupling, grokking, and LLM memorization dynamics. [11][12]
  • 2026-01-01: MIT names mechanistic interpretability its 2026 Breakthrough of the Year, institutionally recognizing circuit-level understanding as a viable alternative to statistical generalization theory. [122][126][127][199]
  • 2026-01-01: ICML 2026 mechanistic interpretability workshop announced by Neel Nanda with confirmed infrastructure across Buttondown, OpenReview, ICML Blog, and a GitHub 'Open problems' status report. [128][129][131][132][133][134][200][201][135][202][203][136][137]
  • 2026-02-01: 'Beyond Explainable AI (XAI): An Overdue Paradigm Shift' (arxiv 2602.24176) argues the XAI paradigm has exhausted itself and a post-XAI framework is needed, introducing an applied-accountability pressure distinct from statistical theory and circuit-level interpretability. [164][161][162]
  • 2026-03-01: MIT News covers research on improving AI models' ability to explain their predictions, extending science-communication investment in interpretability that the statistical generalization bounds program has not received. [130]
  • 2026-04-01: 'Grokking as Dimensional Phase Transition in Neural Networks' (arxiv 2604.04655) frames the memorization-to-generalization transition as a phase transition in learned representation dimensionality. [169]
  • 2026-04-01: 'Training Data Pruning Improves Memorization of Facts' (arxiv 2604.08519) shows selective data reduction can improve fact memorization, suggesting data redundancy suppresses rather than reinforces memorization. [80][182][204][205][206]
  • 2026-04-01: 'Generalization at the Edge of Stability' (arxiv 2604.19740) directly examines generalization behavior in the EoS training regime — completing the EoS cluster's transition from dynamics description to generalization outcomes. [3][4][1][2]
  • 2026-04-01: ICML 2026 Mechanistic Interpretability workshop enters active reviewer recruitment, confirming transition to operationally active status. [20]
  • 2026-04-24: Jamie Simon and Daniel Kunin appear on Imbue's podcast arguing a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [33][38]
  • 2026-04-25: LawrenceC (Lawrence Chan) publishes a critical review of the learning mechanics manifesto on the Alignment Forum, welcoming its ambition but doubting it will deliver a comprehensive theory. [27][32]
  • 2026-04-25: Andriy Burkov (author of 'The Hundred-Page Machine Learning Book') amplifies the learning mechanics preprint on LinkedIn, extending its reach to a large practitioner community. [21]
  • 2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory' explaining the historical significance of Zhang et al. 2016. [26][29]
  • 2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory' on Nagarajan and Kolter 2019; crossposted to LessWrong. [25][28][30]
  • 2026-04-28: 'There Will Be a Scientific Theory of Deep Learning' (arxiv 2604.21691) appears as the first formal academic-paper-level optimistic response, its title directly inverting LawrenceC's framing. ResearchGate indexing confirmed. [42][44][39]
  • 2026-04-30: At least three distinct OpenReview submissions titled 'Is Memorization Actually Necessary for Generalization?' appear, representing independent formal challenges to Feldman's affirmative claim. [70][65][66][71][72]
  • 2026-05-01: LessWrong 'Quick Paper Review' and Reddit r/MachineLearning thread engage the 'There Will Be a Scientific Theory' preprint in the communities where LawrenceC's pessimism originated. [40][41]

Perspectives

LawrenceC (Alignment Forum / LessWrong) — confirmed as Lawrence Chan

Classical deep learning theory was irreparably broken by Zhang et al. 2016 and Nagarajan-Kolter 2019. The learning mechanics replacement is a promising manifesto but has produced little practical fruit beyond μP and does not aim to explain specific learned algorithms.

Evolution: Consistent across all posts. The μP industry adoption at Cerebras and Microsoft[17][18][19] partially addresses his 'only μP' critique, but the question remains whether industry adoption of a heuristic validates the theory that motivated it.
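
For readers who want the Zhang et al. observation in concrete form, here is a minimal Python sketch. It is an illustration under stated assumptions, not the original experiment: a perceptron on Gaussian inputs stands in for a deep network, and the sizes (20 training points, 100 dimensions) and helper names are hypothetical. With more parameters than examples the model fits coin-flip labels perfectly while test accuracy stays near chance, which is why a bound that depends only on the hypothesis class cannot certify generalization.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def make_data(n, d, rng):
    """Gaussian inputs with labels drawn by fair coin flip (no signal at all)."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [rng.choice([-1, 1]) for _ in range(n)]
    return xs, ys


def train_perceptron(xs, ys, max_epochs=1000):
    """Classic perceptron updates; stops after the first mistake-free pass."""
    w = [0.0] * len(xs[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w


def accuracy(w, xs, ys):
    """Fraction of points classified with the correct sign."""
    return sum(1 for x, y in zip(xs, ys) if y * dot(w, x) > 0) / len(xs)


rng = random.Random(0)
n_train, n_test, dim = 20, 2000, 100  # illustrative sizes, with dim > n_train

train_x, train_y = make_data(n_train, dim, rng)
test_x, test_y = make_data(n_test, dim, rng)

w = train_perceptron(train_x, train_y)
train_acc = accuracy(w, train_x, train_y)
test_acc = accuracy(w, test_x, test_y)

print(f"train accuracy on random labels: {train_acc:.2f}")
print(f"test accuracy on fresh random labels: {test_acc:.2f}")
```

Run as a script, the sketch should report perfect training accuracy and test accuracy close to one half; the exact test figure depends on the seed.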

Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)

A scientific theory of deep learning is achievable via average-case training dynamics and aggregate statistics. Promoted via Imbue podcast, YouTube, and Twitter. Now amplified to practitioners via Andriy Burkov's LinkedIn post.

Evolution: Consistent; no direct public response to LawrenceC's critique. Burkov amplification[21] extends audience to practitioners. μP practitioner guides[17][18] provide indirect external validation.

'There Will Be a Scientific Theory of Deep Learning' authors (April 2026 preprint)

A scientific theory of deep learning is achievable — the title directly inverts LawrenceC's pessimistic framing. Formally indexed via ResearchGate and actively engaged on LessWrong and Reddit.

Evolution: Consistent from prior cycle; ResearchGate[39] and community engagement[40][41] confirmed. No formal critical response has emerged.

Independent Medium commentary ('second formation' framing)

Learning mechanics represents a genuine paradigm transition — a 'second formation' analogous to statistical mechanics succeeding classical mechanics — not merely an incremental research program.

Evolution: Consistent from prior cycle. More optimistic than Simon and Kunin's own public claims.

Montanari and Urbani (dynamical decoupling)

In the large two-layer network regime, generalization and overfitting dynamics are formally separable — a result that, if it extends to deeper networks, provides an algorithm-dependent structural account of why networks generalize despite interpolating training data.

Evolution: Consistent. NeurIPS 2025 Oral confirmed. The separation-of-timescales result[11][12] provides an independent formal mechanism that converges on the same phenomenon from a different approach, strengthening the decoupling program.

Vitaly Feldman (memorization is necessary)

Memorization of tail examples is causally necessary for learning from long-tailed distributions — reframing Zhang et al.'s result as a feature, not a bug.

Evolution: Feldman's homepage is now actively surveyed[62]. Under pressure from multiple directions: formalization under long-tailed mixture models[63], compositional mechanism refinement[64], three OpenReview challenges[65][66], 'When Memorization Hurts' opposition[67], applied LLM mitigation work treating memorization as undesirable[68], and now 'Reason to Rote'[16] extending the debate to reasoning.

ICLR 2025 'When Memorization Hurts Generalization' authors

Memorization is not merely unnecessary for generalization but can actively damage it — the strongest anti-Feldman position in a peer-reviewed venue.

Evolution: Consistent from prior cycle.

Negrea et al. (defense of uniform convergence via derandomization)

Uniform convergence can be partially recovered through derandomization applied to interpolating classifiers, suggesting Nagarajan-Kolter is not a blanket closure.

Evolution: Consistent. Whether their argument falls within the class foreclosed by Yang et al.'s exact gap remains unresolved.

Yang et al. 2021 (exact gap, uniform convergence)

The failure of uniform convergence in random feature models is provably exact: an algebraic gap separates the true generalization error from even the tightest possible uniform convergence bound.

Evolution: Consistent from prior cycle.

Wu et al. ICML 2023 (implicit regularization of dynamical stability, SGD)

SGD's dynamical stability provides an implicit regularizer suppressing memorization — a mechanistic account connecting training-dynamics arguments to the diffusion models result and dynamical decoupling program.

Evolution: Consistent. Now sits as the base tier of a three-tier training-dynamics cluster (ICML 2023 < NeurIPS 2025 Oral < NeurIPS 2025 Best Paper) that collectively holds more top-venue recognition than the generalization bounds program in 2023-2025.

NeurIPS 2025 diffusion model memorization researchers (Best Paper)

Implicit dynamical regularization during training prevents memorization in diffusion models — a training-dynamics explanation that extends across the architectural boundary from SGD stability to score-based generative models.

Evolution: Consistent. Confirmed as NeurIPS 2025 Best Paper via official blog[97]. The EoS generalization paper[1] raises whether dynamical regularization accounts hold in the high-sharpness EoS regime.

Marc Mézard / Statistical physics foundational researchers

The cavity method and replica framework developed for spin glasses provide the exact analytical tools needed to study generalization and overfitting in neural networks at finite and infinite width.

Evolution: New recognition this cycle: Mézard's NeurIPS 2025 Award[10] confirms the ML community's formal acknowledgment of this foundational lineage — the third distinct NeurIPS 2025 recognition for the statistical-physics approach alongside the Montanari-Urbani Oral and diffusion model Best Paper.

Separation-of-timescales researchers (Harvard Math / TechXplore, December 2025)

Feature learning and overfitting operate on different timescales in large overparameterized networks, with feature learning preceding overfitting — a formal mechanism for dynamical decoupling that may generalize across architectures.

Evolution: New voice this cycle. Provides the first mechanism candidate to unify the Montanari-Urbani decoupling result, grokking's sequential structure, and Tirumala et al.'s LLM memorization dynamics. Harvard Math venue placement parallels the Montanari-Urbani Simons Institute and Stanford Math appearances.

PAC-Bayes compression bounds researchers (NeurIPS 2022)

PAC-Bayes bounds can be tight enough to actually explain generalization — the vacuousness of prior bounds was an artifact of loose construction, not a fundamental limit.

Evolution: Consistent; no direct engagement with Yang et al.'s exact gap result yet apparent.
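
A rough sense of what 'tight enough' means can be had from a generic McAllester-style PAC-Bayes bound (the Maurer form, relaxed via Pinsker's inequality): with probability at least 1 - delta, expected error <= train error + sqrt((KL(Q||P) + ln(2*sqrt(n)/delta)) / (2n)). The sketch below evaluates that expression. It is not necessarily the bound used in the NeurIPS 2022 paper, and the KL values, sample size, and function name are made-up illustrative inputs. A bound above 1 is vacuous for a 0-1 loss; shrinking the KL term, for instance by describing the model in fewer bits, is what brings it below 1.

```python
import math


def pac_bayes_bound(train_error, kl, n, delta=0.05):
    """McAllester-style bound on expected 0-1 error (hypothetical helper).

    train_error: empirical error of the posterior Q, in [0, 1]
    kl:          KL(Q || P) in nats between posterior and prior
    n:           number of training examples
    delta:       allowed failure probability
    """
    complexity = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return train_error + math.sqrt(complexity)


n = 50_000          # illustrative dataset size
train_error = 0.02  # illustrative training error

# Illustrative KL values only: a large KL for an uncompressed model,
# a small KL for a heavily compressed description of the same model.
loose = pac_bayes_bound(train_error, kl=500_000.0, n=n)
tight = pac_bayes_bound(train_error, kl=5_000.0, n=n)

print(f"bound with large KL: {loose:.3f} (vacuous if above 1)")
print(f"bound with small KL: {tight:.3f}")
```

The design point is that the training error is identical in both calls; only the complexity term moves the bound from vacuous to informative.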

Algorithmic stability / SGD stability researchers

Generalization in deep learning can be explained through the stability properties of SGD — an algorithm- and data-dependent approach that directly addresses the Nagarajan-Kolter critique.

Evolution: Consistent; Wu et al. ICML 2023 bridges this cluster to the implicit regularization and diffusion memorization strands.

Mechanistic interpretability community (MIT 2026 Breakthrough, Neel Nanda, multi-conference)

Circuit-level mechanistic analysis of what algorithms individual networks implement is a viable and now institutionally recognized alternative to statistical generalization theory.

Evolution: Previously anchored by MIT Breakthrough designation and ICML 2026 workshop announcement. This cycle: ICML 2026 workshop moves into active reviewer recruitment[20], confirming operational transition. The field now has self-sustaining multi-conference infrastructure across at least three consecutive conference slots (ICML 2025, NeurIPS 2025, ICML 2026).

ICLR 2025 LLM memorization researchers

Language models' generalization capabilities can be traced back to specific pretraining data, implying a causal data-memorization link that extends Feldman's long-tail thesis to the LLM regime.

Evolution: Consistent. arxiv page indexed[142].

NTK evolution researchers

The NTK is not static during training; understanding its evolution connects the classical NTK linearization program to the feature-learning regime that learning mechanics treats as operative in practice.

Evolution: Previously multi-directional (EoS[5], phase transitions[148], feature-learning infinite limits[149]). This cycle: OpenReview submission confirmed for the EoS evolution paper[6][7]; 'Generalization at the Edge of Stability' (2604.19740)[1][2] extends the cluster from dynamics description to generalization outcomes, completing the program's turn toward the core generalization question.
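
As background for the EoS terminology used throughout this entry: for gradient descent with step size eta on a quadratic with curvature (sharpness) lambda, each step multiplies the iterate by (1 - eta*lambda), so the iteration is stable only when lambda < 2/eta. The edge-of-stability literature studies training in which sharpness hovers near that threshold. The sketch below shows only the classical threshold on a one-dimensional quadratic, which is an assumption made for illustration. It does not reproduce the neural-network phenomenon or any result from the cited papers, and the function and variable names are hypothetical.

```python
def run_gd(curvature, lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x**2; returns final |x|."""
    x = x0
    for _ in range(steps):
        grad = curvature * x
        x -= lr * grad  # equivalent to x *= (1 - lr * curvature)
    return abs(x)


sharpness = 10.0             # curvature of the toy quadratic (illustrative)
threshold = 2.0 / sharpness  # largest stable step size for this curvature

below = run_gd(sharpness, lr=0.9 * threshold)  # step factor -0.8: contracts
above = run_gd(sharpness, lr=1.1 * threshold)  # step factor -1.2: diverges

print(f"lr just below 2/sharpness: |x| = {below:.2e}")
print(f"lr just above 2/sharpness: |x| = {above:.2e}")
```

Equivalently, for a fixed step size the threshold can be read as a ceiling on sharpness, which is the quantity the EoS papers track during training.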

ICML 2025 grokking norm-critique researchers

The memorization-to-generalization transition in grokking is not fully explained by Euclidean weight norm dynamics — existing norm-based accounts are incomplete.

Evolution: Consistent. The NTK alignment/phase-transition paper[148] now provides a bridge between the NTK and phase-transition programs, something the norm-critique paper did not offer when it challenged norm-based accounts.

Applied LLM memorization mitigation researchers

Memorization in deployed LLMs is an 'undesirable' engineering problem to suppress through pruning, training interventions, and data curation — implicitly treating it as harmful, opposing Feldman's necessity thesis at the applied level.

Evolution: Consistent. This cycle: a Springer paper combining pruning with differential privacy[22] creates a new bridge between this engineering cluster and the formal memorization-privacy nexus of item 3189. The 'Reason to Rote' paper[16] at EMNLP extends the cluster's framing questions to reasoning capabilities.

Beyond XAI paradigm-shift researchers (arxiv 2602.24176) and XAI formalization cluster

The XAI paradigm has exhausted itself and a post-XAI or formalized framework is needed — arguing from the applied accountability side that the current interpretability landscape requires structural renewal.

Evolution: Previously a single preprint. This cycle: ResearchGate confirmed[161][162]; a Nature npj AI paper independently argues XAI needs formalization[23]; a Springer paper advocates inherently interpretable models as a contrasting paradigm[24]; a practitioner Medium post surveys the state of interpretability in 2026[163]. The cluster is growing but remains institutionally disconnected from circuit-level mechanistic interpretability and statistical generalization theory.

'Reason to Rote' / Reasoning-memorization researchers (EMNLP 2025)

Apparent reasoning capabilities in language models may rely on memorized reasoning patterns rather than compositional generalization — extending the Feldman necessity debate into the reasoning performance domain.

Evolution: New voice this cycle. Creates a direct link between the formal memorization-necessity debate and the applied reasoning literature that prior work had not examined.

Tensions

  • Can learning mechanics constitute a comprehensive theory of deep learning, or is it structurally limited to explaining some aspects while leaving others outside its scope? The April 2026 preprint 'There Will Be a Scientific Theory of Deep Learning' argues the question is not closed. The NeurIPS 2025 Best Paper and the separation-of-timescales result provide indirect empirical warrant for the dynamics-first framing, while μP's industry adoption at Cerebras and Microsoft partially addresses the 'no practical output' critique — but whether industry adoption of a heuristic validates the underlying theory remains contested. [27][33][34][35][36][49][42][40][41][37][150][151][167][5][148][21][17][18][19]
  • Is memorization causally necessary, causally harmful, or merely correlated with generalization — and does the answer change across epochs, architectures, task settings, and domains? Feldman's necessity claim is being formalized under long-tailed mixture models with a compositional mechanism, while 'When Memorization Hurts' claims the opposite, applied LLM mitigation work frames memorization as 'undesirable,' and now 'Reason to Rote' asks whether reasoning itself is memorization. These positions may not be contradictory if memorization's effects depend on training epoch, regime, task structure, and capability type — but no synthesis has been attempted. [69][70][65][66][67][71][72][87][88][89][63][64][79][168][169][80][81][170][156][13][157][158][159][68][16]
  • Yang et al. 2021 establish an exact algebraic gap between generalization error and uniform convergence in random feature models. Negrea et al.'s ICML 2020 'In Defense of Uniform Convergence' argues the framework can be salvaged through derandomization. Do derandomized uniform convergence arguments fall within the class foreclosed by the exact gap result, or do they constitute a genuine escape hatch? [25][93][94][171][172][90][91][92]
  • The NeurIPS 2025 dynamical decoupling result establishes that generalization and overfitting are separable in large two-layer networks. The separation-of-timescales result provides a formal mechanism: feature learning runs on a faster timescale than overfitting and therefore precedes it. Does this timescale mechanism persist in deeper networks and transformer architectures, and if so, does it constitute the unified dynamical account that the grokking, LLM memorization dynamics, and decoupling results have each described from different angles without synthesis? [50][51][53][56][15][57][58][59][60][14][13][11][12]
  • If implicit dynamical regularization explains why diffusion models don't memorize (NeurIPS 2025 Best Paper) and Wu et al.'s SGD dynamical stability provides a parallel mechanism, is there a unified dynamical account of memorization suppression across architectures? The 'Generalization at the Edge of Stability' paper[1][2] now raises whether the same dynamical regularization account holds in the EoS regime, where gradient descent runs at learning rates large enough that sharpness sits near the stability threshold, a condition that may alter the dynamical regularization landscape. [98][99][100][101][106][107][95][146][143][173][151][5][3][1][2][8]
  • PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while Yang et al.'s exact gap result says uniform convergence-style arguments are provably incapable of capturing the correct quantity. Are PAC-Bayes compression bounds a genuine escape hatch from the Nagarajan-Kolter impossibility, or do they fall within the class the exact gap result forecloses? [111][112][174][113][93][94]
  • Benign overfitting results show interpolating classifiers can generalize under certain geometric conditions. ICML 2025's 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits these conditions — do they hold broadly enough to constitute a useful theory, or do they require fine-tuned assumptions that fail in realistic multi-layer, non-linear settings? [175][176][177][178][173][179][180][181]
  • Is mechanistic interpretability a practical alternative to statistical generalization theory for understanding deep learning, or does circuit-level analysis simply answer a different question? The field now has multi-conference workshop infrastructure and a named advocate spanning safety, alignment, and academic ML — but institutional success is not theoretical displacement. The 'Beyond XAI' paradigm-shift argument[164] and the XAI formalization cluster[23][24] introduce additional pressure from the applied accountability side, further fragmenting the interpretability landscape without bridging to statistical theory. [122][123][124][125][27][33][128][129][126][133][135][136][137][138][141][164][23][24][20]
  • The grokking sub-debate now has at least four competing accounts — norm-based, dimensional phase transition, effective theory of representation learning, and compositional mechanism — that have not been formally reconciled. The NTK alignment/phase-transition paper[148] provides an external bridge between NTK dynamics and phase transitions that the grokking literature has not yet incorporated. The separation-of-timescales result[11][12] offers a fifth candidate mechanism. Is grokking a single phenomenon admitting a unified theory, or a family of distinct delayed-generalization transitions sharing surface phenomenology? [155][169][168][64][79][148][11][12]
  • The NTK evolution program now spans edge of stability[5][6], phase transitions[148], adaptive kernel predictors from feature-learning limits[149], empirical task-shift dynamics[151], and — new this cycle — a direct generalization question in the EoS regime[1][2] alongside convergence-rates and sharpness-dynamics posters[8][9]. Does this cluster converge toward the feature-learning regime that learning mechanics treats as operative in practice, or does it remain a set of disconnected extensions of the lazy-training regime? [37][150][151][154][149][5][148][3][6][1][2][8][9]
  • Pruning reduces LLM memorization as an engineering defense[156], training data pruning counterintuitively improves fact memorization[80], and a Springer paper combining pruning with differential privacy[22] bridges the engineering mitigation cluster to the formal memorization-privacy nexus introduced in item 3189. Do these pruning-based results share a common theoretical account, or do they represent distinct mechanisms that happen to involve data or weight reduction? [156][157][80][182][22][63]

Status: active and growing

Sources

  1. [1] Generalization at the Edge of Stability — reactive:deep-learning-theory-limits
  2. [2] [2604.19740] Generalization at the Edge of Stability - arXiv — reactive:deep-learning-theory-limits
  3. [3] Generalization at the Edge of Stability (Apr 2026) - YouTube — reactive:deep-learning-theory-limits
  4. [4] Generalization at the Edge of Stability — reactive:deep-learning-theory-limits
  5. [5] [2507.12837] Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability — reactive:deep-learning-theory-limits
  6. [6] Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability | OpenReview — reactive:deep-learning-theory-limits
  7. [7] [PDF] Understanding the Evolution of the Neural Tangent Kernel at ... — reactive:deep-learning-theory-limits
  8. [8] NeurIPS Poster Convergence Rates for Gradient Descent on the Edge of Stability for Overparametrised Least Squares — reactive:deep-learning-theory-limits
  9. [9] ICML Poster Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More — reactive:deep-learning-theory-limits
  10. [10] Marc Mézard Receives NeurIPS 2025 Award | Bocconi University — reactive:deep-learning-theory-limits
  11. [11] Overparameterized neural networks: Feature learning precedes ... — reactive:deep-learning-theory-limits
  12. [12] Separation of timescales controls feature learning and overfitting in large neural networks - Harvard Math — reactive:deep-learning-theory-limits
  13. [13] [PDF] Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models | Semantic Scholar — reactive:deep-learning-theory-limits
  14. [14] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
  15. [15] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
  16. [16] [PDF] Reason to Rote: Rethinking Memorization in Reasoning — reactive:deep-learning-theory-limits
  17. [17] The Practitioner’s Guide to the Maximal Update Parameterization - Cerebras — reactive:deep-learning-theory-limits
  18. [18] microsoft/mup: maximal update parametrization (µP) - GitHub — reactive:deep-learning-theory-limits
  19. [19] µTransfer: A technique for hyperparameter tuning of enormous ... — reactive:deep-learning-theory-limits
  20. [20] Mech Interp Workshop Reviewer Expression of Interest (ICML 2026) — reactive:deep-learning-theory-limits
  21. [21] Andriy Burkov's Post - LinkedIn — reactive:deep-learning-theory-limits
  22. [22] Accelerating Language Models with Pruning and Differentially Private ... — reactive:deep-learning-theory-limits
  23. [23] Explainable AI needs formalization | npj Artificial Intelligence — reactive:deep-learning-theory-limits
  24. [24] Inherently Interpretable Machine Learning: A Contrasting Paradigm to Post-hoc Explainable AI | Business & Information Systems Engineering | Springer Nature Link — reactive:deep-learning-theory-limits
  25. [25] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
  26. [26] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
  27. [27] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
  28. [28] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
  29. [29] The paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
  30. [30] The other paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
  31. [31] Lawrence Chan — reactive:deep-learning-theory-limits
  32. [32] Lawrence Chan - Google Scholar — reactive:deep-learning-theory-limits
  33. [33] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
  34. [34] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
  35. [35] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
  36. [36] Daniel Kunin — reactive:deep-learning-theory-limits
  37. [37] Understanding the Evolution of the Neural Tangent Kernel at the ... — reactive:deep-learning-theory-limits
  38. [38] There Will Be a Scientific Theory of Deep Learning - Imbue — reactive:deep-learning-theory-limits
  39. [39] (PDF) There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
  40. [40] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — LessWrong — reactive:deep-learning-theory-limits
  41. [41] There Will Be a Scientific Theory of Deep Learning [R] - Reddit — reactive:deep-learning-theory-limits
  42. [42] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
  43. [43] There Will Be a Scientific Theory of Deep Learning | Cool Papers — reactive:deep-learning-theory-limits
  44. [44] [2604.21691] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
  45. [45] There Will Be a Scientific Theory of Deep Learning | alphaXiv — reactive:deep-learning-theory-limits
  46. [46] There Will Be a Scientific Theory of Deep Learning | Takara TLDR — reactive:deep-learning-theory-limits
  47. [47] There Will Be a Scientific Theory of Deep Learning - YouTube — reactive:deep-learning-theory-limits
  48. [48] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
  49. [49] Learning Mechanics and the Second Formation of Deep Learning ... — reactive:deep-learning-theory-limits
  50. [50] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
  51. [51] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  52. [52] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  53. [53] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks | OpenReview — reactive:deep-learning-theory-limits
  54. [54] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  55. [55] [PDF] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  56. [56] Generalization and overfitting in two-layer neural networks - YouTube — reactive:deep-learning-theory-limits
  57. [57] NeurIPS Oral Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
  58. [58] A Dynamical Theory of Overfitting and Generalization in Large Two-Layer Networks — reactive:deep-learning-theory-limits
  59. [59] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  60. [60] [PDF] Dynamical decoupling of generalization and overfitting in large two ... — reactive:deep-learning-theory-limits
  61. [61] Dynamical Decoupling of Generalization and Overfitting in Large... — reactive:deep-learning-theory-limits
  62. [62] Vitaly Feldman's personal homepage — reactive:deep-learning-theory-limits
  63. [63] (PDF) Necessary Memorization in Overparameterized Learning ... — reactive:deep-learning-theory-limits
  64. [64] Memorizing Long-tail Data Can Help Generalization Through ... — reactive:deep-learning-theory-limits
  65. [65] Is Memorization Actually Necessary for Generalization - OpenReview — reactive:deep-learning-theory-limits
  66. [66] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
  67. [67] [PDF] WHEN MEMORIZATION HURTS GENERALIZATION — reactive:deep-learning-theory-limits
  68. [68] Undesirable Memorization in Large Language Models: A Survey — reactive:deep-learning-theory-limits
  69. [69] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
  70. [70] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
  71. [71] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
  72. [72] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
  73. [73] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
  74. [74] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
  75. [75] [PDF] Chasing the Long Tail: What Neural Networks Memorize and Why — reactive:deep-learning-theory-limits
  76. [76] What Neural Networks Memorize and Why - Chiyuan Zhang — reactive:deep-learning-theory-limits
  77. [77] What Neural Networks Memorize and Why: Discovering the Long ... — reactive:deep-learning-theory-limits
  78. [78] Does Learning Require Memorization? A Short Tale About A Long Tail — reactive:deep-learning-theory-limits
  79. [79] Memorizing Long-tail Data Can Help Generalization Through Composition — reactive:deep-learning-theory-limits
  80. [80] [PDF] Training Data Pruning Improves Memorization of Facts - arXiv — reactive:deep-learning-theory-limits
  81. [81] What is the role of memorization in Continual Learning? | OpenReview — reactive:deep-learning-theory-limits
  82. [82] Vitaly Feldman - Google Scholar — reactive:deep-learning-theory-limits
  83. [83] What Neural Networks Memorize and Why (Vitaly Feldman) - NeurIPS — reactive:deep-learning-theory-limits
  84. [84] Vitaly Feldman's personal homepage — reactive:deep-learning-theory-limits
  85. [85] Vitaly Feldman - Google Scholar — reactive:deep-learning-theory-limits
  86. [86] [PDF] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
  87. [87] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
  88. [88] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
  89. [89] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
  90. [90] [PDF] In Defense of Uniform Convergence: Generalization via ... — reactive:deep-learning-theory-limits
  91. [91] [1912.04265] In Defense of Uniform Convergence - arXiv — reactive:deep-learning-theory-limits
  92. [92] In Defense of Uniform Convergence: Generalization via Derandomization with an Application to Interpolating Predictors — reactive:deep-learning-theory-limits
  93. [93] [PDF] Exact Gap between Generalization Error and Uniform Convergence ... — reactive:deep-learning-theory-limits
  94. [94] [2103.04554] Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models — reactive:deep-learning-theory-limits
  95. [95] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
  96. [96] [Quick Review] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
  97. [97] Announcing the NeurIPS 2025 Best Paper Awards – NeurIPS Blog — reactive:deep-learning-theory-limits
  98. [98] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
  99. [99] NeurIPS Poster Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
  100. [100] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
  101. [101] Why diffusion models in generative AI don’t memorize: The role of implicit dynamical regularization in training | LPENS — reactive:deep-learning-theory-limits
  102. [102] The Role of Implicit Dynamical Regularization in Training - YouTube — reactive:deep-learning-theory-limits
  103. [103] [NeurIPS 2025] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
  104. [104] Why Diffusion Models Generalize Instead of Just Memorizing - Medium — reactive:deep-learning-theory-limits
  105. [105] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
  106. [106] [2505.17638] Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
  107. [107] Best paper NeurIPS 2025: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in… | Charles H. Martin, PhD — reactive:deep-learning-theory-limits
  108. [108] (PDF) Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
  109. [109] [PDF] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
  110. [110] [PDF] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
  111. [111] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
  112. [112] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
  113. [113] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
  114. [114] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
  115. [115] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
  116. [116] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
  117. [117] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
  118. [118] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
  119. [119] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
  120. [120] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
  121. [121] Stability (learning theory) — reactive:deep-learning-theory-limits
  122. [122] Mechanistic Interpretability Named MIT's 2026 Breakthrough for ... — reactive:deep-learning-theory-limits
  123. [123] Bridging the Black Box: A Survey on Mechanistic Interpretability in AI — reactive:deep-learning-theory-limits
  124. [124] Understanding Mechanistic Interpretability in AI Models - IntuitionLabs — reactive:deep-learning-theory-limits
  125. [125] AI Safety, Alignment, and Interpretability in 2026 | Zylos Research — reactive:deep-learning-theory-limits
  126. [126] Mechanistic interpretability: 10 Breakthrough Technologies 2026 | MIT Technology Review — reactive:deep-learning-theory-limits
  127. [127] MIT Technology Review's Post - Mechanistic interpretability - LinkedIn — reactive:deep-learning-theory-limits
  128. [128] Mechanistic Interpretability Workshop at ICML 2026 — reactive:deep-learning-theory-limits
  129. [129] Open problems in mechanistic interpretability: 2026 status report - Gist — reactive:deep-learning-theory-limits
  130. [130] Improving AI models' ability to explain their predictions | MIT News — reactive:deep-learning-theory-limits
  131. [131] Mech Interp Workshop @ ICML 2026 - Buttondown — reactive:deep-learning-theory-limits
  132. [132] Schedule — reactive:deep-learning-theory-limits
  133. [133] Neel Nanda's Post - LinkedIn — reactive:deep-learning-theory-limits
  134. [134] ICML 2026 Workshop Mech Interp - OpenReview — reactive:deep-learning-theory-limits
  135. [135] Announcing the ICML 2026 Workshops and Affinity Workshops – ICML Blog — reactive:deep-learning-theory-limits
  136. [136] Call for Papers | Mechanistic Interpretability Workshop at ICML 2026 — reactive:deep-learning-theory-limits
  137. [137] Announcing the ICML 2026 Mechanistic Interpretability Workshop ... — reactive:deep-learning-theory-limits
  138. [138] Workshop on Mechanistic Interpretability - NeurIPS 2026 — reactive:deep-learning-theory-limits
  139. [139] An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers — Neel Nanda — reactive:deep-learning-theory-limits
  140. [140] Mechanistic Interpretability — Neel Nanda — reactive:deep-learning-theory-limits
  141. [141] Actionable Interpretability — reactive:deep-learning-theory-limits
  142. [142] [2407.14985] Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
  143. [143] ICLR Poster Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
  144. [144] Generalization v.s. Memorization: Tracing Language Models'... — reactive:deep-learning-theory-limits
  145. [145] [PDF] generalization v.s. memorization - arXiv — reactive:deep-learning-theory-limits
  146. [146] For Better or for Worse, Transformers Seek Patterns for Memorization | OpenReview — reactive:deep-learning-theory-limits
  147. [147] [PDF] generalization v.s. memorization: tracing language models ... — reactive:deep-learning-theory-limits
  148. [148] NeurIPS Poster A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression — reactive:deep-learning-theory-limits
  149. [149] ICML Poster Adaptive kernel predictors from feature-learning infinite limits of neural networks — reactive:deep-learning-theory-limits
  150. [150] Disentangling feature and lazy training in deep neural networks — reactive:deep-learning-theory-limits
  151. [151] Reactivation: Empirical NTK Dynamics Under Task Shifts — reactive:deep-learning-theory-limits
  152. [152] [PDF] Neural Tangent Kernel - Washington — reactive:deep-learning-theory-limits
  153. [153] [1811.04918] Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers — reactive:deep-learning-theory-limits
  154. [154] Internal dynamics of neural networks through the NTK lens — reactive:deep-learning-theory-limits
  155. [155] Grokking Beyond the Euclidean Norm of Model Parameters — reactive:deep-learning-theory-limits
  156. [156] Pruning as a Defense: Reducing Memorization in Large Language Models — reactive:deep-learning-theory-limits
  157. [157] [2502.15796] Pruning as a Defense: Reducing Memorization in Large Language Models — reactive:deep-learning-theory-limits
  158. [158] Mitigating Memorization in Language Models — reactive:deep-learning-theory-limits
  159. [159] [PDF] MITIGATING MEMORIZATION IN LANGUAGE MODELS — reactive:deep-learning-theory-limits
  160. [160] Generalization potential of large language models | Neural Computing and Applications | Springer Nature Link — reactive:deep-learning-theory-limits
  161. [161] (PDF) Beyond Explainable AI (XAI): An Overdue Paradigm Shift and ... — reactive:deep-learning-theory-limits
  162. [162] [2602.24176] Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions — reactive:deep-learning-theory-limits
  163. [163] Inside the AI Black Box, for Real This Time — The 2026 State of AI ... — reactive:deep-learning-theory-limits
  164. [164] Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions — reactive:deep-learning-theory-limits
  165. [165] A comprehensive survey on explainable artificial intelligence and ... — reactive:deep-learning-theory-limits
  166. [166] From Explainability to Control: The 2026 Executive View of AI Interpretability and Explainability — reactive:deep-learning-theory-limits
  167. [167] u-μP: The Unit-Scaled Maximal Update Parametrization - arXiv — reactive:deep-learning-theory-limits
  168. [168] [PDF] Towards Understanding Grokking: An Effective Theory of ... — reactive:deep-learning-theory-limits
  169. [169] Grokking as Dimensional Phase Transition in Neural Networks — reactive:deep-learning-theory-limits
  170. [170] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  171. [171] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
  172. [172] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
  173. [173] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
  174. [174] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
  175. [175] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
  176. [176] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
  177. [177] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
  178. [178] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
  179. [179] NeurIPS Benign Overfitting in Out-of-Distribution Generalization of Linear Models — reactive:deep-learning-theory-limits
  180. [180] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
  181. [181] Mind the spikes: Benign overfitting of kernels and neural ... — reactive:deep-learning-theory-limits
  182. [182] Prune Training Data to Maximize LLM Factual Memorization | Changecast — reactive:deep-learning-theory-limits
  183. [183] Uniform convergence may be unable to explain generalization in ... — reactive:deep-learning-theory-limits
  184. [184] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
  185. [185] [1906.05271] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
  186. [186] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
  187. [187] Understanding Deep Learning (Still) Requires Rethinking ... — reactive:deep-learning-theory-limits
  188. [188] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
  189. [189] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
  190. [190] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits
  191. [191] Reactivation: Empirical NTK Dynamics Under Task Shifts — reactive:deep-learning-theory-limits
  192. [192] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  193. [193] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  194. [194] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  195. [195] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  196. [196] [2506.07998] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  197. [197] Generative Modeling of Neural Network Weights - Emergent Mind — reactive:deep-learning-theory-limits
  198. [198] [PDF] Memorizing Long-tail Data Can Help Generalization Through ... - arXiv — reactive:deep-learning-theory-limits
  199. [199] Mechanistic interpretability: 10 Breakthrough Technologies 2026 | MIT Technology Review — reactive:deep-learning-theory-limits
  200. [200] Workshop on Mechanistic Interpretability - ICML 2026 — reactive:deep-learning-theory-limits
  201. [201] ICML 2026 Schedule — reactive:deep-learning-theory-limits
  202. [202] 2026 Conference — reactive:deep-learning-theory-limits
  203. [203] ICML 2026 Workshops — reactive:deep-learning-theory-limits
  204. [204] Training Data Pruning Improves Memorization of Facts — reactive:deep-learning-theory-limits
  205. [205] [PDF] TRAINING DATA PRUNING IMPROVES MEMORIZATION OF FACTS — reactive:deep-learning-theory-limits
  206. [206] [2604.08519] Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts — reactive:deep-learning-theory-limits