The Information Machine

Deep Learning Theory Is Broken — And Maybe Unfixable · history

Version 10

2026-05-02 14:06 UTC · 314 items

Narrative

Two institutional confirmations anchor this cycle. The official NeurIPS Blog post announcing the NeurIPS 2025 Best Paper Awards[1] provides the canonical source for the award to the paper on implicit dynamical regularization in diffusion models, previously tracked through secondary coverage[2]. This closes the chain of evidence from preprint to official program-committee record. Mechanistic interpretability's institutional expansion has acquired a second major conference anchor: a Workshop on Mechanistic Interpretability at NeurIPS (the URL routes through the NeurIPS 2025 virtual system, though the page title references 2026)[3], distinct from the already-tracked ICML 2026 workshop announced by Neel Nanda[4]. Combined with an 'Actionable Interpretability' workshop at ICML 2025[5], the interpretability program now has confirmed workshop presence across at least three consecutive conference slots. The statistical generalization theory program shows no comparable pace of institutional self-organization: it has produced tier-1 papers but has not translated them into recurring conference infrastructure.

The NTK evolution program has opened two new fronts that together begin to connect it to the grokking and phase-transition debates. 'Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability'[6] (arxiv 2507.12837) extends NTK dynamics analysis to the edge-of-stability training regime, in which gradient descent runs at large learning rates while the loss sharpness (the largest Hessian eigenvalue) hovers near the stability threshold of 2/η for learning rate η. This connects the NTK evolution strand to a separately active theoretical program around catapult dynamics. 'A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression'[7], a NeurIPS 2025 poster, directly links NTK alignment dynamics to phase transitions in a regression setting. It provides a new bridge between the NTK evolution program and the grokking/phase-transition strand, which the ICML 2025 norm-critique paper[8] and the dimensional phase transition paper[9] have been developing independently of the NTK program. An ICML 2025 poster on 'Adaptive kernel predictors from feature-learning infinite limits of neural networks'[10] suggests the infinite-width limit frameworks underlying the NTK program can be extended to accommodate feature learning, potentially bridging the lazy-training and feature-learning regimes that learning mechanics treats as categorically distinct. A paper in IOP Machine Learning: Science and Technology on internal dynamics viewed through the NTK lens[11] marks the NTK evolution program's arrival in physics-adjacent journals, paralleling how the Montanari-Urbani result reached the Simons Institute and Stanford Mathematics.

An applied engineering cluster has emerged around reducing memorization in deployed LLMs. It is a parallel response to the theoretical memorization debate and is institutionally separate from both the Feldman necessity program and the training-dynamics generalization research. 'Mitigating Memorization in Language Models'[12][13] and 'Pruning as a Defense: Reducing Memorization in Large Language Models'[14][15] treat LLM memorization as an engineering problem to be suppressed rather than a theoretical property to be explained. 'Undesirable Memorization in Large Language Models: A Survey'[16] frames the phenomenon as inherently harmful, a stance in tension with Feldman's thesis that memorization is necessary for generalization from long-tailed distributions. 'Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models'[17] (Tirumala et al., NeurIPS 2022) finds that in LLMs, memorization peaks early and then decreases while generalization continues to improve. This temporal separation echoes grokking's sequential memorize-then-generalize structure, but in a large-scale pretraining context. The 'Generalization potential of large language models'[18] paper addresses whether LLMs genuinely generalize or extrapolate from training distributions, extending the memorization-versus-generalization question to the deployed-model regime. None of these applied papers has yet engaged the theoretical memorization-necessity arguments directly.

The 'Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions' preprint (arxiv 2602.24176)[19] introduces a third direction on interpretability: it argues the XAI paradigm has exhausted itself and a new framework is needed, pressuring from the applied AI accountability side rather than the circuit-level or statistical theory sides. Lawrence Chan's Google Scholar profile is now indexed[20] alongside the ResearchGate page for 'There Will Be a Scientific Theory of Deep Learning'[21], confirming both the skeptic and the optimistic response are being actively surveyed as reference points. The thread's arc has sharpened into a four-strand topology: the training-dynamics cluster holds the thread's highest institutional recognition (NeurIPS 2025 Best Paper); mechanistic interpretability holds the most conference infrastructure (three workshop slots); the applied LLM memorization mitigation cluster holds engineering traction; and classical generalization bounds remain the only major strand without a corresponding institutional upgrade. Whether the NTK program's new phase-transition and edge-of-stability papers provide bridging between these strands or constitute yet another independent sub-debate is the most consequential structural question entering the next cycle.

Timeline

  • 2016-01-01: Zhang et al. demonstrate that standard neural networks can memorize completely random labels on CIFAR-10 and ImageNet, invalidating data-independent generalization bounds. [23]
  • 2019-01-01: Nagarajan and Kolter show empirically that spectral-norm bounds scale in the wrong direction, growing with training-set size, and prove in an overparameterized linear setting that uniform convergence is insufficient to explain gradient descent generalization. [22][145][146][155][156]
  • 2019-06-01: Feldman publishes 'Does Learning Require Memorization? A Short Tale about a Long Tail,' arguing memorization of tail examples is causally necessary for learning from long-tailed distributions. [64][69][157][70][71][72][73][81]
  • 2020-07-01: Negrea et al. publish 'In Defense of Uniform Convergence' at ICML 2020, arguing that uniform convergence can be partially recovered through derandomization applied to interpolating classifiers. [85][86][87]
  • 2020-11-01: 'Disentangling Feature and Lazy Training in Deep Neural Networks' provides a formal framework for distinguishing when networks operate in the NTK (lazy) regime versus learning new representations (feature learning), directly engaging the axis that learning mechanics treats as the central diagnostic for whether classical theory applies. [138]
  • 2021-01-01: Yang et al. establish an exact algebraic gap between generalization error and the tightest possible uniform convergence bound in random feature models, giving Nagarajan-Kolter a precise formal complement. [88][89]
  • 2021-01-01: Communications of the ACM publishes the canonical journal version of Zhang et al.'s rethinking-generalization result, cementing it as a textbook-permanent finding rather than a contested empirical claim. [158]
  • 2022-01-01: A NeurIPS 2022 paper claims PAC-Bayes compression bounds can be made tight enough to actually explain generalization in neural networks, directly challenging the narrative that all known bounds are vacuous. [104][105][106]
  • 2022-01-01: NeurIPS 2022 paper 'Towards Understanding Grokking: An Effective Theory of Representation Learning' provides an effective-theory account of delayed generalization, framing grokking as a consequence of representation learning dynamics. [143]
  • 2022-01-01: Tirumala et al. publish 'Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models' (NeurIPS 2022), finding that memorization in LLMs peaks early in training and subsequently decreases while generalization continues to improve — a temporal separation that echoes grokking's sequential structure in the large-scale pretraining regime. [17]
  • 2023-07-01: Wu et al. publish 'The Implicit Regularization of Dynamical Stability in SGD' at ICML 2023, showing that SGD's dynamical stability provides an implicit regularizer that suppresses memorization — bridging the algorithmic stability and implicit regularization research strands. [90][91]
  • 2024-01-01: NeurIPS 2024 paper on symmetries in overparameterized neural networks using a mean-field view offers a new structural lens on why overparameterization does not prevent generalization. [159][160]
  • 2024-07-01: u-μP: The Unit-Scaled Maximal Update Parametrization (arxiv 2407.17465) refines the μP framework by incorporating unit scaling, representing continued active development of learning mechanics' main practical engineering artifact. [142]
  • 2024-10-01: 'Mitigating Memorization in Language Models' (arxiv 2410.02159) proposes engineering techniques to reduce memorization in deployed LLMs — treating memorization as a practical harm to suppress rather than a theoretically necessary property, implicitly opposing Feldman's necessity thesis in the LLM regime. [12][13]
  • 2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales. [161]
  • 2025-01-01: Urbani delivers a seminar at the Stanford Mathematics department on generalization in two-layer neural networks, confirming the Montanari-Urbani result is circulating in pure-math venues beyond ML conferences. [58]
  • 2025-01-01: ICLR 2025 paper 'When Memorization Hurts Generalization' argues memorization can actively damage generalization performance — a stronger claim than merely saying memorization is unnecessary. [63][82][83][84]
  • 2025-01-01: NeurIPS 2025 Oral 'Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks' (Montanari and Urbani) presents results separating generalization dynamics from overfitting dynamics in the large two-layer regime; oral status confirmed, placing it in the top ~1-2% of NeurIPS 2025 submissions. [45][46][47][48][49][50][51][53][54][55][56][57]
  • 2025-01-01: NeurIPS 2025 Best Paper 'Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training' — confirmed via the official NeurIPS Best Paper Awards blog post — attributes diffusion models' resistance to memorization to implicit dynamical regularization; this is now the most institutionally recognized result in the training-dynamics cluster. [92][93][94][95][96][97][98][99][100][2][101][102][103][1]
  • 2025-01-01: ICML 2025 poster 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits the conditions under which interpolating classifiers can generalize in the two-layer setting. [147]
  • 2025-01-01: ICLR 2025 poster 'Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data' extends the memorization-generalization debate to LLMs, examining how specific pretraining data drives capability versus memorization. [133][134][135][137][132]
  • 2025-01-01: 'Necessary Memorization in Overparameterized Learning under Long-Tailed Mixture Models: Theory and Privacy Implications' provides formal theoretical support for Feldman's necessity claim in a specific mathematical setting, and introduces differential privacy leakage as a downstream consequence of memorization. [59]
  • 2025-01-01: NeurIPS 2025 poster 'Understanding the Evolution of the Neural Tangent Kernel' tracks how the NTK changes during training rather than treating it as a fixed linearization, engaging the 'lazy training vs. feature learning' distinction central to the learning mechanics critique of classical theory. [33]
  • 2025-01-01: 'Reactivation: Empirical NTK Dynamics Under Task Shifts' empirically tracks NTK dynamics as tasks change, connecting the NTK evolution program to the continual learning setting. [139]
  • 2025-01-01: 'Generative Modeling of Weights: Generalization or Memorization?' opens a new subdomain within the memorization debate: whether generative models of neural network weight distributions can generalize to unseen architectures or merely memorize training-set weight configurations. The paper has acquired dissemination across HuggingFace, a project page, and Emergent Mind, indicating growing community attention. [144][162][163][164][165][166][167]
  • 2025-01-01: A Workshop on Mechanistic Interpretability appears in the NeurIPS 2025 virtual infrastructure (the URL routes through the NeurIPS 2025 system, though the page title references 2026), marking the field's confirmed presence at NeurIPS alongside the separately-tracked ICML 2026 workshop. [3]
  • 2025-01-01: NeurIPS 2025 poster 'A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression' directly connects NTK alignment dynamics to phase transitions, providing a new bridge between the NTK evolution program and the grokking/phase-transition research strand. [7]
  • 2025-01-01: An 'Actionable Interpretability' workshop at ICML 2025 adds a second interpretability-focused workshop slot at ICML, alongside the ICML 2026 mechanistic interpretability workshop, and represents a parallel institutional track to the mechanistic interpretability workshop program. [5]
  • 2025-01-01: ICML 2025 poster 'Adaptive kernel predictors from feature-learning infinite limits of neural networks' suggests the infinite-width limit frameworks underlying the NTK program can accommodate feature learning, potentially bridging the lazy-training and feature-learning regimes. [10]
  • 2025-02-01: 'Pruning as a Defense: Reducing Memorization in Large Language Models' (arxiv 2502.15796) demonstrates that model pruning can reduce LLM memorization as a practical defense — extending the engineering memorization mitigation cluster to compression-based techniques. [14][15]
  • 2025-02-19: Pierfrancesco Urbani (CNRS) delivers an invited talk at the Simons Institute (Berkeley) on 'Generalization and overfitting in two-layer neural networks,' extending the dynamical decoupling result's reach to a major theory venue. [52]
  • 2025-07-01: 'Grokking Beyond the Euclidean Norm of Model Parameters' (ICML 2025 poster) challenges the L2-norm-based account of the memorization-to-generalization transition in grokking, suggesting the phase transition involves dynamics not fully captured by weight norm trajectories. [8]
  • 2025-07-01: 'Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability' (arxiv 2507.12837) extends NTK dynamics analysis to the edge-of-stability training regime, in which gradient descent runs at large learning rates while the loss sharpness hovers near the stability threshold of 2/η for learning rate η. This connects NTK evolution to the separately active catapult/sharpness-reduction theoretical program. [6]
  • 2025-10-01: 'Memorizing Long-tail Data Can Help Generalization Through Composition' (arxiv 2510.16322) proposes that the mechanism by which tail memorization aids generalization is compositional rather than coverage-based, offering a mechanistic refinement of Feldman's thesis. [60][74]
  • 2026-01-01: MIT Technology Review names mechanistic interpretability one of its 10 Breakthrough Technologies of 2026, institutionally recognizing circuit-level understanding as a viable alternative program to statistical generalization theory for understanding deep learning. [117][121][122][168]
  • 2026-01-01: A dedicated ICML 2026 workshop on mechanistic interpretability is announced by Neel Nanda (DeepMind) on LinkedIn and X/Twitter, with infrastructure confirmed via Buttondown newsletter, official schedule page, OpenReview submission portal, and ICML Blog; and an 'Open problems in mechanistic interpretability: 2026 status report' is published as a GitHub gist. [123][124][126][127][4][128][169][170][129][171][172][116][115]
  • 2026-02-01: 'Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions' (arxiv 2602.24176) argues the XAI paradigm has exhausted itself and a new framework is needed — introducing a third interpretability-adjacent pressure on deep learning theory, distinct from both statistical generalization theory and circuit-level mechanistic interpretability, from the applied AI accountability side. [19]
  • 2026-03-01: MIT News covers research on improving AI models' ability to explain their predictions, marking science-communication investment in interpretability research that the statistical generalization theory program has not received. [125]
  • 2026-04-01: 'Grokking as Dimensional Phase Transition in Neural Networks' (arxiv 2604.04655) introduces a new theoretical framework for delayed generalization, framing the memorization-to-generalization transition as a phase transition in the dimensionality of learned representations. [9]
  • 2026-04-01: 'Training Data Pruning Improves Memorization of Facts' (arxiv 2604.08519) presents a counterintuitive finding: selective reduction of training data can improve how well models memorize individual facts, suggesting data redundancy may suppress rather than reinforce fact memorization. [75][173][174][175]
  • 2026-04-24: Jamie Simon and Daniel Kunin (UC Berkeley) appear on Imbue's podcast arguing that a scientific theory of deep learning is achievable, marking the first direct public advocacy by the learning mechanics authors. [29][34]
  • 2026-04-25: LawrenceC (confirmed as Lawrence Chan via Google Scholar) publishes a critical review of Simon et al.'s 'learning mechanics' manifesto on the Alignment Forum, welcoming its ambition while doubting it will deliver a comprehensive or broadly useful theory. [24][20]
  • 2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory,' providing detailed technical and historical context for why Zhang et al. 2016 was so devastating to the classical generalization-bound paradigm. [23][26]
  • 2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory,' narrating Nagarajan and Kolter 2019 as the definitive proof that uniform convergence cannot explain neural network generalization; the post is crossposted to LessWrong and drives renewed interest in the original paper. [22][25][27]
  • 2026-04-28: 'There Will Be a Scientific Theory of Deep Learning' (arxiv 2604.21691) appears as a late-April 2026 preprint offering the first formal academic-paper-level optimistic response to the thread's founding pessimism, its title directly inverting LawrenceC's framing. The paper has accumulated a ResearchGate page indicating cross-platform indexing. [35][37][21]
  • 2026-04-30: At least three distinct OpenReview submissions titled 'Is Memorization Actually Necessary for Generalization?' appear, representing independent formal challenges to Feldman's affirmative claim. [65][61][62][66][67]
  • 2026-05-01: A LessWrong 'Quick Paper Review' of 'There Will Be a Scientific Theory of Deep Learning' appears on the same platform as LawrenceC's original critique posts, and a Reddit r/MachineLearning thread opens discussion of the preprint — marking its transition from indexed to actively engaged within the communities that originally amplified the pessimistic framing. [42][43]

Perspectives

LawrenceC (Alignment Forum / LessWrong) — confirmed as Lawrence Chan

Classical deep learning theory was irreparably broken by two landmark papers (Zhang et al. 2016; Nagarajan & Kolter 2019). The proposed replacement, learning mechanics, is a promising manifesto but has so far produced little practical fruit beyond hyperparameter scaling (μP), explicitly does not aim to explain the specific algorithms learned by networks, and has not yet earned the title of a comprehensive theory of deep learning.

Evolution: Consistent across all three posts. Google Scholar profile now indexed[20], confirming this is Lawrence Chan. The NeurIPS 2025 Best Paper for the diffusion models result and u-μP's active development partially address his specific critique about μP being learning mechanics' only practical output, but his broader skepticism about a comprehensive theory remains unanswered.

Jamie Simon and Daniel Kunin (UC Berkeley / learning mechanics)

A scientific theory of deep learning is achievable; their 'learning mechanics' framework, grounded in average-case training dynamics and aggregate statistics, is the right approach. They have promoted this position publicly via the Imbue podcast (April 24, 2026), a YouTube presentation, and ongoing Twitter activity.

Evolution: Consistent; no direct public response to LawrenceC's critique yet apparent. The NeurIPS 2025 Best Paper for implicit dynamical regularization and the NTK evolution/disentangling papers both converge on the feature-learning vs. lazy-training axis that learning mechanics treats as central. The new NTK edge-of-stability and phase-transition papers[6][7] provide additional external convergence on the dynamics-first framing.

'There Will Be a Scientific Theory of Deep Learning' authors (April 2026 preprint)

A scientific theory of deep learning is achievable — the title directly inverts the pessimistic framing of LawrenceC's series. The formal academic-paper format marks this as the first non-blog-post optimistic response in this cycle.

Evolution: Previously received LessWrong and Reddit engagement. This cycle: ResearchGate indexing[21] confirms cross-platform amplification. The NeurIPS 2025 Best Paper for the dynamical regularization result provides indirect empirical support for their thesis, and no formal critical response has emerged.

Independent Medium commentary ('second formation' framing)

Learning mechanics represents a genuine paradigm transition — a 'second formation' of deep learning theory analogous to statistical mechanics succeeding classical mechanics — not merely an incremental research program.

Evolution: Consistent from prior cycle. More optimistic than Simon and Kunin's own public claims and directly contradicts LawrenceC's skepticism about practical utility.

Montanari and Urbani (dynamical decoupling)

In the large two-layer network regime, generalization dynamics and overfitting dynamics are separable — a result that, if it extends to deeper networks, would provide an algorithm-dependent structural account of why networks generalize despite interpolating training data.

Evolution: Consistent from prior cycles. NeurIPS 2025 Oral status and cross-disciplinary reach (Simons Institute, Stanford Mathematics) confirmed in prior cycle; no new developments this cycle.

Vitaly Feldman (memorization is necessary)

Memorization of tail examples is causally necessary for learning from long-tailed distributions. This reframes Zhang et al.'s result: memorization is part of learning, not evidence that theory is broken.

Evolution: Under challenge from multiple directions simultaneously: theoretical formalization proceeds under long-tailed mixture models[59], the compositional mechanism account[60] refines the claim's mechanism, three OpenReview challenges contest the necessity claim[61][62], 'When Memorization Hurts' takes the opposite position[63], and applied LLM work frames memorization as 'undesirable'[16] — treating it as a practical harm rather than a learning necessity. The applied cluster has not engaged Feldman's formal arguments directly, leaving the formal vs. engineering debate unresolved.

ICLR 2025 'When Memorization Hurts Generalization' authors

Memorization is not merely unnecessary for generalization but can actively damage it — the strongest anti-Feldman position yet to appear in a peer-reviewed venue.

Evolution: Consistent from prior cycle.

Negrea et al. (defense of uniform convergence via derandomization)

Uniform convergence arguments can be partially recovered through derandomization techniques applied to interpolating classifiers, suggesting the Nagarajan-Kolter impossibility is not a blanket closure of the uniform convergence program.

Evolution: Consistent from prior cycle. Whether this result engages or sidesteps Yang et al.'s exact gap result remains unresolved.

Yang et al. 2021 (exact gap, uniform convergence)

The failure of uniform convergence in random feature models is not an artifact of loose bounds but is provably exact — there is a measurable algebraic gap between any uniform convergence bound and the true generalization error, even in principle.

Evolution: Consistent from prior cycle. The Negrea et al. defense creates an unresolved question about whether the exact gap forecloses derandomized arguments.

Wu et al. ICML 2023 (implicit regularization of dynamical stability, SGD)

SGD's dynamical stability provides an implicit regularizer that suppresses memorization and promotes generalization — a mechanistic account connecting training-dynamics arguments to the diffusion models result and the dynamical decoupling program.

Evolution: Consistent from prior cycle. Now sits as the base tier of a three-tier training-dynamics cluster (ICML 2023 paper < NeurIPS 2025 Oral < NeurIPS 2025 Best Paper) that has collectively accumulated more top-venue distinction than the generalization bounds program in 2023–2025.

NeurIPS 2025 diffusion model memorization researchers (Best Paper)

Implicit dynamical regularization during training prevents memorization in diffusion models — a training-dynamics explanation for a phenomenon that would otherwise require architectural or data-geometric accounts.

Evolution: Confirmed as NeurIPS 2025 Best Paper via the official NeurIPS Blog post[1], which supplies the canonical institutional source in place of the secondary coverage previously relied on. An ENS-Lyon seminar extends geographic reach to European physics venues.

PAC-Bayes compression bounds researchers (NeurIPS 2022)

PAC-Bayes bounds can be made sufficiently tight to actually explain generalization in neural networks — the vacuousness of prior bounds was not a fundamental limit of the framework but an artifact of loose construction.

Evolution: Consistent; no direct engagement with Yang et al.'s exact gap result or Negrea et al.'s derandomization argument yet apparent.

Algorithmic stability / SGD stability researchers

Generalization in deep learning can be explained through the stability properties of SGD. This approach is inherently algorithm- and data-dependent, directly addressing the Nagarajan-Kolter critique that uniform convergence ignores gradient descent's inductive bias.

Evolution: Consistent; Wu et al. ICML 2023 bridges this cluster to the implicit regularization and diffusion memorization strands, providing a shared mechanistic account.

Mechanistic interpretability community (MIT 2026 Breakthrough, Neel Nanda, multi-conference)

Understanding deep learning through circuit-level mechanistic analysis of what algorithms individual networks implement is a viable — and now institutionally recognized — alternative to statistical generalization theory.

Evolution: Previously anchored by MIT's Breakthrough designation, the ICML 2026 workshop announcement, and Neel Nanda's LinkedIn advocacy. This cycle: a Workshop on Mechanistic Interpretability in the NeurIPS 2025 virtual system[3] and an 'Actionable Interpretability' workshop at ICML 2025[5] extend the institutional presence to at least three consecutive conference slots. The X/Twitter announcement by Neel Nanda[115] confirms cross-platform amplification, and the Call for Papers page[116] confirms active submission infrastructure. The field is now operating with self-sustaining multi-conference workshop infrastructure.

ICLR 2025 LLM memorization researchers

Language models' generalization capabilities can be traced back to specific pretraining data, implying a causal data-memorization link that extends Feldman's long-tail thesis to the LLM regime.

Evolution: Consistent from prior cycle. arxiv page now indexed[132].

NTK evolution researchers

The Neural Tangent Kernel is not static during training. Understanding how it evolves connects the classical NTK linearization program to the feature learning regime that learning mechanics treats as operative in practice.

Evolution: Previously entered via a NeurIPS 2025 poster[33]. This cycle: 'Understanding the Evolution of the NTK at the Edge of Stability'[6] extends the program to the edge-of-stability training regime; 'A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression'[7] connects the program to the phase-transition strand; 'Adaptive kernel predictors from feature-learning infinite limits'[10] bridges the NTK and feature-learning regimes; and an IOP physics-adjacent journal publishes NTK dynamics work[11]. Together these indicate the program is now multi-directional rather than a single-paper strand.

ICML 2025 grokking norm-critique researchers

The memorization-to-generalization transition in grokking is not fully explained by Euclidean weight norm dynamics, suggesting existing norm-based theoretical accounts are incomplete.

Evolution: Consistent from prior cycle. The norm-critique paper challenged the norm-based account without offering a replacement; the new NTK alignment/phase-transition paper[7] now supplies an adjacent candidate by bridging the NTK and phase-transition programs.

Applied LLM memorization mitigation researchers

Memorization in deployed LLMs is an 'undesirable' engineering problem to be suppressed through techniques including pruning, training interventions, and data curation — a framing that implicitly treats memorization as harmful, opposing Feldman's necessity thesis at the applied level even without engaging it formally.

Evolution: New voice this cycle, entering via a cluster of applied papers: 'Pruning as a Defense'[14], 'Mitigating Memorization in Language Models'[12], and 'Undesirable Memorization in LLMs: A Survey'[16]. Tirumala et al.'s 'Memorization Without Overfitting'[17] (NeurIPS 2022) provides the empirical foundation — showing temporal separation of memorization from generalization in LLMs — that the mitigation work builds on. This cluster has not formally engaged Feldman's theoretical necessity arguments.

Beyond XAI paradigm-shift researchers (arxiv 2602.24176)

The Explainable AI (XAI) paradigm has exhausted itself and a post-XAI framework is needed — arguing from the applied AI accountability side that the current interpretability landscape requires a paradigm shift, distinct from both statistical generalization theory and circuit-level mechanistic interpretability.

Evolution: New voice this cycle. Introduces a third direction on interpretability distinct from the mechanistic interpretability vs. statistical theory axis that has organized the thread so far.

Tensions

  • Can learning mechanics, which focuses on average-case training dynamics and coarse aggregate statistics, ever constitute a comprehensive theory of deep learning, or is it structurally limited to explaining some aspects while leaving others permanently outside its scope? The April 2026 preprint 'There Will Be a Scientific Theory of Deep Learning' argues the question is not closed, and it is drawing active community engagement on LessWrong, where LawrenceC's pessimism originated. The NeurIPS 2025 Best Paper for implicit dynamical regularization provides indirect empirical warrant for the learning mechanics framing, and the NTK edge-of-stability and phase-transition papers provide further external convergence on dynamics-first explanations. Whether these results support learning mechanics' feature-learning framing or represent separate theoretical lines remains unresolved. [24][29][30][31][32][44][35][42][43][33][138][139][142][6][7]
  • Is memorization causally necessary, causally harmful, or merely correlated with generalization — and does the answer change across training epochs, architectures, task settings, and data regimes? Feldman's necessity claim is being formalized under long-tailed mixture models with a compositional mechanism account, while 'When Memorization Hurts Generalization' claims the opposite, and applied LLM mitigation work frames memorization as 'undesirable' without engaging the formal necessity arguments. Tirumala et al.'s 'Memorization Without Overfitting' shows temporal separation between memorization and generalization in LLMs, echoing grokking but in pretraining. These positions may not be contradictory if memorization's effects depend on training epoch, regime, and task structure — but no synthesis has yet been attempted. [64][65][61][62][63][66][67][82][83][84][59][60][74][143][9][75][76][144][14][17][15][12][13][16]
  • Yang et al. 2021 establish an exact algebraic gap between generalization error and uniform convergence in random feature models. Negrea et al.'s ICML 2020 'In Defense of Uniform Convergence' argues the framework can be salvaged through derandomization applied to interpolating classifiers. Do derandomized uniform convergence arguments fall within the class foreclosed by the exact gap result, or do they constitute a genuine escape hatch that keeps the classical program viable? [22][88][89][145][146][85][86][87]
  • The NeurIPS 2025 dynamical decoupling result establishes that generalization and overfitting are separable phenomena in large two-layer networks. Does this separation persist in deeper networks and in transformer architectures, and if so, does it support or complicate the learning mechanics program's focus on aggregate training dynamics? [45][46][48][51][52][53][54][55][56][58]
  • If implicit dynamical regularization explains why diffusion models don't memorize (NeurIPS 2025 Best Paper), and Wu et al.'s SGD dynamical stability result provides a parallel mechanism, is there a unified dynamical account of memorization suppression across architectures? Or does the memorization debate remain specific to architecture and regime, with separate mechanisms governing transformers, diffusion models, and two-layer networks? The NTK edge-of-stability paper[6] raises the question of whether the same dynamical regularization account holds in high-sharpness training regimes. [92][93][94][95][100][2][90][136][133][147][139][6]
  • PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while Yang et al.'s exact gap result says uniform convergence-style arguments are provably incapable of capturing the correct quantity. Are PAC-Bayes compression bounds a genuine escape hatch from the Nagarajan-Kolter impossibility, or do they fall within the class of arguments the exact gap result forecloses? [104][105][148][106][88][89]
  • Benign overfitting results show that interpolating classifiers can generalize under certain geometric conditions. ICML 2025's 'Rethinking Benign Overfitting in Two-Layer Neural Networks' revisits these conditions — do they hold broadly enough to constitute a useful theory, or do they require fine-tuned assumptions that fail in realistic multi-layer, non-linear settings? [149][150][151][152][147][153][154]
  • Is mechanistic interpretability a practical alternative to statistical generalization theory for understanding deep learning, or does its circuit-level focus simply answer a different question? The field now has multi-conference workshop infrastructure (ICML 2026, NeurIPS workshop presence, ICML 2025 'Actionable Interpretability') and a named advocate (Neel Nanda) who spans safety, alignment, and academic ML — but institutional success is not the same as theoretical displacement, and the mechanistic interpretability and generalization theory programs have not yet engaged each other directly. The 'Beyond XAI' paradigm-shift argument introduces a third pressure from the applied accountability side, further fragmenting the interpretability landscape. [117][118][119][120][24][29][123][124][121][4][129][116][115][3][5][19]
  • The grokking sub-debate now has at least four competing accounts — norm-based, dimensional phase transition, effective theory of representation learning, and compositional mechanism — that have not been formally reconciled. 'Grokking Beyond the Euclidean Norm' challenges the norm account without proposing a replacement. The new NTK alignment/phase-transition paper[7] provides an external bridge between NTK dynamics and phase transitions that the grokking literature has not yet incorporated. Is grokking a single phenomenon admitting a unified theory, or a family of distinct delayed-generalization transitions that happen to share a surface phenomenology? [8][9][143][60][74][7]
  • The NTK evolution program now spans multiple fronts (edge of stability[6], phase transitions[7], adaptive kernel predictors from feature-learning limits[10], empirical task-shift dynamics[139], and physics-journal venues[11]) but has not yet produced a unified account of how these strands relate. Does the NTK evolution program converge toward the feature-learning regime that learning mechanics treats as operative in practice, or does it remain a set of disconnected extensions of the lazy-training regime? The adaptive kernel predictors paper[10] suggests that convergence is possible, but it has not yet been demonstrated. [33][138][139][11][10][6][7]

Sources

  1. [1] Announcing the NeurIPS 2025 Best Paper Awards – NeurIPS Blog — reactive:deep-learning-theory-limits
  2. [2] Best paper NeurIPS 2025: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in… | Charles H. Martin, PhD — reactive:deep-learning-theory-limits
  3. [3] Workshop on Mechanistic Interpretability - NeurIPS 2026 — reactive:deep-learning-theory-limits
  4. [4] Neel Nanda's Post - LinkedIn — reactive:deep-learning-theory-limits
  5. [5] Actionable Interpretability — reactive:deep-learning-theory-limits
  6. [6] [2507.12837] Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability — reactive:deep-learning-theory-limits
  7. [7] NeurIPS Poster A Closer Look at NTK Alignment: Linking Phase Transitions in Deep Image Regression — reactive:deep-learning-theory-limits
  8. [8] Grokking Beyond the Euclidean Norm of Model Parameters — reactive:deep-learning-theory-limits
  9. [9] Grokking as Dimensional Phase Transition in Neural Networks — reactive:deep-learning-theory-limits
  10. [10] ICML Poster Adaptive kernel predictors from feature-learning infinite limits of neural networks — reactive:deep-learning-theory-limits
  11. [11] Internal dynamics of neural networks through the NTK lens — reactive:deep-learning-theory-limits
  12. [12] Mitigating Memorization in Language Models — reactive:deep-learning-theory-limits
  13. [13] [PDF] MITIGATING MEMORIZATION IN LANGUAGE MODELS — reactive:deep-learning-theory-limits
  14. [14] Pruning as a Defense: Reducing Memorization in Large Language Models — reactive:deep-learning-theory-limits
  15. [15] [2502.15796] Pruning as a Defense: Reducing Memorization in Large Language Models — reactive:deep-learning-theory-limits
  16. [16] Undesirable Memorization in Large Language Models: A Survey — reactive:deep-learning-theory-limits
  17. [17] [PDF] Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models | Semantic Scholar — reactive:deep-learning-theory-limits
  18. [18] Generalization potential of large language models | Neural Computing and Applications | Springer Nature Link — reactive:deep-learning-theory-limits
  19. [19] Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions — reactive:deep-learning-theory-limits
  20. [20] Lawrence Chan - Google Scholar — reactive:deep-learning-theory-limits
  21. [21] (PDF) There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
  22. [22] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
  23. [23] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
  24. [24] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
  25. [25] The other paper that killed deep learning theory — LessWrong — reactive:deep-learning-theory-limits
  26. [26] The paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
  27. [27] The other paper that killed deep learning theory — AI Alignment Forum — reactive:deep-learning-theory-limits
  28. [28] Lawrence Chan — reactive:deep-learning-theory-limits
  29. [29] Jamie Simon and Daniel Kunin, UC Berkeley: There Will Be a Scientific Theory of Deep Learning - imbue — reactive:deep-learning-theory-limits
  30. [30] There Will Be a Scientific Theory of Deep Learning (Apr 2026) — reactive:deep-learning-theory-limits
  31. [31] Quick Paper Review: "There Will Be a Scientific Theory of Deep ... — reactive:deep-learning-theory-limits
  32. [32] Daniel Kunin — reactive:deep-learning-theory-limits
  33. [33] Understanding the Evolution of the Neural Tangent Kernel at the ... — reactive:deep-learning-theory-limits
  34. [34] There Will Be a Scientific Theory of Deep Learning - Imbue — reactive:deep-learning-theory-limits
  35. [35] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
  36. [36] There Will Be a Scientific Theory of Deep Learning | Cool Papers — reactive:deep-learning-theory-limits
  37. [37] [2604.21691] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
  38. [38] There Will Be a Scientific Theory of Deep Learning | alphaXiv — reactive:deep-learning-theory-limits
  39. [39] There Will Be a Scientific Theory of Deep Learning | Takara TLDR — reactive:deep-learning-theory-limits
  40. [40] There Will Be a Scientific Theory of Deep Learning - YouTube — reactive:deep-learning-theory-limits
  41. [41] There Will Be a Scientific Theory of Deep Learning — reactive:deep-learning-theory-limits
  42. [42] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — LessWrong — reactive:deep-learning-theory-limits
  43. [43] There Will Be a Scientific Theory of Deep Learning [R] - Reddit — reactive:deep-learning-theory-limits
  44. [44] Learning Mechanics and the Second Formation of Deep Learning ... — reactive:deep-learning-theory-limits
  45. [45] NeurIPS Poster Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
  46. [46] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  47. [47] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  48. [48] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks | OpenReview — reactive:deep-learning-theory-limits
  49. [49] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  50. [50] [PDF] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  51. [51] Generalization and overfitting in two-layer neural networks - YouTube — reactive:deep-learning-theory-limits
  52. [52] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
  53. [53] NeurIPS Oral Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks — reactive:deep-learning-theory-limits
  54. [54] A Dynamical Theory of Overfitting and Generalization in Large Two-Layer Networks — reactive:deep-learning-theory-limits
  55. [55] Dynamical Decoupling of Generalization and Overfitting in Large ... — reactive:deep-learning-theory-limits
  56. [56] [PDF] Dynamical decoupling of generalization and overfitting in large two ... — reactive:deep-learning-theory-limits
  57. [57] Dynamical Decoupling of Generalization and Overfitting in Large... — reactive:deep-learning-theory-limits
  58. [58] Generalization and overfitting in two-layer neural networks — reactive:deep-learning-theory-limits
  59. [59] (PDF) Necessary Memorization in Overparameterized Learning ... — reactive:deep-learning-theory-limits
  60. [60] Memorizing Long-tail Data Can Help Generalization Through ... — reactive:deep-learning-theory-limits
  61. [61] Is Memorization Actually Necessary for Generalization - OpenReview — reactive:deep-learning-theory-limits
  62. [62] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
  63. [63] [PDF] WHEN MEMORIZATION HURTS GENERALIZATION — reactive:deep-learning-theory-limits
  64. [64] Does learning require memorization? a short tale about a long tail — reactive:deep-learning-theory-limits
  65. [65] Is Memorization Actually Necessary for Generalization? - OpenReview — reactive:deep-learning-theory-limits
  66. [66] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
  67. [67] [PDF] IS MEMORIZATION Actually NECESSARY FOR GENERALIZATION? — reactive:deep-learning-theory-limits
  68. [68] Does learning require memorization? A short tale about a long tail — reactive:deep-learning-theory-limits
  69. [69] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
  70. [70] [PDF] Chasing the Long Tail: What Neural Networks Memorize and Why — reactive:deep-learning-theory-limits
  71. [71] What Neural Networks Memorize and Why - Chiyuan Zhang — reactive:deep-learning-theory-limits
  72. [72] What Neural Networks Memorize and Why: Discovering the Long ... — reactive:deep-learning-theory-limits
  73. [73] Does Learning Require Memorization? A Short Tale About A Long Tail — reactive:deep-learning-theory-limits
  74. [74] Memorizing Long-tail Data Can Help Generalization Through Composition — reactive:deep-learning-theory-limits
  75. [75] [PDF] Training Data Pruning Improves Memorization of Facts - arXiv — reactive:deep-learning-theory-limits
  76. [76] What is the role of memorization in Continual Learning? | OpenReview — reactive:deep-learning-theory-limits
  77. [77] Vitaly Feldman - Google Scholar — reactive:deep-learning-theory-limits
  78. [78] What Neural Networks Memorize and Why (Vitaly Feldman) - NeurIPS — reactive:deep-learning-theory-limits
  79. [79] Vitaly Feldman's personal homepage — reactive:deep-learning-theory-limits
  80. [80] Vitaly Feldman - Google Scholar — reactive:deep-learning-theory-limits
  81. [81] [PDF] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
  82. [82] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
  83. [83] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
  84. [84] The Pitfalls of Memorization: When Memorization Hurts Generalization — reactive:deep-learning-theory-limits
  85. [85] [PDF] In Defense of Uniform Convergence: Generalization via ... — reactive:deep-learning-theory-limits
  86. [86] [1912.04265] In Defense of Uniform Convergence - arXiv — reactive:deep-learning-theory-limits
  87. [87] In Defense of Uniform Convergence: Generalization via Derandomization with an Application to Interpolating Predictors — reactive:deep-learning-theory-limits
  88. [88] [PDF] Exact Gap between Generalization Error and Uniform Convergence ... — reactive:deep-learning-theory-limits
  89. [89] [2103.04554] Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models — reactive:deep-learning-theory-limits
  90. [90] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
  91. [91] [Quick Review] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent — reactive:deep-learning-theory-limits
  92. [92] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
  93. [93] NeurIPS Poster Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
  94. [94] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training | OpenReview — reactive:deep-learning-theory-limits
  95. [95] Why diffusion models in generative AI don’t memorize: The role of implicit dynamical regularization in training | LPENS — reactive:deep-learning-theory-limits
  96. [96] The Role of Implicit Dynamical Regularization in Training - YouTube — reactive:deep-learning-theory-limits
  97. [97] [NeurIPS 2025] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
  98. [98] Why Diffusion Models Generalize Instead of Just Memorizing - Medium — reactive:deep-learning-theory-limits
  99. [99] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
  100. [100] [2505.17638] Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training — reactive:deep-learning-theory-limits
  101. [101] (PDF) Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
  102. [102] [PDF] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
  103. [103] [PDF] Why Diffusion Models Don't Memorize: The Role of Implicit ... — reactive:deep-learning-theory-limits
  104. [104] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
  105. [105] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
  106. [106] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
  107. [107] [PDF] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
  108. [108] Data-Dependent Stability of Stochastic Gradient Descent — reactive:deep-learning-theory-limits
  109. [109] Fine-Grained Analysis of Stability and Generalization for Stochastic ... — reactive:deep-learning-theory-limits
  110. [110] [PDF] Stability and Generalization for Markov Chain Stochastic Gradient ... — reactive:deep-learning-theory-limits
  111. [111] [2502.00885] Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise — reactive:deep-learning-theory-limits
  112. [112] Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses — reactive:deep-learning-theory-limits
  113. [113] [2602.22936] Generalization Bounds of Stochastic Gradient Descent ... — reactive:deep-learning-theory-limits
  114. [114] Stability (learning theory) — reactive:deep-learning-theory-limits
  115. [115] Announcing the ICML 2026 Mechanistic Interpretability Workshop ... — reactive:deep-learning-theory-limits
  116. [116] Call for Papers | Mechanistic Interpretability Workshop at ICML 2026 — reactive:deep-learning-theory-limits
  117. [117] Mechanistic Interpretability Named MIT's 2026 Breakthrough for ... — reactive:deep-learning-theory-limits
  118. [118] Bridging the Black Box: A Survey on Mechanistic Interpretability in AI — reactive:deep-learning-theory-limits
  119. [119] Understanding Mechanistic Interpretability in AI Models - IntuitionLabs — reactive:deep-learning-theory-limits
  120. [120] AI Safety, Alignment, and Interpretability in 2026 | Zylos Research — reactive:deep-learning-theory-limits
  121. [121] Mechanistic interpretability: 10 Breakthrough Technologies 2026 | MIT Technology Review — reactive:deep-learning-theory-limits
  122. [122] MIT Technology Review's Post - Mechanistic interpretability - LinkedIn — reactive:deep-learning-theory-limits
  123. [123] Mechanistic Interpretability Workshop at ICML 2026 — reactive:deep-learning-theory-limits
  124. [124] Open problems in mechanistic interpretability: 2026 status report - Gist — reactive:deep-learning-theory-limits
  125. [125] Improving AI models' ability to explain their predictions | MIT News — reactive:deep-learning-theory-limits
  126. [126] Mech Interp Workshop @ ICML 2026 - Buttondown — reactive:deep-learning-theory-limits
  127. [127] Schedule — reactive:deep-learning-theory-limits
  128. [128] ICML 2026 Workshop Mech Interp - OpenReview — reactive:deep-learning-theory-limits
  129. [129] Announcing the ICML 2026 Workshops and Affinity Workshops – ICML Blog — reactive:deep-learning-theory-limits
  130. [130] An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers — Neel Nanda — reactive:deep-learning-theory-limits
  131. [131] Mechanistic Interpretability — Neel Nanda — reactive:deep-learning-theory-limits
  132. [132] [2407.14985] Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
  133. [133] ICLR Poster Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data — reactive:deep-learning-theory-limits
  134. [134] Generalization v.s. Memorization: Tracing Language Models'... — reactive:deep-learning-theory-limits
  135. [135] [PDF] generalization v.s. memorization - arXiv — reactive:deep-learning-theory-limits
  136. [136] For Better or for Worse, Transformers Seek Patterns for Memorization | OpenReview — reactive:deep-learning-theory-limits
  137. [137] [PDF] generalization v.s. memorization: tracing language models ... — reactive:deep-learning-theory-limits
  138. [138] Disentangling feature and lazy training in deep neural networks — reactive:deep-learning-theory-limits
  139. [139] Reactivation: Empirical NTK Dynamics Under Task Shifts — reactive:deep-learning-theory-limits
  140. [140] [PDF] Neural Tangent Kernel - Washington — reactive:deep-learning-theory-limits
  141. [141] [1811.04918] Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers — reactive:deep-learning-theory-limits
  142. [142] u-μP: The Unit-Scaled Maximal Update Parametrization - arXiv — reactive:deep-learning-theory-limits
  143. [143] [PDF] Towards Understanding Grokking: An Effective Theory of ... — reactive:deep-learning-theory-limits
  144. [144] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  145. [145] [1902.04742v2] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
  146. [146] [1902.04742] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
  147. [147] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
  148. [148] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
  149. [149] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
  150. [150] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
  151. [151] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
  152. [152] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
  153. [153] NeurIPS Benign Overfitting in Out-of-Distribution Generalization of Linear Models — reactive:deep-learning-theory-limits
  154. [154] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
  155. [155] Uniform convergence may be unable to explain generalization in ... — reactive:deep-learning-theory-limits
  156. [156] Uniform convergence may be unable to explain generalization in deep learning — reactive:deep-learning-theory-limits
  157. [157] [1906.05271] Does Learning Require Memorization? A Short Tale about a Long Tail — reactive:deep-learning-theory-limits
  158. [158] Understanding Deep Learning (Still) Requires Rethinking ... — reactive:deep-learning-theory-limits
  159. [159] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
  160. [160] Symmetries in Overparametrized Neural Networks: A Mean Field View — reactive:deep-learning-theory-limits
  161. [161] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits
  162. [162] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  163. [163] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  164. [164] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  165. [165] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  166. [166] [2506.07998] Generative Modeling of Weights: Generalization or Memorization? — reactive:deep-learning-theory-limits
  167. [167] Generative Modeling of Neural Network Weights - Emergent Mind — reactive:deep-learning-theory-limits
  168. [168] Mechanistic interpretability: 10 Breakthrough Technologies 2026 | MIT Technology Review — reactive:deep-learning-theory-limits
  169. [169] Workshop on Mechanistic Interpretability - ICML 2026 — reactive:deep-learning-theory-limits
  170. [170] ICML 2026 Schedule — reactive:deep-learning-theory-limits
  171. [171] 2026 Conference — reactive:deep-learning-theory-limits
  172. [172] ICML 2026 Workshops — reactive:deep-learning-theory-limits
  173. [173] Prune Training Data to Maximize LLM Factual Memorization | Changecast — reactive:deep-learning-theory-limits
  174. [174] Training Data Pruning Improves Memorization of Facts — reactive:deep-learning-theory-limits
  175. [175] [PDF] TRAINING DATA PRUNING IMPROVES MEMORIZATION OF FACTS — reactive:deep-learning-theory-limits