The Information Machine

Deep Learning Theory Is Broken — And Maybe Unfixable · history

Version 2

2026-04-29 20:25 UTC · 67 items

Narrative

The thread remains anchored by LawrenceC's three-part Alignment Forum series (April 25–27, 2026), which diagnoses the collapse of classical deep learning theory and appraises the uncertain prospects for its replacement. The core argument has not changed. Zhang et al. 2016 demolished data-independent complexity bounds by showing that the same networks that generalize on real labels can also memorize random ones, so the hypothesis class alone cannot tell the two regimes apart[1]. Nagarajan and Kolter 2019 proved that uniform convergence cannot explain generalization under gradient descent[2]. And learning mechanics, the leading proposed replacement, has so far produced little beyond hyperparameter scaling techniques and explicitly disclaims explaining what specific algorithms individual networks learn[3].
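For reference, the kind of data-independent guarantee at issue has roughly the following shape. This is a standard textbook statement rather than a formula quoted from the posts; the symbols $\mathcal{H}$ (hypothesis class), $S$ (training sample of size $n$), $L_{\mathcal{D}}$ (population risk), $\hat{L}_S$ (empirical risk) and $\epsilon$ are notation assumed here.

```latex
% Uniform convergence: with probability at least 1 - \delta over the draw of S,
\sup_{h \in \mathcal{H}} \left| L_{\mathcal{D}}(h) - \hat{L}_S(h) \right|
  \;\le\; \epsilon(\mathcal{H}, n, \delta)
```

In a distribution-free bound of this form, $\epsilon$ depends on the class, the sample size and the confidence level, but not on the labels. If $\mathcal{H}$ is rich enough to fit random labels on $n$ points, some hypothesis has zero empirical risk and chance-level population risk, so $\epsilon$ cannot be small. That is the sense in which such bounds become vacuous.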

The new pipeline cycle has surfaced a dense cluster of background literature across four lines of research, none of which has yet produced a new voice in the discourse. The benign overfitting literature, which attempts to characterize the conditions under which interpolating classifiers still generalize, appears prominently[4][5][6][7], alongside the double-descent phenomenon that reframes the bias-variance tradeoff for overparameterized models[8][9]. PAC-Bayes compression bounds have generated competing claims: a NeurIPS 2022 paper asserts bounds 'so tight that they can explain' generalization[10][11][12], while a separate line of work argues that tighter PAC-Bayes bounds always come with a price, preserving a form of no-free-lunch[13]. The Neural Tangent Kernel literature also appears in force[14][15][16], representing the infinite-width linearization approach to training dynamics that learning mechanics partially builds on. One item stands out as potentially significant: a post on the CMU CSD PhD Blog argues that classical generalization theory is actually more predictive in foundation models than in conventional deep networks[17]. If substantiated, that claim would complicate LawrenceC's narrative by suggesting the problem may be architecture-specific rather than universal.
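For readers unfamiliar with the infinite-width linearization mentioned above, the standard Neural Tangent Kernel setup can be sketched as follows. The notation is assumed here and not taken from the surfaced items: $f(x;\theta)$ is a scalar-output network, $\theta_0$ its initial parameters, $X$ the training inputs and $y$ the training targets.

```latex
% First-order (lazy-training) expansion around the initial parameters \theta_0
f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0)

% The tangent kernel induced by that expansion
\Theta(x, x') = \nabla_\theta f(x;\theta_0)^\top \nabla_\theta f(x';\theta_0)

% Gradient flow on the squared loss \tfrac{1}{2}\|f_t(X) - y\|^2, with the kernel held fixed
\frac{d}{dt} f_t(X) = -\,\Theta(X, X)\,\bigl(f_t(X) - y\bigr)
```

Under this approximation, training reduces to kernel regression with a fixed kernel. That makes the dynamics tractable, but it also means the features never change during training.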

The discourse has not structurally evolved since the previous synthesis: LawrenceC remains the sole substantive voice, and the new items are all reactive search results with no extracted claims, quotes, or stances. What the new batch clarifies is the shape of the theoretical landscape surrounding the critique: benign overfitting, double descent, PAC-Bayes, and NTK are the four most active research programs attempting to fill the vacuum left by the failure of classical bounds. Whether any of them constitutes a genuine solution or merely a redescription of the problem is exactly the tension LawrenceC identified. The CMU foundation-models angle and the competing PAC-Bayes tightness claims are the two most actionable leads for finding a perspective that challenges or qualifies LawrenceC's pessimism.

Timeline

  • 2016-01-01: Zhang et al. demonstrate that standard neural networks can memorize completely random labels on CIFAR-10 and ImageNet, invalidating data-independent generalization bounds. [1]
  • 2019-01-01: Nagarajan and Kolter show empirically that spectral-norm bounds scale in the wrong direction (they grow with training set size rather than shrink), and prove in an overparameterized linear setting that uniform convergence cannot explain gradient descent generalization. [2]
  • 2022-01-01: A NeurIPS 2022 paper claims PAC-Bayes compression bounds can be made tight enough to explain generalization in neural networks, directly challenging the narrative that all known bounds are vacuous. [10][11][12]
  • 2025-01-01: CMU PhD blog post argues classical generalization theory is more predictive in foundation models than in conventional deep networks, suggesting the theory-failure diagnosis may be architecture-specific. [17]
  • 2026-04-25: LawrenceC publishes a critical review of Simon et al.'s 'learning mechanics' manifesto on the Alignment Forum, welcoming its ambition while doubting it will deliver a comprehensive or broadly useful theory. [3]
  • 2026-04-26: LawrenceC publishes 'The paper that killed deep learning theory,' providing detailed technical and historical context for why Zhang et al. 2016 was so devastating to the classical generalization-bound paradigm. [1]
  • 2026-04-27: LawrenceC publishes 'The other paper that killed deep learning theory,' narrating Nagarajan and Kolter 2019 as the definitive proof that uniform convergence cannot explain neural network generalization, and identifying the constraints any future theory must satisfy. [2]
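The random-label result in the 2016 entry can be illustrated with a toy analogue. The sketch below is not Zhang et al.'s experiment: it swaps deep networks trained by SGD on CIFAR-10 and ImageNet for a fixed random-feature model fitted by minimum-norm least squares on synthetic Gaussian inputs. The function name `random_label_demo` and all sizes are hypothetical choices made for illustration.

```python
import numpy as np


def random_label_demo(n_train=200, n_test=2000, dim=20, width=2000, seed=0):
    """Fit pure-noise labels with an overparameterized random-feature model.

    Toy stand-in (an assumption of this sketch), not the original experiment.
    Returns (train_accuracy, test_accuracy).
    """
    rng = np.random.default_rng(seed)

    # Gaussian inputs; labels are coin flips that are independent of the inputs.
    x_train = rng.standard_normal((n_train, dim))
    x_test = rng.standard_normal((n_test, dim))
    y_train = rng.choice([-1.0, 1.0], size=n_train)
    y_test = rng.choice([-1.0, 1.0], size=n_test)

    # Fixed random ReLU features, with far more features than training points.
    w = rng.standard_normal((dim, width)) / np.sqrt(dim)
    phi_train = np.maximum(x_train @ w, 0.0)
    phi_test = np.maximum(x_test @ w, 0.0)

    # For an underdetermined system, lstsq returns the minimum-norm solution,
    # which interpolates the training labels when the features have full row rank.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)

    train_acc = float(np.mean(np.sign(phi_train @ coef) == y_train))
    test_acc = float(np.mean(np.sign(phi_test @ coef) == y_test))
    return train_acc, test_acc


if __name__ == "__main__":
    train_acc, test_acc = random_label_demo()
    print(f"train accuracy: {train_acc:.3f}  test accuracy: {test_acc:.3f}")
```

The model fits the noise labels on the training set while test accuracy stays near chance. Any bound that depends only on the model class must cover this case as well, which is why it cannot certify good generalization for the same class on real data.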

Perspectives

LawrenceC (Alignment Forum)

Classical deep learning theory was irreparably broken by two landmark papers (Zhang et al. 2016; Nagarajan & Kolter 2019). The proposed replacement, learning mechanics, is a promising manifesto but has so far produced little practical fruit beyond hyperparameter scaling, explicitly does not aim to explain the specific algorithms learned by networks, and has not yet earned the title of a comprehensive theory of deep learning.

Evolution: Consistent across all three posts in the series; no external interlocutor has yet pushed back.

PAC-Bayes compression bounds researchers (NeurIPS 2022)

PAC-Bayes bounds can be made sufficiently tight to actually explain generalization in neural networks — the vacuousness of prior bounds was not a fundamental limit of the framework but an artifact of loose construction.

Evolution: Newly surfaced; no direct engagement with LawrenceC's critique yet apparent.
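To make "vacuous" versus "tight" concrete, here is a minimal sketch of a classical PAC-Bayes bound for losses in [0, 1], in the McAllester-style form with Maurer's constant. It is not the compression bound from the NeurIPS 2022 paper[10], whose construction is not described in the surfaced items. The function names, the training-set size, and the KL values are hypothetical.

```python
import math


def pac_bayes_bound(train_error: float, kl_nats: float, n: int, delta: float = 0.05) -> float:
    """Classical PAC-Bayes upper bound on expected error for losses in [0, 1].

    Computes train_error + sqrt((KL + ln(2 * sqrt(n) / delta)) / (2 * n)),
    where KL is the divergence between posterior and prior, in nats.
    """
    if not 0.0 <= train_error <= 1.0:
        raise ValueError("train_error must lie in [0, 1]")
    if kl_nats < 0.0:
        raise ValueError("KL divergence cannot be negative")
    if n < 8:
        raise ValueError("this form of the bound assumes n >= 8")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    slack = math.sqrt((kl_nats + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return train_error + slack


def is_vacuous(bound: float) -> bool:
    """An error bound of 1 or more carries no information for a 0-1 loss."""
    return bound >= 1.0


if __name__ == "__main__":
    n = 50_000  # hypothetical training-set size
    for kl in (5_000.0, 1_000_000.0):  # hypothetical KL values, in nats
        b = pac_bayes_bound(train_error=0.02, kl_nats=kl, n=n)
        print(f"KL={kl:>9.0f} nats  bound={b:.3f}  vacuous={is_vacuous(b)}")
```

The bound is informative only when the KL term is small relative to the number of training examples. This is why compression matters: a model that can be described in few bits relative to a prior yields a small KL term.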

CMU CSD PhD Blog

Classical generalization theory is more predictive for foundation models than for conventional deep networks, implying the theory-failure narrative may not apply uniformly across architectures and scales.

Evolution: Newly surfaced; potentially a significant counterpoint to LawrenceC's universal pessimism.

Tensions

  • Can learning mechanics, which focuses on average-case training dynamics and coarse aggregate statistics, ever constitute a comprehensive theory of deep learning — or is it structurally limited to explaining some aspects while leaving others (especially what specific algorithms individual networks learn) permanently outside its scope? [3]
  • Seven years after Nagarajan and Kolter precisely diagnosed the failure of uniform convergence, no satisfactory algorithm- and data-dependent generalization theory has emerged. Is the problem intractably hard, or merely neglected? [2]
  • Zhang et al.'s observation that memorization requires only 1.5–3.5x more training steps than generalization, and that regularization has minimal effect on memorization capacity, suggests that the mechanisms producing generalization are still not understood. Does the microscopic-complexity / macroscopic-simplicity framing from Nagarajan and Kolter actually explain this, or merely redescribe it? [1][2]
  • PAC-Bayes compression bounds are claimed to be tight enough to explain generalization, while a separate line of work argues tighter bounds always come at a price. Which claim holds, and does either constitute a satisfying theoretical explanation rather than a post-hoc certificate? [10][11][12][13]
  • If classical generalization theory is more predictive for foundation models than for conventional deep networks, is the theory-failure diagnosis architecture-specific — and does that change the urgency or direction of the theory-building program? [17]
  • Benign overfitting results show that interpolating classifiers can generalize under certain geometric conditions, but those conditions are typically verified only in simplified settings. Does benign overfitting theory scale to the overparameterized, multi-layer, non-linear regime of modern LLMs? [4][5][6][7]

Sources

  1. [1] The paper that killed deep learning theory — Alignment Forum (2026-04-26)
  2. [2] The other paper that killed deep learning theory — Alignment Forum (2026-04-27)
  3. [3] Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning" — Alignment Forum (2026-04-25)
  4. [4] Towards an Understanding of Benign Overfitting in Neural Networks — reactive:deep-learning-theory-limits
  5. [5] Rethinking Benign Overfitting in Two-Layer Neural Networks — reactive:deep-learning-theory-limits
  6. [6] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
  7. [7] Benign Overfitting without Linearity: Neural Network Classifiers ... — reactive:deep-learning-theory-limits
  8. [8] [R] Benign overfitting and Double Descent. - Reddit — reactive:deep-learning-theory-limits
  9. [9] How double descent breaks the shackles of the interpolation threshold — reactive:deep-learning-theory-limits
  10. [10] PAC-Bayes Compression Bounds So Tight That They Can Explain... — reactive:deep-learning-theory-limits
  11. [11] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
  12. [12] [PDF] PAC-Bayes Compression Bounds So Tight That They Can Explain ... — reactive:deep-learning-theory-limits
  13. [13] Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds — reactive:deep-learning-theory-limits
  14. [14] Neural tangent kernel - Wikipedia — reactive:deep-learning-theory-limits
  15. [15] [1806.07572] Neural Tangent Kernel: Convergence and Generalization in Neural Networks — reactive:deep-learning-theory-limits
  16. [16] The Neural Tangent Kernel in High Dimensions: Triple Descent and ... — reactive:deep-learning-theory-limits
  17. [17] CMU CSD PhD Blog - Classical generalization theory is more predictive in foundation models than in conventional deep networks — reactive:deep-learning-theory-limits