The paper that killed deep learning theory
Alignment Forum · LawrenceC · 2026-04-26
Extraction
Topics: deep-learning-theory, generalization-bounds, statistical-learning-theory, neural-network-memorization, zhang-et-al-2016
Claims
- Zhang et al. 2016 demonstrated that standard neural network architectures trained with standard procedures can memorize completely random labels on CIFAR-10 and ImageNet, achieving near-zero training loss.
- Because the same network class and training algorithm can either generalize or memorize depending solely on label correctness, data-independent complexity measures cannot explain neural network generalization.
- Training on random labels requires only 1.5–3.5x more steps than training on true labels, showing that memorization capacity is always latent in standard architectures.
- Explicit regularization techniques including data augmentation, weight decay, and dropout have minimal effect on both test accuracy and a model's ability to memorize random labels, undermining norm-based generalization explanations.
- The paper's observations about overparameterized linear regression hinted at phenomena later studied as double descent.
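The random-label test and the overparameterized linear regression case in the claims above can be illustrated together with a toy sketch. It is not the paper's experiment: it swaps the neural network for a minimum-norm linear interpolator on synthetic Gaussian data, and every name and size in it (`min_norm_fit`, `run_randomization_test`, 200 training points, 400 features) is an assumption made for illustration.

```python
import numpy as np


def min_norm_fit(X, y):
    # Minimum-norm least-squares solution. With more features than samples
    # and full row rank, it fits the training labels exactly.
    return np.linalg.pinv(X) @ y


def run_randomization_test(n_train=200, n_test=2000, dim=400, seed=0):
    # Hypothetical setup: labels come from a random linear "teacher".
    rng = np.random.default_rng(seed)
    w_teacher = rng.normal(size=dim)
    X_train = rng.normal(size=(n_train, dim))
    X_test = rng.normal(size=(n_test, dim))
    y_true = np.sign(X_train @ w_teacher)
    y_test = np.sign(X_test @ w_teacher)
    # Random labels carry no information about the inputs.
    y_random = rng.choice([-1.0, 1.0], size=n_train)

    results = {}
    for name, labels in [("true", y_true), ("random", y_random)]:
        w = min_norm_fit(X_train, labels)
        train_acc = float(np.mean(np.sign(X_train @ w) == labels))
        # Test accuracy is always measured against the true labels.
        test_acc = float(np.mean(np.sign(X_test @ w) == y_test))
        results[name] = (train_acc, test_acc)
    return results


if __name__ == "__main__":
    for name, (train_acc, test_acc) in run_randomization_test().items():
        print(f"{name:>6} labels: train acc {train_acc:.3f}, test acc {test_acc:.3f}")
```

Both fits should reach perfect training accuracy because the model has more parameters than training points. Only the fit to true labels should score clearly above chance on held-out data, which is the pattern the claims describe: same model class, same fitting procedure, different outcome depending on the labels.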
Key quotes
Deep neural networks easily fit random labels.
The authors' results show that the same class of neural networks, trained with the same learning algorithm, can generalize when given true labels and memorize random ones. This shows that the hypothesis class of neural networks that are learnable with standard techniques cannot be simple in any useful sense.
By all the metrics – including both VC Dimension and Rademacher complexity – even a simple MLP with sigmoidal or ReLU activations represents far too complicated a hypothesis class to not immediately overfit on the training data. Yet, not only were neural networks performing better than other machine learning algorithms...
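The Rademacher complexity argument behind these quotes can be stated in one line. The notation below (a hypothesis class $\mathcal{H}$ with outputs in $\{-1,+1\}$, a sample $x_1, \dots, x_n$, and random signs $\sigma_i$) is standard textbook notation and an assumption here, not taken from the post.

```latex
\hat{\mathfrak{R}}_n(\mathcal{H})
  = \mathbb{E}_{\sigma}\!\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i) \right],
\qquad \sigma_i \in \{-1, +1\} \text{ i.i.d. uniform}
```

If the class can fit essentially any random labeling of the sample, the supremum is close to 1 for almost every draw of $\sigma$, so $\hat{\mathfrak{R}}_n(\mathcal{H}) \approx 1$, the largest value it can take. A generalization bound built on this quantity then says nothing useful, even though the same networks generalize on true labels.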