The Information Machine

The paper that killed deep learning theory

Alignment Forum · LawrenceC · 2026-04-26

Extraction

Topics: deep-learning-theory, generalization-bounds, statistical-learning-theory, neural-network-memorization, zhang-et-al-2016

Claims

  • Zhang et al. 2016 demonstrated that standard neural network architectures trained with standard procedures can memorize completely random labels on CIFAR-10 and ImageNet, achieving near-zero training loss.
  • Because the same network class and training algorithm can either generalize or memorize depending solely on label correctness, data-independent complexity measures cannot explain neural network generalization.
  • Training on random labels takes only 1.5–3.5x as many steps as training on true labels, suggesting that the capacity to memorize is always present in standard architectures.
  • Explicit regularization techniques including data augmentation, weight decay, and dropout have minimal effect on both test accuracy and a model's ability to memorize random labels, undermining norm-based generalization explanations.
  • The paper's observations about overparameterized linear regression hinted at phenomena later studied as double descent.
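The memorization and overparameterized-regression claims above can be illustrated with a small sketch. This is not the paper's experiment: it swaps the CIFAR-10 and ImageNet networks for a minimum-norm linear model on synthetic Gaussian data, and the names `min_norm_fit` and `run_demo`, along with all sizes and the seed, are hypothetical choices made for illustration. The assumption is only that with more features than training points, a linear model can interpolate any labeling, true or random.

```python
import numpy as np


def min_norm_fit(X, y):
    """Return the minimum-L2-norm least-squares solution via the pseudoinverse."""
    return np.linalg.pinv(X) @ y


def run_demo(n_train=200, n_features=500, n_test=1000, seed=0):
    """Fit true and random labels with the same overparameterized linear model.

    Returns a dict mapping label type to (train accuracy, test accuracy),
    where test accuracy is always measured against the true test labels.
    """
    rng = np.random.default_rng(seed)

    # Synthetic stand-in for a real dataset (assumption, not from the paper).
    w_true = rng.normal(size=n_features)
    X_train = rng.normal(size=(n_train, n_features))
    X_test = rng.normal(size=(n_test, n_features))

    y_true = np.sign(X_train @ w_true)                # labels with real structure
    y_test = np.sign(X_test @ w_true)
    y_rand = rng.choice([-1.0, 1.0], size=n_train)    # labels with no structure

    results = {}
    for name, y in [("true", y_true), ("random", y_rand)]:
        w = min_norm_fit(X_train, y)
        train_acc = float(np.mean(np.sign(X_train @ w) == y))
        test_acc = float(np.mean(np.sign(X_test @ w) == y_test))
        results[name] = (train_acc, test_acc)
    return results


if __name__ == "__main__":
    for name, (train_acc, test_acc) in run_demo().items():
        print(f"{name:>6} labels: train acc {train_acc:.2f}, test acc {test_acc:.2f}")
```

Both runs use the same model class and the same fitting procedure, and both reach perfect training accuracy, so training accuracy alone cannot distinguish the two cases. Only the held-out accuracy differs, which is the gap that data-independent complexity measures do not account for.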

Key quotes

Deep neural networks easily fit random labels.
The authors' results show that the same class of neural networks, trained with the same learning algorithm, can generalize when given true labels and memorize random ones. This shows that the hypothesis class of neural networks that are learnable with standard techniques cannot be simple in any useful sense.
By all the metrics – including both VC Dimension and Rademacher complexity – even a simple MLP with sigmoidal or ReLU activations represents far too complicated a hypothesis class to not immediately overfit on the training data. Yet, not only were neural networks performing better than other machine learning algorithms...
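For reference, the Rademacher complexity named in the quote above has a standard definition that makes the random-label result concrete. The notation here (a sample S of n points, a hypothesis class H with outputs in {-1, +1}, and random signs σ_i) follows the usual textbook convention and is not taken from the post:

```latex
\hat{\mathfrak{R}}_S(\mathcal{H})
  = \mathbb{E}_{\sigma}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i) \right],
  \qquad \sigma_i \sim \mathrm{Uniform}\{-1, +1\}
```

If the class can fit almost any random ±1 labeling of the sample, the supremum is close to 1 for nearly every draw of σ, so the complexity sits near its maximum value of 1 and generalization bounds built on it become vacuous.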