The Information Machine

Scaling Laws, Carefully

Lilian Weng Blog · Lilian Weng · 2026-06-24

Lilian Weng's technical blog post explains scaling laws in deep learning: training loss follows a predictable power law in model size, dataset size, and compute. The post frames scaling laws as a question of how to allocate a compute budget between model size and dataset size.


Extraction

Topics: scaling-laws, deep-learning, compute-efficiency, training-optimization

Claims

  • Training loss decreases predictably as model size, dataset size, and compute scale up, following a power-law curve visible as a straight line on a log-log plot.
  • Scaling laws provide a unified framework for describing the relationship between compute, loss, model size, and data.
  • At their core, scaling laws are about optimally allocating compute budget between model size and dataset size.
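The first claim can be illustrated with a small sketch: if loss follows a power law, its logarithm is linear in the logarithm of the scale variable, so an ordinary straight-line fit in log-log space recovers the exponent. The data below is synthetic, and the function name `fit_power_law` and the constants 10.0 and 0.08 are made up for illustration; none of them come from the post.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ≈ a * x**(-alpha) by least squares in log-log space.

    Taking logs gives log(loss) = log(a) - alpha * log(x), which is a
    straight line with slope -alpha and intercept log(a).
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, noise-free data (illustrative values only, not measurements).
n_params = np.array([1e6, 1e7, 1e8, 1e9])   # hypothetical model sizes N
true_a, true_alpha = 10.0, 0.08              # assumed constants
loss = true_a * n_params ** (-true_alpha)

a_hat, alpha_hat = fit_power_law(n_params, loss)
print(f"a = {a_hat:.4f}, alpha = {alpha_hat:.4f}")
```

On real training runs the points are noisy and the curve often flattens toward an irreducible loss, so a pure straight-line fit is only a first approximation.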

Key quotes

The training loss L decreases predictably as we scale up model size N, dataset size D, and compute C, following a power-law curve, which appears as a straight line on a log-log plot.
We can view scaling laws as a framework for describing the relationship between compute, loss, model size and data; at its core, it is about how to allocate precious compute optimally between N and D.
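
To make the allocation question concrete, here is a worked sketch under two assumptions that are not stated in this summary: a parametric loss of the form commonly used in the scaling-law literature, and the common approximation that training compute is about 6ND FLOPs. The constants E, A, B, α, β are placeholders to be fitted from experiments.

```latex
% Assumed parametric loss and compute constraint (not taken from the summary above)
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \approx 6 N D .

% Substitute D = C / (6N) and set the derivative with respect to N to zero:
\frac{\partial}{\partial N}\left[ \frac{A}{N^{\alpha}}
  + B \left( \frac{6N}{C} \right)^{\beta} \right] = 0
\;\Longrightarrow\;
N^{\alpha + \beta} = \frac{\alpha A}{\beta B} \left( \frac{C}{6} \right)^{\beta} .

% Compute-optimal allocation:
N_{\mathrm{opt}}(C) = G \left( \frac{C}{6} \right)^{\frac{\beta}{\alpha + \beta}},
\qquad
D_{\mathrm{opt}}(C) = G^{-1} \left( \frac{C}{6} \right)^{\frac{\alpha}{\alpha + \beta}},
\qquad
G = \left( \frac{\alpha A}{\beta B} \right)^{\frac{1}{\alpha + \beta}} .
```

Under these assumptions, both the optimal model size and the optimal dataset size are themselves power laws in compute, and the exponents α and β determine how each extra unit of compute is split between N and D.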