Scaling Laws, Carefully

Lilian Weng Blog · Lilian Weng · 2026-06-24

Lilian Weng's technical blog post explains scaling laws in deep learning: training loss falls as a predictable power law in model size, dataset size, and compute. The post frames scaling laws as a compute-allocation problem.


Extraction

Topics: scaling-laws, deep-learning, compute-efficiency, training-optimization

Claims

Training loss decreases predictably as model size, dataset size, and compute scale up, following a power-law curve visible as a straight line on a log-log plot.
Scaling laws provide a unified framework for describing the relationship between compute, loss, model size, and data.
At their core, scaling laws are about optimally allocating compute budget between model size and dataset size.
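
To illustrate the first claim, here is a minimal sketch of why a power law shows up as a straight line on a log-log plot: taking logs of L = a * N^(-alpha) gives log L = log a - alpha * log N, so an ordinary least-squares line in log space recovers the exponent. The function name `fit_power_law`, the constants, and the data points are all hypothetical and synthetic; none of them come from the post.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx                      # slope of the log-log line
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope     # (a, alpha)

# Synthetic, noise-free data generated from made-up constants (assumption).
TRUE_A, TRUE_ALPHA = 10.0, 0.08
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in model_sizes]

a_hat, alpha_hat = fit_power_law(model_sizes, losses)
print(f"a = {a_hat:.4f}, alpha = {alpha_hat:.4f}")
```

With real training runs the points are noisy and the curve often flattens toward an irreducible loss, so a pure straight-line fit like this is only a first approximation.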

Key quotes

The training loss L decreases predictably as we scale up model size N, dataset size D, and compute C, following a power-law curve, which appears as a straight line on a log-log plot.

We can view scaling laws as a framework for describing the relationship between compute, loss, model size and data; at its core, it is about how to allocate precious compute optimally between N and D.
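
The allocation question in the quote above can be made concrete with a small worked example. The sketch below rests on two assumptions that are not stated in this summary: a parametric loss of the form L(N, D) = E + A / N^alpha + B / D^beta, and the common rule of thumb that training compute is roughly C = 6 * N * D FLOPs. Under those assumptions, substituting D = C / (6N) and setting the derivative with respect to N to zero gives N* = (alpha * A / (beta * B))^(1 / (alpha + beta)) * (C / 6)^(beta / (alpha + beta)). All constants are made up for illustration and are not fitted values.

```python
# Illustrative constants only (assumption, not fitted to any real model).
E, A, B = 1.7, 400.0, 400.0
ALPHA, BETA = 0.3, 0.3

def loss(n, d):
    """Assumed parametric loss: irreducible term plus model and data terms."""
    return E + A / n ** ALPHA + B / d ** BETA

def optimal_allocation(compute):
    """Minimize loss(n, d) subject to the assumed constraint 6 * n * d = compute."""
    ratio = (ALPHA * A) / (BETA * B)
    n_opt = ratio ** (1.0 / (ALPHA + BETA)) * (compute / 6.0) ** (BETA / (ALPHA + BETA))
    d_opt = compute / (6.0 * n_opt)   # spend the remaining budget on data
    return n_opt, d_opt

budget = 6e18  # hypothetical FLOP budget
n_opt, d_opt = optimal_allocation(budget)
print(f"N* = {n_opt:.3e} parameters, D* = {d_opt:.3e} tokens")
print(f"loss at optimum     = {loss(n_opt, d_opt):.4f}")
print(f"loss at 2N*, D*/2   = {loss(2 * n_opt, d_opt / 2):.4f}")
```

Because the two exponents and coefficients are chosen equal here, the budget splits evenly between N and D; with unequal exponents the same formula tilts the split toward whichever term decays more slowly.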