Great Stanford + MIT + Harvard + Anthropic paper.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-08

A joint Stanford, MIT, Harvard, and Anthropic paper concludes that larger AI models acquire rare skills smaller models miss because their greater capacity causes less forgetting of weakly-learned abilities during training.


Extraction

Topics: llm-scaling, emergent-abilities, training-dynamics, model-capacity, machine-learning-theory

Claims

Larger AI models learn rare skills that smaller models fail to acquire.
The mechanism behind this gap is differential forgetting: larger models forget weakly-learned signals less than smaller models do.
Extra model capacity functions as a buffer that protects rare, weakly-reinforced abilities during training.
This paper provides a training-based mechanistic explanation for capability emergence at scale.
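
The claims above can be illustrated with a toy simulation. This is only a sketch and not the paper's model, which the post does not describe. It assumes a rare skill is reinforced once every `period` steps and decays on every other step, and that the per-step forgetting rate shrinks as `interference / capacity`. The function name, the parameters, and the 1/capacity form are all hypothetical choices made for illustration.

```python
def simulate_rare_skill(capacity, steps=10_000, period=100, lr=0.2, interference=1.0):
    """Return the final strength (0 to 1) of a rarely reinforced skill.

    Toy assumption (not from the paper): the per-step forgetting rate is
    interference / capacity, so larger models forget more slowly.
    """
    forget_rate = min(1.0, interference / capacity)
    strength = 0.0
    for step in range(steps):
        if step % period == 0:
            # Rare example seen: move the skill a fraction lr toward mastery.
            strength += lr * (1.0 - strength)
        else:
            # Common data seen: interference erodes the weakly learned skill.
            strength *= 1.0 - forget_rate
    return strength


small = simulate_rare_skill(capacity=10)     # hypothetical "small model"
large = simulate_rare_skill(capacity=1000)   # hypothetical "large model"
print(f"small model skill strength: {small:.6f}")
print(f"large model skill strength: {large:.6f}")
```

Under these assumptions both models see the rare skill equally often, yet only the larger one retains it, because less of each reinforcement is lost before the next one arrives. The sketch reproduces the qualitative claim only; it says nothing about the magnitudes in real models.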

Key quotes

Says bigger AI models learn rare skills because they forget them less during training, their extra space protects weak learning