Frontier post-training recipe review with Finbarr Timbers

Interconnects · Nathan Lambert · 2026-06-16

Nathan Lambert and Finbarr Timbers survey the evolution of LLM post-training recipes from InstructGPT through 2026, arguing that Multi-teacher On-Policy Distillation has become the defining technique of frontier model development.


Appears in

LLM Efficiency Breakthroughs: Small Models and Sparse Architectures Challenge Scale Assumptions

Extraction

Topics: post-training, llm-training-recipes, multi-teacher-distillation, reasoning-models, rlhf

Claims

Post-training recipes have shifted from simple three-stage SFT→RM→RL pipelines to Multi-teacher On-Policy Distillation (MOPD) as the dominant frontier approach in 2026.
MOPD first trains N domain-specialist teacher models, then trains a single general student on its own on-policy rollouts, minimizing the token-level reverse KL divergence from the student's output distribution to that of the teacher for the rollout's domain.
MiMo Flash V2 introduced MOPD, which DeepSeek V4 and Nemotron 3 Ultra subsequently scaled to more than 10 domain-specialist teachers.
DPO has largely disappeared from frontier recipes because cleaner on-policy training pipelines reduce the rough distributional edges that DPO previously corrected.
Organizational complexity, not purely technical constraints, drives recipe divergence between academic open-source labs and frontier industry labs.
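
The token-level reverse-KL objective described in the MOPD claim above can be illustrated with a minimal sketch. This is not the implementation from any of the cited models. It assumes the student's rollout has already been sampled and that both student and teacher logits for each position are available as plain Python lists. The names `log_softmax`, `reverse_kl`, `mopd_token_losses`, and the `teacher_logits_by_domain` mapping are hypothetical and exist only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single token position.

    "Reverse" means the expectation is taken under the student's
    distribution, so the student is penalized for placing mass where
    the teacher places little.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def mopd_token_losses(student_logits, teacher_logits_by_domain, domain):
    """Per-token reverse KL against the teacher selected for `domain`.

    student_logits: one logit vector per position of a student-sampled rollout.
    teacher_logits_by_domain: hypothetical mapping from a domain name to that
        teacher's logit vectors scored on the same rollout positions.
    """
    teacher_logits = teacher_logits_by_domain[domain]
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    return [reverse_kl(s, t) for s, t in zip(student_logits, teacher_logits)]


def mopd_loss(student_logits, teacher_logits_by_domain, domain):
    """Mean of the per-token reverse-KL terms for one rollout."""
    losses = mopd_token_losses(student_logits, teacher_logits_by_domain, domain)
    return sum(losses) / len(losses)
```

In a real training loop the logits would be tensors, the rollout would be sampled from the current student at every step, and gradients would flow only through the student. The sketch shows only how each rollout is scored against a single domain-matched teacher rather than an average over all teachers.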

Key quotes

The shape of a post-training recipe has changed more in the last year than in the prior three.

One key finding from our trials of doing on policy, multi-teacher on policy distillation is that teacher models trained with substantially different training pipelines cannot be effectively combined through a straightforward on policy distillation merge, resulting in suboptimal performance.

It's never stated very explicitly in these papers on like how the org chart impacts the recipe, but I would, I think it's a very strong signal within the, at least the delta between the fully open work and the kind of partially open work that you get from industry.