Frontier post-training recipe review with Finbarr Timbers
Interconnects · Nathan Lambert · 2026-06-16
Nathan Lambert and Finbarr Timbers survey the evolution of LLM post-training recipes from InstructGPT through 2026, arguing that Multi-teacher On-Policy Distillation has become the defining technique of frontier model development.
Extraction
Topics: post-training, llm-training-recipes, multi-teacher-distillation, reasoning-models, rlhf
Claims
- Post-training recipes have shifted from simple three-stage SFT→RM→RL pipelines to Multi-teacher On-Policy Distillation (MOPD) as the dominant frontier approach in 2026.
- MOPD trains N domain-specialist teacher models, then trains a general student via on-policy rollouts minimizing reverse-KL to the relevant teacher's output distribution token by token.
- MiMo Flash V2 introduced MOPD, which DeepSeek V4 and Nemotron 3 Ultra subsequently scaled to more than 10 domain-specialist teachers.
- DPO has largely disappeared from frontier recipes because cleaner on-policy training pipelines reduce the rough distributional edges that DPO previously corrected.
- Organizational complexity, not purely technical constraints, drives recipe divergence between academic open-source labs and frontier industry labs.
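The MOPD objective in the claims above (per-token reverse KL from the student to the teacher for the relevant domain) can be illustrated with a small sketch. This is not the implementation from any of the papers discussed. The function names (`log_softmax`, `reverse_kl`, `mopd_loss`), the dictionary-based teacher routing, and the use of a full-vocabulary KL rather than a sampled-token estimate are all assumptions made for illustration. In a real pipeline the student samples the rollout and both models score the same token positions; here the logits are passed in as plain lists.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    "Reverse" means the expectation is taken under the student's
    distribution, which fits on-policy training on student rollouts.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def mopd_loss(student_logits_per_token, teacher_logits_by_domain, domain):
    """Mean per-token reverse KL against the teacher chosen for `domain`.

    `teacher_logits_by_domain` is a hypothetical stand-in for N specialist
    teachers: it maps a domain name to that teacher's per-token logits on
    the same student-sampled rollout.
    """
    teacher_logits_per_token = teacher_logits_by_domain[domain]
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_per_token, teacher_logits_per_token)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy rollout: two token positions, vocabulary of size three.
    student = [[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]]
    teachers = {
        "math": [[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]],  # agrees with the student
        "code": [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]],   # disagrees with the student
    }
    for name in teachers:
        print(name, mopd_loss(student, teachers, name))
```

Routing by a single domain label per rollout is the simplest reading of "the relevant teacher"; how prompts are assigned to teachers in practice is not specified in this summary.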
Key quotes
The shape of a post-training recipe has changed more in the last year than in the prior three.
One key finding from our trials of doing on policy, multi-teacher on policy distillation is that teacher models trained with substantially different training pipelines cannot be effectively combined through a straightforward on policy distillation merge, resulting in suboptimal performance.
It's never stated very explicitly in these papers on like how the org chart impacts the recipe, but I would, I think it's a very strong signal within the, at least the delta between the fully open work and the kind of partially open work that you get from industry.