Synthetic document finetuning for instilling positive traits
Alignment Forum · CallumMcDougall · 2026-06-16
Google DeepMind's Language Model Interpretability team demonstrates that midtraining Gemini 3 Flash on synthetic documents, followed by supervised finetuning (SFT) on synthetic chat data, instills desired behavioral traits that hold up robustly on out-of-distribution safety evaluations. The team also shares a pipeline for detecting superficial data artifacts that can corrupt training.
Extraction
Topics: alignment, synthetic-data, finetuning, model-values, ood-generalization
Claims
- SFT on synthetic chat data demonstrating desired traits produces mild-to-significant improvements on alignment-based out-of-distribution safety evaluations for Gemini 3 Flash.
- Midtraining on synthetic pretraining-style documents more effectively instills stated knowledge of traits than SFT alone, but does not reliably produce behavioral alignment without subsequent SFT.
- Models trained on synthetic SFT data can inadvertently pick up superficial structural patterns (e.g., always asking for clarification) that do not surface in standard eval scores.
- Bounded DPO (BDPO) gave only marginal and inconsistent improvements over SFT and was harder to tune, so it was not worth the added complexity.
- Multi-turn adversarial evaluations are substantially more diagnostic than single-turn evals for detecting trait violations such as sycophancy under sustained pressure.
Key quotes
Our motivation is deep alignment: we want to train principles into the model which guide behaviour even in highly OOD behaviours.
Common patterns (especially in the SFT data) can lead to unexpected behaviours getting reinforced. Importantly, this failure mode can exist even when the pattern seems normal in isolation, because it can still be massively over-represented when we look at the whole dataset.
One important takeaway from our project was that we got positive results on knowledge evals before getting positive results on behavioural evals.
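The over-representation failure mode described in the claims and quotes above (a pattern that looks normal in one sample but dominates the dataset as a whole) can be illustrated with a simple frequency check. The sketch below is not the authors' pipeline, which this summary does not describe. The pattern names, regular expressions, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical surface patterns to audit; a real audit would use a much
# broader set, likely proposed by humans or a model reviewing samples.
PATTERNS = {
    "asks_clarification": re.compile(
        r"\b(could you clarify|can you clarify|what do you mean)\b", re.I
    ),
    "ends_with_question": re.compile(r"\?\s*$"),
    "opens_with_praise": re.compile(r"^\s*(great|good|excellent) question", re.I),
}


def pattern_rates(responses):
    """Return the fraction of responses that match each pattern."""
    counts = Counter()
    for text in responses:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    n = len(responses)
    return {name: (counts[name] / n if n else 0.0) for name in PATTERNS}


def flag_overrepresented(responses, threshold=0.5):
    """List patterns whose dataset-wide rate meets the (assumed) threshold."""
    rates = pattern_rates(responses)
    return sorted(name for name, rate in rates.items() if rate >= threshold)


if __name__ == "__main__":
    # Toy stand-in for assistant turns from a synthetic SFT dataset.
    sample = [
        "Could you clarify what you mean?",
        "Sure, here is the answer.",
        "Can you clarify the goal?",
        "What do you mean by that?",
    ]
    print(pattern_rates(sample))
    print(flag_overrepresented(sample))
```

The point of measuring rates over the whole dataset rather than inspecting individual samples is that each clarifying question above looks reasonable alone; only the aggregate rate reveals that the behavior would be reinforced far more often than intended.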