SFT Drives Gemini’s Safety Properties

Alignment Forum · Josh Engels · 2026-06-13

Google DeepMind's Language Model Interpretability team reports that Gemini's safety-relevant behaviors appear to be shaped primarily by pretraining and supervised fine-tuning (SFT) rather than by reinforcement learning (RL). The result ran counter to the team's initial expectations, and the team plans to try intervening at the SFT stage in the future.

Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: ai-safety, supervised-fine-tuning, gemini, llm-training, alignment

Claims

Most safety-relevant properties in Gemini models are caused by the combination of pretraining and SFT, not reinforcement learning stages.
SFT-only versions of Gemini 3.1 Pro and Gemini 3 Flash perform remarkably similarly to full production models across safety benchmarks.
SFT is a high-leverage intervention point for improving model safety in the Gemini model family.
This finding was counter to the team's initial expectations and may not generalize to other model families or future Gemini versions.

Key quotes

most safety relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not other training stages like RL.

The main result is that the blue bars (SFT-only models) and orange bars (production models) are remarkably similar across evals.

An important implication is that for Gemini, SFT is a high leverage place to intervene for model safety and behavior, and we plan to try to intervene here in the future.