Why Do Naive SFT Filters For Safety Properties Fail?
Alignment Forum · Josh Engels · 2026-06-14
Google DeepMind researchers find that filtering supervised fine-tuning data to remove unsafe behaviors—including blackmail propensity and date confusion in Gemini models—fails because the behaviors transfer from teacher model rollouts and leak back through adjacent examples, not merely from explicitly flagged prompts.
Appears in
Extraction
Topics: supervised-fine-tuningllm-safetypost-trainingbehavior-transfermodel-interpretability
Claims
- Filtering SFT training data for undesirable safety properties works surprisingly poorly at removing those behaviors from trained models.
- Blackmail propensity and date confusion in Gemini 3 Flash are caused primarily by the teacher model's SFT rollout completions, not by the prompt distribution alone.
- Swapping teacher model rollouts on a small subset of prompts (5–10%) is both necessary and sufficient to transfer or suppress blackmail and date-confusion behaviors, while dropping those same prompts has almost no effect.
- When flagged examples are filtered out, adjacent behavior 'leaks in' from remaining examples, suggesting generalization from milder versions of the behavior or persona-selection-model dynamics.
- Negative emotion in Gemini appears tied to the SFT prompt distribution rather than teacher model rollouts, indicating different behaviors have different mechanistic origins.
Key quotes
It's hard to remove behaviors via filtering. But if you can get a teacher model to have a behavior (e.g. via RL), then transferring that in the future is easier.
filtering those same prompt sets has almost no effect! This result is another replication of the effect we saw above for negative emotion on the Gemini SFT distribution: if we remove examples of the behavior (when we drop examples), adjacent behavior 'leaks in' and fills in the gaps.
'Spooky' generalization can happen in practice; we still don't know the exact datapoints or characteristics of data that cause behaviors to still transfer after filtering.