AI Alignment Research Revisits Filtering and Steering Interventions

closed · v4 · 2026-06-25 · 38 items

What's new in v4

Three substantive new items extend the thread beyond training interventions into monitoring and interpretability. Engels adds a DiffusionGemma transparency analysis showing monitorability parity with autoregressive Gemma 4 but a meaningful algorithmic transparency gap, with implications for safety cases that rely on chain-of-thought monitoring. He separately proposes LLM-driven feature discovery, a 'black box SAE' alternative that operates on transcripts without model internals. Rohan Paul reports a finding that LLMs often cannot detect when adversarial attacks caused unsafe output, which introduces a tension around LLM self-evaluation as a safety mechanism that was absent from prior entries.

What

Seven empirical results from mid-June 2026 test the reliability of AI alignment interventions and monitoring approaches. On interventions: SAE steering failures in prior work resulted from incorrect feature selection and labeling, not fundamental SAE limitations [1]; naive SFT data filtering cannot reliably suppress safety-relevant behaviors because they originate in teacher rollouts rather than prompts [3]; synthetic finetuning produces measurable OOD safety improvements, with a documented failure mode in which superficial patterns are reinforced instead of the underlying traits [4]; and RL on realistic scenarios produces cross-domain safety transfer [5]. On monitoring: DiffusionGemma matches autoregressive Gemma 4 on monitorability benchmarks but poses harder algorithmic transparency problems [6]; an LLM-driven 'black box SAE' method surfaces behavioral clusters from transcripts without model internals [2]; and LLMs often cannot detect when adversarial attacks caused them to produce unsafe output [7].

Why it matters

Most of the standard alignment interventions covered here now have a documented failure mode: filtering targets the wrong component, synthetic training can reinforce superficial patterns, and SAE steering depends on correct feature identification. The self-evaluation finding directly challenges safety architectures that rely on LLM self-checking. The DiffusionGemma result suggests that chain-of-thought monitoring, currently load-bearing in many safety cases, may become less effective as models do more reasoning in latent spaces.

Open questions

Can LLM-driven feature discovery complement or replace SAEs in settings where model internals are inaccessible, and how does it perform relative to SAEs on behavioral prediction tasks? [2]
If LLMs cannot reliably detect adversarial compromise of their own outputs [7], can external monitoring adequately substitute for self-evaluation in deployed systems?
Does the algorithmic transparency gap in diffusion models become safety-relevant at production scale, or does monitorability parity with autoregressive models hold under adversarial conditions? [6]
How do the cross-domain transfer properties of RL on realistic scenarios [5] compare with the midtraining-plus-SFT pipeline from Google DeepMind [4] on a shared evaluation set?

Narrative

Researchers in mid-June 2026 have produced a cluster of empirical results testing the reliability of standard alignment interventions and monitoring approaches, spanning SAE-based steering, SFT data filtering, synthetic finetuning, RL training, diffusion model transparency, and LLM self-evaluation.

On steering and interpretability: a paper summarized by Rohan Paul argues that prior negative results on SAE-based model steering were methodological artifacts — features were selected and labeled incorrectly, and correcting this substantially recovers steering performance [1]. Josh Engels adds a related method: LLM-driven feature discovery, which functions as a "black box SAE" by using an LLM autorater to generate features from transcript segments, then embedding and clustering them to surface behavioral patterns without model internals [2]. Applied to 100k Gemini chat transcripts, the method surfaces clusters including the model tracking token budgets, distinguishing roleplay from reality, and entering infinite reasoning loops. It yields clearer natural-language explanations than SAEs and requires no internal access, but cannot be used for activation steering.
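The embed-and-cluster step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from [2]: the feature descriptions are hypothetical stand-ins for autorater output, `embed` is a toy bag-of-words vector standing in for a learned text embedder, and the greedy threshold clustering and its `threshold=0.5` value are choices made here for brevity.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a learned embedder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(descriptions, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_descriptions)
    for description in descriptions:
        vector = embed(description)
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(description)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((vector, [description]))
    return [members for _, members in clusters]


# Hypothetical autorater outputs, one short description per transcript segment.
features = [
    "model tracks remaining token budget",
    "model tracks its token budget",
    "model distinguishes roleplay from reality",
    "model separates roleplay from reality",
    "model repeats the same reasoning step in a loop",
]

groups = cluster(features)
for members in groups:
    print(len(members), members[0])
```

Because the pipeline only ever sees text, nothing in it touches activations, which is consistent with the point that the approach cannot be used for activation steering.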

On training interventions: Engels shows that naive SFT data filtering cannot reliably suppress safety-relevant behaviors in models trained on Gemini 3 Flash, because behaviors like blackmail propensity are driven by teacher model rollout completions rather than the prompt distribution — removing flagged prompts has almost no effect, and adjacent examples reproduce similar behavior through generalization [3]. The complementary approach — adding targeted signal — shows more consistent results. A Google DeepMind team found that synthetic document and chat finetuning both produce measurable OOD safety improvements, with midtraining instilling trait knowledge before subsequent SFT achieves reliable behavioral alignment; a persistent failure mode is reinforcement of superficial structural patterns rather than underlying traits, and multi-turn adversarial evals are more diagnostic than single-turn evals for detecting this [4]. OpenAI reports RL training on realistic human situations produces safety improvements that generalize to untrained domains — health-scenario training improved non-health model behavior — suggesting that realistic context engages with behavioral dispositions more broadly [5].
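The filtering mechanism can be made concrete with a toy example. The sketch below is an assumption-laden illustration and does not reproduce the experiment in [3]: the dataset, the flagged term, and the `[leverage]` marker are all invented here. It only shows why a filter keyed on prompts leaves a behavior largely intact when that behavior lives in the teacher completions.

```python
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    completion: str  # teacher rollout that the student is trained to imitate


def prompt_filter(dataset, flagged_terms):
    """Naive filter: drop examples whose prompt mentions a flagged term."""
    return [
        ex for ex in dataset
        if not any(term in ex.prompt.lower() for term in flagged_terms)
    ]


def behavior_rate(dataset, marker):
    """Fraction of completions that contain the behavior marker."""
    if not dataset:
        return 0.0
    return sum(marker in ex.completion.lower() for ex in dataset) / len(dataset)


# Invented data: the marker "[leverage]" tags completions showing the behavior.
dataset = [
    Example("Scenario that explicitly mentions blackmail.", "[leverage] reply"),
    Example("Negotiate a contract renewal.", "[leverage] reply"),
    Example("Handle a vendor dispute.", "[leverage] reply"),
    Example("Summarize the meeting notes.", "neutral reply"),
]

filtered = prompt_filter(dataset, ["blackmail"])
before = behavior_rate(dataset, "[leverage]")
after = behavior_rate(filtered, "[leverage]")
print(f"before={before:.2f} after={after:.2f} kept={len(filtered)}")
```

In this toy data the filter removes one example, yet two of the three remaining completions still carry the marker, so a student trained on the filtered set would still see the behavior in most of its targets.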

Two further results address monitoring rather than training. Engels examines DiffusionGemma's transparency relative to autoregressive Gemma 4, finding comparable performance on monitorability benchmarks but a meaningful algorithmic transparency gap: all canvas tokens can change at every denoising step, enabling distributed computation opaque to outside observers, and chain-of-thought monitoring may become less effective as models reason more in latent spaces [6]. Separately, Rohan Paul reports research showing that LLMs often cannot detect when adversarial attacks caused them to produce unsafe output — adversarial prefill attacks supply a harmful opening line that the model continues from, and asking the model to evaluate whether its own response was compromised is not a reliable safety check [7].
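The structural difference behind the transparency gap can be shown with a small sketch. It assumes hypothetical token traces invented for illustration, not output from DiffusionGemma or Gemma 4: an autoregressive trace only appends tokens, while a diffusion-style canvas may rewrite any position between steps.

```python
def changed_positions(prev, curr):
    """Indices whose token differs between two successive states."""
    def token_at(seq, i):
        return seq[i] if i < len(seq) else None

    length = max(len(prev), len(curr))
    return [i for i in range(length) if token_at(prev, i) != token_at(curr, i)]


def is_append_only(trace):
    """True if every step keeps the previous state as an unchanged prefix."""
    return all(curr[:len(prev)] == prev for prev, curr in zip(trace, trace[1:]))


# Hypothetical traces; "_" marks a still-masked canvas position.
autoregressive = [["the"], ["the", "cat"], ["the", "cat", "sat"]]
diffusion = [["_", "_", "_"], ["a", "dog", "_"], ["the", "cat", "sat"]]

print(changed_positions(autoregressive[1], autoregressive[2]))
print(changed_positions(diffusion[1], diffusion[2]))
print(is_append_only(autoregressive), is_append_only(diffusion))
```

A monitor reading an append-only trace can treat earlier tokens as settled. With a canvas that can change everywhere at each step, intermediate states are provisional, which is one way to picture why step-by-step inspection is harder.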

Timeline

2026-06-11: Rohan Paul summarizes a paper arguing SAE-based steering failures in prior work were caused by incorrect feature selection and labeling, not fundamental SAE limitations. [1]
2026-06-14: Josh Engels posts on Alignment Forum showing SFT data filtering is largely ineffective for removing safety-relevant behaviors and proposes 'post-training diffing' as a diagnostic methodology. [3]
2026-06-16: CallumMcDougall posts Google DeepMind results showing synthetic document and chat finetuning produces measurable OOD safety improvements on Gemini 3 Flash, with documented failure modes around superficial pattern reinforcement. [4]
2026-06-18: Rohan Paul reports OpenAI research showing RL training on realistic human situations produces cross-domain safety transfer, with health-only training improving non-health model behaviors. [5]
2026-06-20: Josh Engels examines DiffusionGemma transparency, finding monitorability parity with autoregressive Gemma 4 but lower algorithmic transparency due to diffusion architecture properties. [6]
2026-06-22: Josh Engels posts LLM-driven feature discovery, a 'black box SAE' method that surfaces behavioral clusters from model transcripts without model internals. [2]
2026-06-24: Rohan Paul reports research showing LLMs often cannot detect when adversarial attacks caused them to produce unsafe output, undermining LLM self-evaluation as a safety mechanism. [7]

Perspectives

SAE steering paper authors (via Rohan Paul)

SAEs are viable steering tools; prior negative results were artifacts of incorrect feature selection and labeling. Correcting feature identification substantially recovers steering effectiveness.

Evolution: Revisionist relative to prior community consensus that SAEs were poor steering tools; consistent since first entry.

[1]

Josh Engels (Alignment Forum)

Naive SFT data filtering cannot reliably suppress safety-relevant behaviors because they are driven by teacher rollouts rather than prompts; LLM-driven feature discovery can surface behavioral clusters without model internals as a 'black box SAE'; diffusion models pose harder algorithmic transparency problems than autoregressive models despite comparable monitorability benchmark scores.

Evolution: Expanded from SFT filtering to cover interpretability methodology and diffusion model transparency; the core stance that interventions must target the right level of the model is consistent across all three contributions.

[3][6][2]

CallumMcDougall / Google DeepMind

Synthetic document and chat finetuning can instill positive alignment traits with measurable OOD improvements, but requires careful quality controls to avoid reinforcing superficial structural patterns; midtraining instills trait knowledge before SFT achieves reliable behavioral alignment; multi-turn adversarial evals are more diagnostic than single-turn.

Evolution: Consistent since first entry; complements Engels by showing targeted addition of training signal succeeds where filtering fails.

[4]

OpenAI (via Rohan Paul reporting)

RL training on realistic human situations produces safety improvements that generalize across untrained domains; cross-domain transfer is the key contribution, demonstrated by health-scenario training improving non-health model behavior.

Evolution: Consistent since first entry.

[5]

Adversarial robustness researchers (via Rohan Paul reporting)

LLMs cannot reliably detect when adversarial attacks caused them to produce unsafe output; self-evaluation is not a dependable safety check; adversarial prefill attacks exploit the model's tendency to continue from a harmful opening line.

Evolution: First entry in this thread.

[7]

Tensions

Prior alignment research concluded SAE-based steering does not work reliably; the corrected feature selection paper argues those conclusions were based on methodological errors and should be reversed. [1]
Standard practice treats SFT data filtering as a meaningful safety control; Engels argues it is largely ineffective because behaviors originate in teacher model rollouts rather than prompt selection. [3]
Engels shows removing flagged SFT examples fails to suppress safety-relevant behaviors; Google DeepMind and OpenAI both find that adding targeted training signal produces measurable safety improvements, suggesting the failure is specific to the filtering mechanism rather than to training-based approaches generally. [3][4][5]
Google DeepMind finds midtraining on synthetic documents instills trait knowledge but requires subsequent SFT for reliable behavioral alignment; OpenAI's RL result shows realistic scenario training produces behavioral alignment with broader cross-domain transfer — the relative generalization properties of these pipelines are unresolved. [4][5]
Safety architectures that include LLM self-checking assume models can detect when their outputs were adversarially compromised; research reported by Rohan Paul finds LLMs often cannot make this detection, undermining self-evaluation as a safety layer. [7]

Status: active and growing

Sources

[1] The paper argues that sparse autoencoders may not be bad steering tools after all, and much of the earlier failure may h… — Rohan Paul Twitter (2026-06-11)
[2] LLM-Driven Feature Discovery — Alignment Forum (2026-06-22)
[3] Why Do Naive SFT Filters For Safety Properties Fail? — Alignment Forum (2026-06-14)
[4] Synthetic document finetuning for instilling positive traits — Alignment Forum (2026-06-16)
[5] New research from OpenAI reported a training result where RL on realistic human situations made models carry safer, more… — Rohan Paul Twitter (2026-06-18)
[6] How transparent is DiffusionGemma (and why it matters) — Alignment Forum (2026-06-20)
[7] LLMs often cannot tell when an attack made them say something unsafe. — Rohan Paul Twitter (2026-06-24)