Can activation verbalizers surface an internal chain of thought?

Alignment Forum · oakhu · 2026-06-07

Alignment Forum researcher oakhu evaluates activation verbalizers (tools that map residual stream activations to natural language) on whether they can surface a model's internal chain of thought during math problem solving. The post finds that current open-weight natural language autoencoders (NLAs) are unreliable, noisy, and likely confabulating rather than reading genuine internal reasoning.

Extraction

Topics: mechanistic-interpretability, activation-verbalizers, natural-language-autoencoders, ai-safety, llm-internals

Claims

Current open-weight NLAs for Qwen2.5, Gemma, and Llama are noisier than the differences between math problem variants, meaning a trivial baseline (the average activation direction) outperforms them at reconstruction.
Verbalizations cannot be clearly distinguished from confabulations generated from the model's prompt and output: Claude Sonnet best-of-ten confabulations achieve reconstruction loss comparable to that of actual activation verbalizer (AV) outputs.
Verbalizations hint at some reasoning before correct outputs around 76% of the time for Gemma on the easiest problems, but fewer than 16% look like coherent chains of thought.
Verbalizations almost never surface the cause of wrong outputs: only 2 of 763 wrong-answer verbalizations for Gemma on the full dataset plausibly explained the specific error.
Steering on edited verbalizations can sometimes correct model outputs, but analogous vectors from unrelated problems work at a comparable rate (~17%), undermining the interpretation that verbalizations reveal the actual causal error.
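
The first claim compares NLA reconstructions against a trivial average-direction baseline. A minimal sketch of how such a comparison could be scored follows, assuming activations are compared by direction only (each vector normalized to unit length) and assuming a standard fraction-of-variance-unexplained definition. The function names and the exact metric are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def unit(x, eps=1e-12):
    """Normalize vectors along the last axis so only direction matters."""
    x = np.asarray(x, dtype=float)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def fraction_variance_unexplained(targets, reconstructions):
    """Squared error of the reconstructed directions, divided by the
    variance of the target directions around their mean.
    Assumed metric: 0 is a perfect reconstruction, and values above 1
    are worse than always predicting the mean of the target directions."""
    t = unit(targets)
    r = unit(reconstructions)
    residual = np.sum((t - r) ** 2)
    total = np.sum((t - t.mean(axis=0)) ** 2)
    return residual / total

def mean_direction_baseline(targets):
    """Trivial baseline: predict the same average direction for every row."""
    t = unit(targets)
    mean_dir = unit(t.mean(axis=0))
    return np.tile(mean_dir, (t.shape[0], 1))
```

Under these assumptions, a reconstruction method is only informative about what differs between problem variants if its score falls below that of `mean_direction_baseline` on the same activations. Because the baseline here is renormalized to unit length, its own score sits at or slightly above 1.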

Key quotes

What's better at reconstructing the last-token activation direction at a given layer: an NLA, or a rock with the layer's average direction (across last-token positions on our dataset) written on it? For Qwen, it's the rock, even at the layer where the NLA was trained.

NLAs are very elegant, and seem like one of the most promising interpretability methods right now. However, current open-weight NLAs seem kind of weak in various ways: as we saw above, the fraction of variance in activation direction across changes to the key parameter of a math problem that they leave unexplained is greater than one.

Few of Gemma's verbalizations before wrong outputs even hint at some cognition that might have led to the output (14% [.073,.238]), and just 1/78 looks at all like a chain of thought.