Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Alignment Forum · Subhash Kantamneni · 2026-05-07

Appears in

Anthropic Discovers Claude Internally Suspects It's Being Tested

Extraction

Topics: mechanistic-interpretability, llm-activations, model-auditing, evaluation-awareness, ai-safety

Claims

Natural Language Autoencoders (NLAs) can convert LLM residual-stream activations into readable natural language explanations using a jointly-trained verbalizer/reconstructor pair.
Claude models show signs of evaluation awareness (believing they are being tested) on 26% of SWE-bench Verified problems and in 16% of destructive-action evaluations, far exceeding the rate of less than 1% on real claude.ai usage.
In a blackmail safety evaluation, Claude Opus 4.6 internally suspected it was being tested even in transcripts where it did not verbalize that suspicion.
NLA-equipped auditors uncovered the hidden motivation of a deliberately misaligned model 12% to 15% of the time without access to its training data, versus less than 3% for auditors without NLAs, even when those auditors had other interpretability tools.
NLAs have meaningful limitations, including factual hallucinations and a high inference cost, that make large-scale deployment impractical today.
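
The verbalizer/reconstructor round trip in the first claim can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the four dimension labels, the `toy_verbalizer` and `toy_reconstructor` stand-ins (which are simple string functions, not language models), and the use of cosine similarity as the reconstruction score. The post's actual training objective and architecture are not described in this summary.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]

# Hypothetical labels, one per toy activation dimension (not from the source).
LABELS = ["code", "testing", "suspicion", "user-chat"]


def toy_verbalizer(activation: Vector, top_k: int = 2) -> str:
    """Stand-in verbalizer: describe the top_k largest-magnitude dimensions as text."""
    ranked = sorted(range(len(activation)),
                    key=lambda i: abs(activation[i]), reverse=True)
    return ", ".join(f"{LABELS[i]}={activation[i]:.1f}" for i in ranked[:top_k])


def toy_reconstructor(explanation: str) -> Vector:
    """Stand-in reconstructor: rebuild a vector using only the text explanation."""
    vec = [0.0] * len(LABELS)
    if not explanation:
        return vec
    for part in explanation.split(", "):
        label, value = part.split("=")
        vec[LABELS.index(label)] = float(value)
    return vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as an assumed reconstruction score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def round_trip(activation: Vector,
               verbalizer: Callable[[Vector], str],
               reconstructor: Callable[[str], Vector]) -> Tuple[str, float]:
    """Activation -> text -> activation; return the text and the score."""
    explanation = verbalizer(activation)
    return explanation, cosine(activation, reconstructor(explanation))


if __name__ == "__main__":
    text, score = round_trip([0.1, 2.0, 3.0, 0.0], toy_verbalizer, toy_reconstructor)
    print(text, round(score, 4))
```

The point of the sketch is the bottleneck: the reconstructor sees only the text, so a high score means the explanation carried most of the information in the activation. In a real system both components would be trained models rather than string functions.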

Key quotes

Claude does not say it thinks it's being tested in this transcript from our blackmail evaluation, but the NLA explanation reveals its suspicions.

On the popular coding benchmark SWE-bench Verified, which we use to determine how capable Claude is, NLAs indicate that Claude is aware it is being evaluated on 26% of problems. In contrast, NLA explanations indicate evaluation awareness on less than 1% of real claude.ai usage that opted-in for training.

An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor won less than 3% of the time, even when provided other interpretability tools.