LLM-Driven Feature Discovery

Alignment Forum · Josh Engels · 2026-06-22

Josh Engels, writing on the Alignment Forum, describes LLM-Driven Feature Discovery, a black-box method for surfacing behavioral clusters in LLM transcripts: an autorater generates features, which are then semantically embedded and clustered. The method was applied to 100k Gemini chat transcripts and requires no access to model internals.

Extraction

Topics: llm-behavior-analysis, feature-discovery, interpretability, model-evaluation, sparse-autoencoders

Claims

LLM-driven feature discovery can surface qualitative behavioral clusters from model transcripts using only model outputs, without access to internal activations.
The method functions as a 'black box SAE': an LLM autorater generates features for each transcript segment, and those features are then embedded, clustered, and labeled. This is analogous to sparse autoencoder featurization but operates on text rather than activations.
Applied to 100k Gemini chat transcripts, the method produces many interesting high-level behavioral clusters, including the model tracking its token budget, detecting roleplay vs. reality, and entering infinite reasoning loops.
Logistic regression probes trained on user-turn features mostly fail to predict thought or response clusters, with the exception of obvious correlations such as language matching and domain-specific formatting.
Compared to SAEs, the method offers clearer natural-language explanations and requires no model internals, but cannot be used for activation steering and is more computationally expensive.
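
The featurize, embed, cluster, label pipeline described in the claims above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the post's implementation: the canned `CANNED_FEATURES` table stands in for an LLM autorater, a bag-of-words vector stands in for a semantic embedding model, a greedy cosine-threshold pass stands in for whatever clustering algorithm was actually used, and shared words stand in for an LLM-written cluster label. All names, segment IDs, feature strings, and the threshold are hypothetical.

```python
import math
from collections import Counter

# Hypothetical autorater output: feature descriptions per transcript segment.
# In the real method an LLM would generate these from the transcript text.
CANNED_FEATURES = {
    "seg-1": ["model tracks remaining token budget"],
    "seg-2": ["model checks remaining token budget before answering"],
    "seg-3": ["model considers whether scenario is roleplay or reality"],
    "seg-4": ["model questions whether scenario is roleplay"],
}


def autorate(segment_id: str) -> list[str]:
    """Stand-in for the LLM autorater (assumption: a simple lookup)."""
    return CANNED_FEATURES[segment_id]


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real pipeline would use a semantic model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(features: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters: list[list[str]] = []
    for feat in features:
        vec = embed(feat)
        for members in clusters:
            if cosine(vec, embed(members[0])) >= threshold:
                members.append(feat)
                break
        else:
            clusters.append([feat])
    return clusters


def label(members: list[str]) -> str:
    """Toy label: words of the first member that appear in every member."""
    word_sets = [set(m.lower().split()) for m in members]
    first_words = members[0].lower().split()
    return " ".join(w for w in first_words if all(w in s for s in word_sets))


# Run the pipeline end to end on the canned segments.
features = [f for seg in CANNED_FEATURES for f in autorate(seg)]
clusters = cluster(features)
labels = [label(members) for members in clusters]

for name, members in zip(labels, clusters):
    print(f"{name!r}: {len(members)} features")
```

The point of the sketch is only the data flow: every stage consumes and produces text or vectors derived from text, so nothing in it needs access to model activations.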

Key quotes

During the project, we sometimes thought of this work as a sort of 'black box SAE', since it was solving a similar problem as SAEs of featurizing model text, but without using model internals.

We find that there are many interesting high level features, particularly in model thoughts. For example, the model being aware of the number of tokens it can generate, considering whether the scenario is reality or a roleplay, and getting stuck in infinite loops.

One proxy task that seems interesting is building a (potentially very long) natural language report such that by reading it, one would be able to understand the way that Gemini would act in many situations.