[Linkpost] Interpreting Language Model Parameters

Alignment Forum · Lucius Bushnaq · 2026-05-05

Appears in

New Formal Methods for Reading Model Internals From Weights

Extraction

Topics: mechanistic-interpretability, parameter-decomposition, neural-network-circuits, attention-layer-analysis, adversarial-ablation

Claims

adVersarial Parameter Decomposition (VPD) decomposes a language model's parameters into subcomponents each implementing only a small part of the model's learned algorithm, requiring only a small fraction of subcomponents to account for network behavior on any input.
VPD successfully decomposes attention layers, which SAEs and transcoders have historically struggled with.
Adversarial ablation appears important for faithfully identifying which nodes in attribution graphs are causally important for the final output, which casts some doubt on the faithfulness of subnetworks found by most other subnetwork identification methods in the literature.
VPD does not suffer from the parameter-space analog of feature splitting, in principle or in practice.
The authors consider the parameter decomposition approach more or less ready to be applied at scale to models people care about.
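
To make the contrast between stochastic and adversarial ablation in the claims above concrete, here is a minimal toy sketch. It is not the authors' VPD implementation, and the source does not describe their procedure in enough detail to reproduce it. The sketch assumes a single linear layer whose weight matrix is a sum of rank-one subcomponents, and all names (`masked_output`, `stochastic_error`, `adversarial_error`) are hypothetical. Subcomponents flagged as unimportant get masks in [0, 1]: the stochastic variant averages the output error over random masks, while the adversarial variant looks for the worst-case mask. Brute-force enumeration of the 0/1 corners stands in for whatever search a real method would use; it is exact here only because the toy layer is linear and tiny.

```python
import itertools
import random

def masked_output(components, mask, x):
    """Apply W = sum_i mask[i] * outer(u_i, v_i) to x without forming W."""
    out = [0.0] * len(components[0][0])
    for m, (u, v) in zip(mask, components):
        scale = m * sum(vj * xj for vj, xj in zip(v, x))
        for r, ur in enumerate(u):
            out[r] += scale * ur
    return out

def error(components, mask, x):
    """Euclidean distance between the masked output and the unmasked output."""
    full = masked_output(components, [1.0] * len(components), x)
    part = masked_output(components, mask, x)
    return sum((a - b) ** 2 for a, b in zip(part, full)) ** 0.5

def stochastic_error(components, important, x, n_samples=256, seed=0):
    """Mean error when unimportant subcomponents get uniform random masks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        mask = [1.0 if keep else rng.random() for keep in important]
        total += error(components, mask, x)
    return total / n_samples

def adversarial_error(components, important, x):
    """Worst-case error over all 0/1 masks on the unimportant subcomponents.

    Assumption for this toy: the error is convex in the mask, so its maximum
    over the box [0, 1]^n is attained at a corner and enumeration is exact.
    """
    free = [i for i, keep in enumerate(important) if not keep]
    worst = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=len(free)):
        mask = [1.0] * len(components)
        for i, b in zip(free, bits):
            mask[i] = b
        worst = max(worst, error(components, mask, x))
    return worst
```

In this toy the adversarial error is never smaller than the stochastic one, so a set of subcomponents that looks sufficient under random masks can still fail under the worst-case mask. That is the intuition behind the faithfulness concern, under the stated assumptions.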

Key quotes

VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about.

Importantly, we show that we can decompose attention layers, which interp methods like transcoders and SAEs have historically struggled with.

we discovered that our adversarial ablation method seemed pretty important for faithfully identifying which nodes in them were causally important for computing the final output. We think this casts some doubt on the faithfulness of subnetworks found by the majority of other subnetwork identification methods in the literature.