New Formal Methods for Reading Model Internals From Weights
What
Two research groups are independently developing methods to extract behavioral properties of neural networks directly from their weights — without running the model on inputs.
Lucius Bushnaq (affiliated with the parameter decomposition program) introduced adversarial Parameter Decomposition (VPD), which decomposes a model's parameters into small subcomponents, each implementing a fragment of the learned algorithm. He claims the approach is now more or less ready to apply at scale to models people care about [1].
ARC's Jacob Hilton presented mechanistic estimation, a separate approach that reads behavioral properties from weights and achieves O(width⁻²) mean squared error scaling versus O(width⁻¹) for Monte Carlo sampling — a factor-of-width improvement — though currently only on randomly initialized MLPs [2].
Both efforts are positioned as foundational tools for catching deceptive alignment, but sit at very different points on the maturity curve.
Why it matters
If either approach matures to cover frontier trained models, it would enable safety audits that cannot be gamed by a model that merely behaves well on training inputs, which is the core threat model behind deceptive alignment. VPD's claimed ability to decompose attention layers fills a known gap in the interpretability toolkit, while mechanistic estimation's framing as a basis for 'mechanistic training' sketches a path toward gradient descent that directly optimizes against rare, catastrophic events that stochastic sampling never sees.
Open questions
Can VPD's decomposition of attention layers be validated externally, and does it hold at the scale of frontier models rather than the toy or mid-size networks typically used in interpretability research? [1]
ARC's mechanistic estimation currently applies only to randomly initialized networks — how large is the gap to trained networks, and what theoretical obstacles must be cleared to track higher-order deviations? [2]
Adversarial ablation is central to VPD's faithfulness claims; does this challenge the validity of prior interpretability results (circuit analyses, subnetwork findings) that used stochastic or non-adversarial ablation? [1]
Are VPD and mechanistic estimation complementary or competitive framings of the same goal? Neither paper discusses the other, leaving open whether a unified weight-reading program is feasible.
Narrative
A small cluster of AI safety researchers is converging on a shared intuition: the most robust way to audit a trained model may be to read its behavior directly from its weights, bypassing the need to run it on inputs at all. Two distinct technical programs are pursuing this goal, arriving from different directions.
The parameter decomposition effort, announced in a linkpost by Lucius Bushnaq on May 5, 2026, introduces adversarial Parameter Decomposition (VPD) as the third and most capable iteration of a line of work that previously produced Attribution-based Parameter Decomposition (APD) and Stochastic Parameter Decomposition (SPD) [1]. VPD breaks a model's weight matrices into subcomponents, each implementing only a small piece of the model's learned computation, such that any given input activates only a small fraction of them. The central technical claim is that VPD can decompose attention layers, a target that prior interpretability methods, including sparse autoencoders (SAEs), transcoders, and cross-layer transcoders (CLTs), have historically failed to handle cleanly [1]. A secondary finding is that adversarial ablation, which uses worst-case rather than stochastic or zero ablations when assessing whether a subcomponent matters, is critical to identifying causally important nodes faithfully. The authors argue this casts doubt on the faithfulness of subnetworks identified by the majority of prior subnetwork identification methods in the literature. The paper positions VPD as close to deployment-ready: 'the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about' [1].
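The difference between stochastic and adversarial ablation can be illustrated with a toy sketch. This is not VPD's actual procedure: the rank-one decomposition, the mask convention (a value in [0, 1] scaling each subcomponent), the choice of which subcomponents are "claimed unimportant", and all names below are assumptions made for illustration only. The sketch compares the average output error under random masks with the worst-case error over all allowed masks.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy layer: W is a sum of rank-one subcomponents u_i v_i^T.
d_in, d_out, n_comp = 6, 4, 5
U = rng.normal(size=(n_comp, d_out))
V = rng.normal(size=(n_comp, d_in))
x = rng.normal(size=d_in)

def output(mask):
    """Layer output with subcomponent i scaled by mask[i] in [0, 1]."""
    return (mask * (V @ x)) @ U

full_output = output(np.ones(n_comp))

# Assumption: subcomponents 2, 3, 4 are claimed to be unimportant for x,
# so their masks may take any value in [0, 1]; the rest stay at 1.
free_idx = [2, 3, 4]

def ablation_error(free_values):
    """Squared output change when the free subcomponents are masked."""
    mask = np.ones(n_comp)
    mask[free_idx] = free_values
    return float(np.sum((output(mask) - full_output) ** 2))

# Stochastic ablation: average error over uniformly sampled masks.
stochastic = float(np.mean(
    [ablation_error(rng.uniform(size=len(free_idx))) for _ in range(2000)]
))

# Adversarial ablation: worst-case error over the mask hypercube.
# The error is a convex quadratic in the mask, so its maximum over the
# hypercube is attained at a vertex; brute force is exact for small cases.
adversarial = max(
    ablation_error(np.array(corner))
    for corner in itertools.product([0.0, 1.0], repeat=len(free_idx))
)

print(f"stochastic mean error: {stochastic:.4f}")
print(f"adversarial max error: {adversarial:.4f}")
```

In this toy setting the worst-case error can never be smaller than the sampled average, so a set of subcomponents that looks safe to ablate on average may still fail the adversarial check. A real method would need a scalable search rather than enumerating vertices.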
Two days later, on May 7, ARC researcher Jacob Hilton published a complementary but independently motivated paper on mechanistic estimation [2]. Where VPD is an interpretability tool aimed at understanding what a model computes, mechanistic estimation is framed explicitly as a step toward catching deceptive alignment. The key idea is that a model which behaves well on all training inputs might nonetheless harbor dangerous behaviors on inputs that stochastic gradient descent never samples. Mechanistic estimation sidesteps this by computing behavioral properties, namely expectations of the model's outputs, analytically from the weights using a cumulant propagation algorithm, without sampling. For wide random ReLU MLPs with 4 hidden layers and width 256, ARC's algorithms outperform Monte Carlo sampling across FLOP budgets spanning seven orders of magnitude, with MSE scaling as O(width⁻²) versus O(width⁻¹) for Monte Carlo [2]. The longer-term vision is 'mechanistic training': running gradient descent on mechanistic estimates rather than sampled outputs, so that the optimizer can penalize dangerous behaviors that are too rare to appear in any practical number of samples. Hilton is careful to frame the current result as a 'base case': the method applies to randomly initialized networks only, and extending it to trained networks requires tracking higher-order statistical deviations, which remains an open problem [2].
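The idea of reading an expectation from the weights rather than sampling can be shown in the simplest possible case. The sketch below is not ARC's cumulant propagation algorithm. It assumes a single hidden ReLU layer, f(x) = v · ReLU(Wx), with standard Gaussian inputs, where the expected output has an exact closed form: each pre-activation is Gaussian with standard deviation equal to the norm of its weight row, and the mean of ReLU of a zero-mean Gaussian with standard deviation s is s / sqrt(2π). All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical random one-hidden-layer ReLU network: f(x) = v . ReLU(W x).
d_in, width = 16, 64
W = rng.normal(size=(width, d_in)) / np.sqrt(d_in)
v = rng.normal(size=width) / np.sqrt(width)

def analytic_mean(W, v):
    """Exact E[f(x)] for x ~ N(0, I), computed from the weights alone."""
    sigma = np.linalg.norm(W, axis=1)  # std of each pre-activation
    return float(v @ sigma / np.sqrt(2.0 * np.pi))

def monte_carlo_mean(W, v, n_samples, rng):
    """Sampling-based estimate of the same expectation."""
    x = rng.normal(size=(n_samples, W.shape[1]))
    hidden = np.maximum(x @ W.T, 0.0)
    return float(np.mean(hidden @ v))

exact = analytic_mean(W, v)
print(f"analytic mean from weights: {exact:.5f}")
for n in (100, 10_000, 100_000):
    estimate = monte_carlo_mean(W, v, n, rng)
    print(f"n={n:>7}: Monte Carlo error = {estimate - exact:+.5f}")
```

For one layer the closed form is exact, so sampling can only approach it. With several layers the pre-activations are no longer exactly Gaussian, which is where approximations of the kind the paper describes would come in.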
The two programs share an underlying premise, that weights contain recoverable behavioral information, but diverge sharply in method, maturity, and immediate application. VPD is claimed to be ready for deployment on real models and is focused on interpretability; mechanistic estimation is explicitly foundational and is focused on detecting deceptive alignment. Neither paper cites the other, suggesting these are parallel rather than coordinated efforts.
Timeline
- 2026-05-05: Lucius Bushnaq posts linkpost announcing adversarial Parameter Decomposition (VPD), claiming readiness for production-scale deployment and successful decomposition of attention layers [1]
- 2026-05-07: ARC's Jacob Hilton publishes mechanistic estimation results for wide random MLPs, reporting O(width⁻²) MSE scaling and framing the work as a base case toward mechanistic training against deceptive alignment [2]
Perspectives
Lucius Bushnaq (parameter decomposition program)
VPD is a mature, scalable interpretability method that solves the attention-layer problem and exposes faithfulness flaws in prior subnetwork identification techniques; more or less ready to be applied at scale to models people care about
Evolution: Consistent with the program's iterative trajectory (APD → SPD → VPD), but represents a significant confidence escalation: the team is now claiming readiness for real-world use
Jacob Hilton / ARC
Mechanistic estimation from weights is a promising but early-stage approach to deceptive-alignment detection; current results are a foundational proof of concept on random networks, not a deployable tool
Evolution: Consistent with initial position; no new statements observed
Tensions
- VPD's authors claim their adversarial ablation finding casts doubt on the faithfulness of subnetworks found by 'the majority of other subnetwork identification methods in the literature' [1]. This is a broad implicit challenge to prior circuit-analysis and SAE-based work, whose authors have not yet responded.
- VPD is claimed to be more or less ready to apply at scale [1], while mechanistic estimation explicitly flags a large gap between current capabilities (random networks) and practical applicability to trained models [2]. The two programs implicitly disagree on how far weight-reading methods have actually come.
Status: active but slowing
Sources
- [1] [Linkpost] Interpreting Language Model Parameters — Alignment Forum (2026-05-05)
- [2] Mechanistic estimation for wide random MLPs — Alignment Forum (2026-05-07)