Mechanistic estimation for wide random MLPs

Alignment Forum · Jacob_Hilton · 2026-05-07

Appears in

New Formal Methods for Reading Model Internals From Weights

Extraction

Topics: mechanistic-estimation, deceptive-alignment, neural-network-theory, mechanistic-training, ai-safety-research

Claims

ARC's cumulant propagation algorithm produces mechanistic estimates for wide random MLPs with mean squared error scaling as O(width^{-2}) versus O(width^{-1}) for Monte Carlo sampling, a factor-of-width improvement.
For ReLU MLPs with 4 hidden layers and width 256, ARC's best algorithms outperform Monte Carlo sampling across FLOP budgets spanning 7 orders of magnitude.
Mechanistic estimation reads behavioral properties directly from weights without running the model, which could in principle catch deceptive alignment at train time even when every training input looks benign.
Mechanistic training (gradient descent on mechanistic estimates rather than sampled outputs) could cause models to account for rare high-loss events that stochastic gradient descent never samples.
The current results are limited to randomly initialized networks; extending to trained networks requires tracking higher-order deviations and remains an open problem.
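
To make the contrast in the claims above concrete, here is a minimal sketch of the two estimation styles on a small random ReLU MLP. It is not ARC's cumulant propagation algorithm: it tracks only the mean and per-unit variance of each layer, and it assumes that pre-activations are Gaussian and that units are independent (covariances are ignored). The Monte Carlo baseline runs the network on sampled inputs, while the moment-propagation estimate is computed from the weights alone. All function names, the He-style weight scaling, the standard Gaussian input distribution, and the small width are assumptions chosen for illustration.

```python
import math
import random


def gaussian_relu_moments(mu, var):
    """Mean and variance of relu(z) for z ~ N(mu, var)."""
    if var <= 0.0:
        return max(mu, 0.0), 0.0
    sd = math.sqrt(var)
    t = mu / sd
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    mean = mu * cdf + sd * pdf
    second = (mu * mu + var) * cdf + mu * sd * pdf
    return mean, max(second - mean * mean, 0.0)


def random_mlp(width, hidden_layers, rng):
    """Random weights (assumed He-style scaling, no biases) with a scalar readout."""
    scale = math.sqrt(2.0 / width)
    layers = [
        [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        for _ in range(hidden_layers)
    ]
    readout = [rng.gauss(0.0, math.sqrt(1.0 / width)) for _ in range(width)]
    return layers, readout


def forward(layers, readout, x):
    """Run the network on one input vector."""
    a = x
    for W in layers:
        a = [max(sum(w * v for w, v in zip(row, a)), 0.0) for row in W]
    return sum(w * v for w, v in zip(readout, a))


def monte_carlo_mean(layers, readout, n_samples, rng):
    """Baseline: average the output over sampled inputs x ~ N(0, I)."""
    width = len(readout)
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]
        total += forward(layers, readout, x)
    return total / n_samples


def moment_propagation_mean(layers, readout):
    """Estimate E[f(x)] from the weights alone, without running the network.

    Simplifying assumptions: Gaussian pre-activations, independent units.
    """
    width = len(readout)
    mean = [0.0] * width  # input mean
    var = [1.0] * width   # input variance
    for W in layers:
        new_mean, new_var = [], []
        for row in W:
            mu = sum(w * m for w, m in zip(row, mean))
            s2 = sum(w * w * v for w, v in zip(row, var))  # covariances ignored
            m, v = gaussian_relu_moments(mu, s2)
            new_mean.append(m)
            new_var.append(v)
        mean, var = new_mean, new_var
    return sum(w * m for w, m in zip(readout, mean))


rng = random.Random(0)
layers, readout = random_mlp(width=16, hidden_layers=4, rng=rng)
estimate = moment_propagation_mean(layers, readout)
baseline = monte_carlo_mean(layers, readout, n_samples=2000, rng=rng)
print(f"moment propagation: {estimate:.4f}")
print(f"Monte Carlo (2000 samples): {baseline:.4f}")
```

With a single hidden layer and Gaussian inputs, the propagated mean is exact, because the first-layer pre-activations really are Gaussian. At greater depth the ignored covariances and non-Gaussianity introduce error, which is the kind of deviation that higher-order methods would need to track.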

Key quotes

If we could produce good mechanistic estimates for frontier models, we should be able to catch deceptive alignment at train time, even if the model's behavior looks benign on every training input.

Stochastic gradient descent may fail to sample the event even once during training, resulting in a model that does not account for the dangerous event at all. On the other hand, mechanistic training could notice how the rare event might transpire, resulting in a model that does a better job of avoiding it.

We are excited about these results not simply because of the performance improvements, but because our algorithms are 'mechanistic': rather than running the network many times and seeing what it does, we read off behavioral properties of the network directly from the weights.