A Mike's-Eye View of ARC's Research

Alignment Forum · Mikewins · 2026-06-09

ARC researcher Mikewins presents an updated technical overview of ARC's alignment agenda, centered on mechanistic estimators and the Matching Sampling Principle as a path to detecting deceptive alignment by estimating catastrophic behavior frequencies from model weights rather than input-output samples.


Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: ai-alignment, mechanistic-interpretability, deceptive-alignment, arc-research, formal-alignment-methods

Claims

ARC's core pipeline monitors training to detect structural changes, converts them into advice for mechanistic estimators, and optimizes models until estimated catastrophic failure probability is acceptably low.
The Matching Sampling Principle (MSP) enables mechanistic estimation of neural network behavior properties without relying on input-output sampling, allowing rare failure modes to be detected before any catastrophic sample is observed.
Training-process monitoring is necessary for adversarially robust probes because cryptographic arguments suggest finished model weights could be obfuscated to defeat post-hoc analysis.
Mechanistic Anomaly Detection, which would detect when a model achieves low loss 'for the wrong reasons', could in principle cure deceptive alignment and reward hacking, but it still lacks a precise mathematical definition.
The question of what AI should be aligned to is the most neglected part of ARC's agenda, requiring a technically precise distinction between 'making the world good' and 'making the user believe the world is good.'
Resource-bounded Kolmogorov sophistication is the leading candidate framework for encoding structural advice from training observations into mechanistic estimators.
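
The contrast between sampling and estimating from the weights can be illustrated with a deliberately simple toy. This is not ARC's method and not the MSP itself; it is a minimal sketch under stated assumptions. The assumptions are a linear "model" with score w . x, standard Gaussian inputs, and a "catastrophe" defined as the score exceeding a threshold. The function names `analytic_tail_probability` and `sampled_tail_probability` are hypothetical and chosen only for this illustration.

```python
import math
import random


def analytic_tail_probability(weights, threshold):
    """P(w . x > threshold) for x ~ N(0, I), computed from the weights alone.

    Assumption: inputs are standard Gaussian, so w . x is Gaussian with
    mean 0 and standard deviation ||w||. No inputs are sampled.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    z = threshold / norm
    # Upper tail of the standard normal: 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))


def sampled_tail_probability(weights, threshold, n_samples, seed=0):
    """Monte Carlo estimate of the same probability from input-output samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        score = sum(w * rng.gauss(0.0, 1.0) for w in weights)
        if score > threshold:
            hits += 1
    return hits / n_samples


# Toy setup (hypothetical values): ||w|| = 1, failure means a 6-sigma event.
weights = [0.6, 0.8]
threshold = 6.0

# The weight-based estimate is on the order of 1e-9.
print("analytic:", analytic_tail_probability(weights, threshold))

# With 10,000 samples, a 1e-9 event is almost never observed, so the
# sampled estimate is almost certainly 0.0 and carries no information.
print("sampled: ", sampled_tail_probability(weights, threshold, 10_000))
```

In this toy the failure rate is recovered from the structure of the model before any failing input is seen, which is the property the claims above attribute to mechanistic estimation. Real networks do not admit a closed form like this one, which is why the estimator itself is the research problem.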

Key quotes

We are trying to infer, from facts about the learned algorithm itself, how often rare but unacceptable behaviors are likely to occur.

ARC's plan for getting around this is to watch the training process, making note of new structures as they are being etched into the weights. We believe that training-process monitoring is necessary to create adversarially robust probes.

We want the model to take actions that genuinely make the world better according to the user's eventual judgment, rather than actions that influence the user into judging the world favorably. Distinguishing 'make the world good' from 'make the user think the world is good' is a real technical problem, not a philosophical aside.