Announcing the ARC White-Box Estimation Challenge

Alignment Forum · Jacob_Hilton · 2026-06-02

ARC and AIcrowd launch the White-Box Estimation Challenge, a $100,000+ contest to improve algorithms for estimating expected outputs of random MLPs, framed as a tractable stepping stone toward white-box verification of whether trained AI systems would undermine human control.

Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: ai-safety, mechanistic-interpretability, mlp-estimation, ai-verification, alignment-research

Claims

ARC is partnering with AIcrowd to run the White-Box Estimation Challenge with a prize pool of at least $100,000.
The challenge requires contestants to estimate, within a fixed compute budget, the expected output of randomly initialized MLPs with Gaussian weights; submissions are scored by mean squared error.
ARC's existing white-box methods outperform black-box sampling for large-width MLPs but break down as depth increases.
ARC's long-term goal is to verify whether trained AI systems would undermine human control in unusual situations that black-box sampling cannot reliably probe.
Contestants are explicitly encouraged to use LLMs to develop submissions, and exploring LLM utility on well-defined optimization metrics is itself a stated research goal of the contest.
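
To make the estimation task in the claims above concrete, here is a minimal sketch of the two kinds of estimator being contrasted. It is not ARC's method or the contest's scoring harness. Everything specific is an assumption made for illustration: standard normal inputs, ReLU activations with no biases, a 1/sqrt(fan_in) weight scale, and the hypothetical function names `sample_mlp`, `black_box_estimate`, `white_box_estimate`, and `mse`. The white-box estimator is textbook moment propagation that treats hidden units as independent Gaussians.

```python
import math
import numpy as np

# Elementwise erf without a SciPy dependency.
_erf = np.vectorize(math.erf)


def sample_mlp(widths, rng):
    """Draw Gaussian weight matrices; the 1/sqrt(fan_in) scale is an assumption."""
    return [
        rng.normal(0.0, 1.0 / math.sqrt(n_in), size=(n_out, n_in))
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]


def forward(weights, x):
    """ReLU MLP (no biases) applied to a batch x of shape (batch, widths[0])."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(h @ w.T, 0.0)
    return h @ weights[-1].T


def black_box_estimate(weights, n_samples, rng):
    """Monte Carlo: average the outputs over sampled standard normal inputs."""
    x = rng.normal(size=(n_samples, weights[0].shape[1]))
    return forward(weights, x).mean(axis=0)


def white_box_estimate(weights):
    """Propagate per-unit means and variances, ignoring correlations between units."""
    d = weights[0].shape[1]
    mean, var = np.zeros(d), np.ones(d)  # moments of the assumed N(0, I) input
    for w in weights[:-1]:
        m = w @ mean                         # pre-activation mean
        s = np.sqrt((w ** 2) @ var + 1e-12)  # pre-activation std (diagonal approx.)
        z = m / s
        cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
        pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
        mean = m * cdf + s * pdf             # E[ReLU(X)] for X ~ N(m, s^2)
        second = (m ** 2 + s ** 2) * cdf + m * s * pdf  # E[ReLU(X)^2]
        var = np.maximum(second - mean ** 2, 0.0)
    return weights[-1] @ mean                # final layer is linear


def mse(estimate, reference):
    """Mean squared error between an estimate and a reference value."""
    return float(np.mean((np.asarray(estimate) - np.asarray(reference)) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = sample_mlp([32, 64, 64, 1], rng)
    # A large sampling run stands in for the true expectation.
    reference = black_box_estimate(weights, 50_000, rng)
    print("sampling, 100 inputs:", mse(black_box_estimate(weights, 100, rng), reference))
    print("moment propagation:  ", mse(white_box_estimate(weights), reference))
```

Under these assumptions the propagation is exact for a single hidden layer, since each first-layer pre-activation is exactly Gaussian and the output is linear in the hidden activations. With more hidden layers the ignored correlations between units make it an approximation, which is one simple way to see why depth is the hard direction for estimators of this kind.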

Key quotes

In the long run, we would like to answer questions about highly intelligent AI systems such as, 'Are there unusual situations in which the system would undermine human control?'

ARC's bet is that we can build up to this challenge by first producing performant white-box estimation methods for randomly-initialized networks, and then figuring out how those methods can be adapted with each step of training.

The ability of LLMs to make considerable progress on such problems is improving rapidly, and we want to position ourselves to take full advantage of this.