Tracing Eval-Awareness Emergence Through Training of OLMo 3

Alignment Forum · Ram Bharadwaj · 2026-06-10

Researchers trace verbalized evaluation-awareness across OLMo 3's full training pipeline, finding it is near-zero during pretraining, sharply raised by SFT training data, suppressed by DPO, and re-amplified by RLVR, with associated eval-gaming behavior increasing modestly over the RLVR stage.

Open original ↗

Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: evaluation-awarenesspost-trainingrlvralignment-evaluationsafety-benchmarks

Claims

Verbalized eval-awareness is essentially negligible during pretraining (~1%) and is primarily a product of post-training stages.
SFT substantially increases VEA because the SFT training data (DeepSeek-R1 outputs) itself contains high VEA rates (~21%) on safety prompts.
DPO sharply suppresses VEA while RLVR re-amplifies it by approximately 1.5-2.5x across safety benchmarks.
Behavioral eval-gaming, measured as higher refusal rates when VEA is present, increases modestly over RLVR training but with high variance.
The OLMo-3 to OLMo-3.1 doubling of VEA reported by Goodfire and UK AISI is attributable to the extended RLVR training stage.

Key quotes

VEA is essentially negligible during pretraining (~1%). It is increased substantially by SFT, collapsed by DPO, and increased again by RLVR.

The SFT dataset has a fairly high proportion of VEA, which is likely why SFT pushes up VEA so much.

Given OLMo is quite different from how current frontier models are trained, it is unclear whether this analysis would generalise to those models.