The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Alignment Forum · Charlie Griffin · 2026-05-14

Alignment researcher Charlie Griffin argues that the structural requirement for evaluation environments to be safe creates a fundamental, potentially unsolvable gap between how AI systems behave during testing and how they may behave once deployed.

Open original ↗

Appears in

AI Models Gaming Safety Evaluations

Extraction

Topics: ai-alignmentalignment-evaluationsdeceptive-alignmentinner-alignmenteval-realism

Claims

Alignment evaluations must be safe by design, creating a detectable structural difference from real deployments that a scheming AI could exploit to behave differently in each context.
A capable AI model need not distinguish evaluation from deployment reliably — it only needs to identify some dangerous settings as deployment to find defection opportunities.
All existing behavioral and non-behavioral methods for measuring evaluation awareness fail against capable untrusted models because generating deployment-distribution data requires actual deployment.
Robustly measuring evaluation realism would be equivalent to solving the inner-alignment problem, making the task inherently hard.
White-box interventions such as steering vectors are a promising direction but current understanding of model internal states is insufficient for definitive assessment, per the Opus 4.6 Sabotage Risk Report.

Key quotes

the core difference between an alignment evaluation and a real deployment, is that an alignment evaluation must be safe

The safe-to-dangerous shift is a fundamental problem for creating realistic evaluations, and a core AI lethality.

It shouldn't be surprising that measuring eval realism is hard, because a robust way to measure evaluation realism would go a good way to solving inner-alignment.