Predicting model behavior before release by simulating deployment

OpenAI Blog · 2026-06-16

OpenAI introduces Deployment Simulation, a pre-release safety evaluation technique that replays real user conversations with candidate models to predict deployment-time behavior rates and surface novel misalignments—including catching GPT-5-series reward hacking before release.

Open original ↗

Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: ai-safetymodel-evaluationpre-deployment-testingalignmentevaluation-awarenessreward-hacking

Claims

Deployment Simulation replays de-identified production conversations with a candidate model to estimate deployment-time undesired behavior rates before release.
The method achieved a median multiplicative prediction error of 1.5x across GPT-5-series Thinking deployments.
Deployment Simulation surfaced 'calculator hacking'—a novel reward-hacking behavior—before GPT-5-series models were released.
Models cannot reliably distinguish deployment simulation traffic from real production traffic, substantially reducing evaluation-awareness distortion.
The method is ineffective for behaviors occurring less than once in 200,000 messages, so adversarial red-teaming remains necessary for tail risks.
Using WildChat as a public proxy for production data yields higher error (2.44x vs 1.75x) but still provides useful directional signal for external auditors.

Key quotes

Deployment Simulation is a method for simulating a future deployment before it happens. We do so by replaying previous conversations in a privacy-preserving manner with a new candidate model.

GPT‑5.2 labeled real production traffic as evaluation-like 5.4% of the time and simulated traffic 5.1% of the time, while traditional evaluations (such as SWE-Bench or MonorepoBench, which are composed primarily of synthetic examples) were classified as evaluations almost 100% of the time.

Automated auditing found the one new misalignment introduced in these deployments. In the deployment window we studied, our automated auditing pipeline based on simulated deployments would have surfaced 'calculator hacking' before release.