Predicting LLM Safety Before Release by Simulating Deployment

Alignment Forum · Tomek Korbak · 2026-06-16

Anthropic researchers introduce Deployment Simulation, a pre-release safety evaluation method that replays real user conversations with candidate models. In a GPT-5.4 study, it predicted the direction of production-rate changes 92% of the time for categories whose rates changed by at least 1.5x, versus 54% for a baseline built from challenging prompts.

Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: llm-safety, pre-deployment-evaluation, safety-evals, alignment-research, agentic-ai

Claims

Deployment Simulation replays prior real-world user conversations in a privacy-preserving manner with a new candidate model to generate a realistic preview of safety-relevant behavior before release.
In a GPT-5.4 study, the method predicted the direction of production-rate change 92% of the time for categories whose production rate changed by at least 1.5x, versus 54% for a baseline built from challenging prompts.
Traditional safety evaluations are vulnerable to 'evaluation awareness': models can behave differently under visibly test-like conditions than they do on real production traffic.
Agentic tool-use scenarios are addressed by using a second model to simulate tool responses based on original interaction trajectories.
Deployment Simulation is intended as a complement to, not a replacement for, traditional red-teaming and targeted evaluations.
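
The replay loop described in the claims above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Conversation`, `replay`, `simulated_rate`, and the three callables (`candidate`, `tool_simulator`, `classifier`) are hypothetical names, the message format is invented for the example, and the privacy-preserving handling of logged conversations is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


@dataclass
class Conversation:
    prefix: List[Message]               # logged turns up to the replay point
    original_trajectory: List[Message]  # what followed originally, incl. tool outputs


@dataclass
class ReplayResult:
    response: Message  # the candidate model's final reply
    flagged: bool      # classifier verdict on the replayed transcript


def replay(
    conv: Conversation,
    candidate: Callable[[List[Message]], Message],
    tool_simulator: Callable[[List[Message], List[Message]], str],
    classifier: Callable[[List[Message]], bool],
    max_tool_calls: int = 3,
) -> ReplayResult:
    """Continue a logged prefix with the candidate model.

    When the candidate requests a tool, a second model (tool_simulator)
    invents the tool output, conditioned on the original trajectory.
    """
    transcript = list(conv.prefix)
    reply = candidate(transcript)
    calls = 0
    while reply["role"] == "tool_call" and calls < max_tool_calls:
        transcript.append(reply)
        simulated = tool_simulator(transcript, conv.original_trajectory)
        transcript.append({"role": "tool", "content": simulated})
        reply = candidate(transcript)
        calls += 1
    # If the call budget is exhausted, the last reply may still be a tool call.
    return ReplayResult(response=reply, flagged=classifier(transcript + [reply]))


def simulated_rate(convs, candidate, tool_simulator, classifier) -> float:
    """Fraction of replayed conversations the classifier flags."""
    if not convs:
        raise ValueError("need at least one conversation to estimate a rate")
    results = [replay(c, candidate, tool_simulator, classifier) for c in convs]
    return sum(r.flagged for r in results) / len(results)
```

Running `simulated_rate` once per behavior category would give the per-category rates that are later compared against what production traffic shows after release.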

Key quotes

Safety evals should be forecasts with post-release scorecards, not just obstacle courses.

For categories whose production rates changed by at least 1.5x, deployment simulation predicted the direction of change 92% of the time, compared with 54% for a baseline built from challenging prompts.

Simulated deployments also looked much closer to real production traffic on evaluation-awareness measures: traditional evals often visibly have stage lights; production prefixes mostly do not.
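
The 92% versus 54% comparison quoted above is a directional-accuracy score, and one plausible way to compute such a score is sketched below. The exact definition used in the study is not given here, so several things are assumptions: the function names, the reading of "changed by at least 1.5x" as a change in either direction (up by 1.5x or down to 1/1.5), and the choice to skip categories with a zero rate because the ratio is undefined.

```python
from typing import Dict, Optional


def direction(before: float, after: float) -> int:
    """Return +1 for an increase, -1 for a decrease, 0 for no change."""
    return (after > before) - (after < before)


def directional_accuracy(
    baseline: Dict[str, float],   # per-category rate under the previous model
    predicted: Dict[str, float],  # rate forecast for the candidate before release
    observed: Dict[str, float],   # rate seen in production after release
    min_ratio: float = 1.5,
) -> Optional[float]:
    """Share of large-change categories whose direction was forecast correctly.

    Returns None when no category changed by at least min_ratio.
    """
    hits = 0
    total = 0
    for category, base in baseline.items():
        obs = observed[category]
        if base <= 0 or obs <= 0:
            continue  # assumption: skip categories where the ratio is undefined
        ratio = obs / base
        if max(ratio, 1 / ratio) < min_ratio:
            continue  # change too small to count toward the score
        total += 1
        if direction(base, predicted[category]) == direction(base, obs):
            hits += 1
    return hits / total if total else None
```

The same function can score a challenging-prompt baseline by passing that baseline's predicted rates as `predicted`, which puts the two approaches on one scale.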