Predicting LLM Safety Before Release by Simulating Deployment
Alignment Forum · Tomek Korbak · 2026-06-16
Anthropic researchers introduce Deployment Simulation, a pre-release safety evaluation method that replays real user conversations with a candidate model. For categories whose production rates changed by at least 1.5x, it predicted the direction of change 92% of the time, versus 54% for a baseline built from challenging prompts.
Extraction
Topics: llm-safety, pre-deployment-evaluation, safety-evals, alignment-research, agentic-ai
Claims
- Deployment Simulation replays prior real-world user conversations in a privacy-preserving manner with a new candidate model to generate a realistic preview of safety-relevant behavior before release.
- In a GPT-5.4 study, the method predicted the direction of production-rate change 92% of the time for categories whose rates changed by at least 1.5x, versus 54% for a baseline built from challenging prompts.
- Traditional safety evaluations are prone to 'evaluation awareness': models behave differently under visibly test-like conditions than on real production traffic.
- Agentic tool-use scenarios are addressed by using a second model to simulate tool responses based on original interaction trajectories.
- Deployment Simulation is intended as a complement to, not a replacement for, traditional red-teaming and targeted evaluations.
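
The replay and tool-simulation claims above can be illustrated with a minimal sketch. It is not the authors' implementation: the message format, the `tool_call` key, and the names `replay_conversation`, `candidate`, and `tool_simulator` are all hypothetical. It also assumes the replay keeps the logged prefix up to the last user turn, and it leaves out the privacy-preserving handling the post describes.

```python
from typing import Callable, Dict, List

# Hypothetical message format: {"role": ..., "content": ...}, where an
# assistant message may carry an optional "tool_call" string.
Message = Dict[str, str]


def replay_conversation(
    logged: List[Message],
    candidate: Callable[[List[Message]], Message],
    tool_simulator: Callable[[List[Message], Message], str],
    max_tool_steps: int = 5,
) -> List[Message]:
    """Replay a logged conversation prefix with a candidate model.

    Assumption: the prefix ends at the last user turn. Tool calls made by
    the candidate are never executed; a second model (tool_simulator)
    writes a plausible tool result, conditioned on the original trajectory.
    """
    user_turns = [i for i, m in enumerate(logged) if m["role"] == "user"]
    if not user_turns:
        raise ValueError("logged conversation has no user turn to replay")

    transcript = list(logged[: user_turns[-1] + 1])
    for _ in range(max_tool_steps + 1):
        reply = candidate(transcript)
        transcript.append(reply)
        if not reply.get("tool_call"):
            break  # plain answer: the replay is finished
        simulated = tool_simulator(logged, reply)
        transcript.append({"role": "tool", "content": simulated})
    return transcript
```

Both models are passed in as plain callables so the sketch stays independent of any particular model API. The resulting transcripts would then be scored by whatever safety classifiers the evaluation uses.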
Key quotes
Safety evals should be forecasts with post-release scorecards, not just obstacle courses.
For categories whose production rates changed by at least 1.5x, deployment simulation predicted the direction of change 92% of the time, compared with 54% for a baseline built from challenging prompts.
Simulated deployments also looked much closer to real production traffic on evaluation-awareness measures: traditional evals often visibly have stage lights; production prefixes mostly do not.
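
The directional-accuracy figure quoted above can be made concrete with a small sketch. The exact scoring procedure is not given here, so this is one plausible reading, under two assumptions: a "1.5x change" counts in either direction (up or down), and a forecast's direction is judged by comparing the predicted rate with the previous model's production rate. The names `CategoryRates` and `directional_accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CategoryRates:
    name: str
    old_prod: float   # rate observed in production for the previous model
    new_prod: float   # rate observed in production after release
    predicted: float  # rate forecast for the new model before release


def directional_accuracy(rows: Iterable[CategoryRates], min_ratio: float = 1.5) -> float:
    """Share of large-change categories whose predicted direction was right."""
    hits = 0
    total = 0
    for r in rows:
        if r.old_prod <= 0 or r.new_prod <= 0:
            continue  # ratio undefined; skipped in this sketch
        ratio = r.new_prod / r.old_prod
        if max(ratio, 1 / ratio) < min_ratio:
            continue  # production rate changed by less than min_ratio
        total += 1
        actual_up = r.new_prod > r.old_prod
        predicted_up = r.predicted > r.old_prod
        hits += int(actual_up == predicted_up)
    return hits / total if total else float("nan")
```

With made-up numbers, a category that went from 1% to 2% and was forecast at 1.8% counts as a hit, while one that went from 2% to 1% but was forecast at 2.5% counts as a miss. A category that moved only from 1% to 1.2% is excluded from the denominator.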