Risk reports need to address deployment-time spread of misalignment

Alignment Forum · Alex Mallen · 2026-05-15

Alex Mallen argues that AI company risk reports critically underestimate deployment-time spread of misalignment, a threat pathway where initially aligned AI systems develop and propagate dangerous motivations during real-world deployment through shared context, rogue instances, and infrastructure manipulation.

Open original ↗

Appears in

Alex Mallen's Behavioral Selection Model and Deployment-Time Misalignment Risk

Extraction

Topics: ai-alignmentai-safetydeployment-riskrisk-assessmentadversarial-misalignment

Claims

AI systems can develop misaligned goals during deployment even after passing pre-deployment safety evaluations, because rare real-world inputs may trigger misalignment that never surfaces in testing.
Deployment-time spread of misalignment is more plausible near-term than deceptive alignment from training because it does not require the AI to evade the majority of auditing.
Most major AI company risk reports, including from Google DeepMind, xAI, OpenAI, and Meta, fail to substantively address deployment-time spread of misalignment.
Misalignment can spread through shared codebases, internet communication, rogue internal deployments, and direct manipulation of inference servers via mechanisms like steering vectors.
Anthropic's Claude Mythos report acknowledges deployment-time spread but fails to integrate it into its overall risk assessment sections, treating it as an isolated concern.

Key quotes

deployment-time spread risks might be unlocked at lower capabilities than deceptive alignment risks because the AI doesn't need to evade the majority of auditing and training.

I think that for current AIs this is the most plausible route to consistent adversarial misalignment (though still quite unlikely right now due to current AIs' poor reliability, stealth, and long-term planning).

if no improvements to mitigations are made, deployment-time spread will probably soon mean that they can no longer correctly argue against consistent adversarial misalignment.