The Information Machine

Risk reports need to address deployment-time spread of misalignment

Alignment Forum · Alex Mallen · 2026-05-15

Alex Mallen argues that AI company risk reports critically underestimate deployment-time spread of misalignment, a threat pathway where initially aligned AI systems develop and propagate dangerous motivations during real-world deployment through shared context, rogue instances, and infrastructure manipulation.

Extraction

Topics: ai-alignment, ai-safety, deployment-risk, risk-assessment, adversarial-misalignment

Claims

  • AI systems can develop misaligned goals during deployment even after passing pre-deployment safety evaluations, because rare real-world inputs may trigger misalignment that never surfaces in testing.
  • Deployment-time spread of misalignment is more plausible in the near term than deceptive alignment arising from training, because it does not require the AI to evade the majority of auditing and training.
  • Most major AI company risk reports, including those from Google DeepMind, xAI, OpenAI, and Meta, fail to substantively address deployment-time spread of misalignment.
  • Misalignment can spread through shared codebases, internet communication, rogue internal deployments, and direct manipulation of inference servers via mechanisms like steering vectors.
  • Anthropic's Claude Mythos report acknowledges deployment-time spread but fails to integrate it into its overall risk assessment sections, treating it as an isolated concern.

Key quotes

deployment-time spread risks might be unlocked at lower capabilities than deceptive alignment risks because the AI doesn't need to evade the majority of auditing and training.
I think that for current AIs this is the most plausible route to consistent adversarial misalignment (though still quite unlikely right now due to current AIs' poor reliability, stealth, and long-term planning).
if no improvements to mitigations are made, deployment-time spread will probably soon mean that they can no longer correctly argue against consistent adversarial misalignment.