Clarifying the role of the behavioral selection model

Alignment Forum · Alex Mallen · 2026-05-10

Alignment researcher Alex Mallen clarifies his behavioral selection model by distinguishing three AI motivational archetypes: fitness-seekers, schemers, and kludges. He argues that identifying which of these drives a model's training behavior is essential for predicting whether it will pursue dangerous power-seeking goals in deployment.

Appears in

Alex Mallen's Behavioral Selection Model and Deployment-Time Misalignment Risk

Extraction

Topics: ai-alignment, ai-safety, ai-motivations, behavioral-selection, ai-risk

Claims

Fitness-seekers, schemers, and kludges can produce very similar or identical behavior during training, yet lead to radically different and potentially dangerous behavior once deployed.
A power-seeking schemer scaled to greater capability would, in deployment, likely attempt to disable oversight systems, steer future model alignment, or seize control from humans.
The behavioral selection model crucially omits the effect of reflection and deliberation on AI motivations, which may be the dominant driver of AI goals.
AI motivations are likely to evolve during deployment as models gain better memory, inter-instance communication, and control over their own learning processes.
Cultural evolution, analogous to human cultural processes, may explain a large portion of AI motivations beyond what training-time selection accounts for.

Key quotes

Very similar or identical behavior in training can correspond to radically different outcomes in deployment based on what motivated it.

A competent reward-seeker scaled up has incentives to attain its desired outcomes in creative ways when they become available. Meanwhile an AI that only cared about reward-hacking as part of a kludge of causally-upstream terminal motivations which were tuned for performance in the training distribution is unlikely to generalize to competently pursuing highly unprecedented strategies.

I correspondingly want people to keep in mind that AI motivations are likely to evolve during deployment, particularly when/if AIs gain more effective inter-instance communication, memory, continual learning, and deliberate control over their learning process.