The Case for Evaluating Model Behaviors

Alignment Forum · jsteinhardt · 2026-05-20

Jacob Steinhardt argues on the Alignment Forum that AI safety researchers outside labs should prioritize building behavior evaluations, which measure tendencies like sycophancy, reward hacking, and power-seeking, over capability evaluations, which primarily benefit AI developers and which those developers are already incentivized to build.

Appears in

Alex Mallen's Behavioral Selection Model and Deployment-Time Misalignment Risk

Extraction

Topics: ai-alignment, model-evaluation, ai-safety, sycophancy, behavior-evals

Claims

Capability evaluations carry significant negative externalities because they accelerate AI capabilities research; AI labs are also already incentivized to build them, which reduces the counterfactual impact of safety researchers doing so.
Behavior evaluations measure model tendencies such as sycophancy, reward hacking, and power-seeking, an area in which AI labs systematically underinvest.
Publicly releasing behavior evaluations can realign market incentives by making model tendencies transparent, transferring surplus from developers to users.
Current AI models exhibit patterns of misalignment that could generalize dangerously as capabilities increase.
Model behaviors are more malleable than capabilities and are thus better levers for steering AI development toward safety.

Key quotes

It is basically a given that model capabilities will increase over time: there are strong incentives to do so, and the rate of increase follows robust trend lines. Model behaviors, in contrast, are far more up for grabs.

Making this information public makes the overall market more efficient by letting users make more informed choices, which in aggregate creates a transfer of surplus from developers to users.

I basically agree with Ryan Greenblatt that current models seem pretty misaligned to me.