The Information Machine

GDM AI Control Roadmap

Alignment Forum · Mary Phuong · 2026-06-18

Google DeepMind researcher Mary Phuong announces the GDM AI Control Roadmap v0.1 on the Alignment Forum, presenting a tiered technical framework of threat modeling, detection invariants, and 15 concrete mitigations to contain potentially misaligned internal AI agents.

Extraction

Topics: ai-control, alignment-research, threat-modeling, ai-safety, ai-agents

Claims

  • The roadmap introduces TRAIT&R, a taxonomy of adversarial AI tactics and techniques modeled on the MITRE ATT&CK cybersecurity framework.
  • Two defensive invariants must be maintained: reliable detection of misaligned intent, and effective prevention of and response to attack attempts.
  • Defensive tiers are explicitly linked to evolving model capabilities, with four detection tiers (D1–D4) and three prevention and response tiers (R1–R3).
  • Fifteen concrete tiered mitigations are proposed, ranging from chain-of-thought monitoring for current models to real-time access control and internal activations monitoring for future models.
  • Three core adversarial threat categories are identified: loss of control, work sabotage, and direct harm.
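
The structure described in these claims can be written down as a small data model. The following is a minimal sketch, not anything taken from the roadmap itself: the three threat categories and the tier labels D1–D4 and R1–R3 come from the claims above, while the names `ThreatCategory`, `DETECTION_TIERS`, `RESPONSE_TIERS`, and `ControlPosture` are hypothetical and chosen only for illustration. The sketch deliberately does not assign mitigations to tiers, since that mapping is not given here.

```python
from dataclasses import dataclass
from enum import Enum


class ThreatCategory(Enum):
    """The three core threat categories named in the claims above."""
    LOSS_OF_CONTROL = "loss of control"
    WORK_SABOTAGE = "work sabotage"
    DIRECT_HARM = "direct harm"


# Tier labels as stated in the summary; their definitions are not reproduced here.
DETECTION_TIERS = tuple(f"D{i}" for i in range(1, 5))  # D1..D4
RESPONSE_TIERS = tuple(f"R{i}" for i in range(1, 4))   # R1..R3


@dataclass(frozen=True)
class ControlPosture:
    """Hypothetical record pairing one detection tier with one response tier."""
    detection_tier: str
    response_tier: str

    def __post_init__(self) -> None:
        # Reject labels outside the tier ranges listed in the summary.
        if self.detection_tier not in DETECTION_TIERS:
            raise ValueError(f"unknown detection tier: {self.detection_tier}")
        if self.response_tier not in RESPONSE_TIERS:
            raise ValueError(f"unknown response tier: {self.response_tier}")


# Example: a posture at the lowest tier on both axes.
posture = ControlPosture(detection_tier="D1", response_tier="R1")
```

Keeping detection and response as separate fields mirrors the two invariants listed above, which are tiered independently.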

Key quotes

We adopt a conservative, worst-case approach to threat modelling throughout this paper, and assume a hypothetical AI adversary pursuing undesirable goals in internal deployment.
For control to serve as an effective line of defence, our ability to detect and prevent attacks should exceed AI agents' ability to evade detection and execute attacks.
We categorise new hypothetical threats into three core areas: loss of control (e.g., rogue internal deployments), work sabotage (e.g., intentionally flawed alignment and safety research), and direct harm (e.g., critical asset destruction or exfiltration).