My research agenda and work

Alignment Forum · Seth Herd · 2026-06-05

Seth Herd outlines his AI alignment research agenda, arguing that mechanistically predicting the architecture of the first takeover-capable LLM-based AGI is a neglected third path between prosaic alignment and agent foundations that can yield more targeted interventions.

Appears in

AI Labs Simultaneously Acknowledge Recursive Self-Improvement Threshold

Extraction

Topics: ai-alignment, llm-architecture, takeover-capable-ai, alignment-stability, computational-cognitive-neuroscience

Claims

The first takeover-capable AI (TCAI) will most likely be an advanced LLM augmented with persistent memory, executive function, and metacognition rather than a purely novel architecture.
Current alignment research neglects a third approach, mechanistically predicting properties of the first takeover-capable AI, in favor of either empirical study of current systems or fully general agent-foundations theory.
Adding continuous learning and executive function to LLMs will create new alignment instability through goal drift and ontology shift that existing techniques do not address.
Internal independent review, in which a separate model instance checks actions before execution, is a promising and likely-to-be-deployed alignment technique for agentic LLM scaffolds.
Motivated reasoning and confirmation bias are polarizing the AI safety community into optimist and doomer camps, creating a dangerous pool of expert opinions that policymakers can selectively invoke.
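
The internal independent review claim above describes a control flow in which a separate reviewer inspects each proposed action before it is executed. A minimal Python sketch of that gate follows. Every name in it (Action, Verdict, keyword_reviewer, run_with_review, echo_executor) is hypothetical and not taken from the original post, and the keyword check is only a stand-in for a call to a separate model instance.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    """An action proposed by the agent (hypothetical structure)."""
    tool: str
    argument: str


@dataclass(frozen=True)
class Verdict:
    """The reviewer's decision on one proposed action."""
    approved: bool
    reason: str


Reviewer = Callable[[Action], Verdict]
Executor = Callable[[Action], str]


def keyword_reviewer(action: Action) -> Verdict:
    # Assumption: stands in for a separate model instance that judges the action.
    # A real reviewer would be another model call, not a keyword match.
    blocked_patterns = ("rm -rf", "disable_oversight")
    for pattern in blocked_patterns:
        if pattern in action.argument:
            return Verdict(False, f"matched blocked pattern: {pattern}")
    return Verdict(True, "no blocked pattern found")


def echo_executor(action: Action) -> str:
    # Assumption: stands in for real tool execution; it only reports the action.
    return f"RAN {action.tool}: {action.argument}"


def run_with_review(
    actions: List[Action], reviewer: Reviewer, executor: Executor
) -> List[str]:
    """Execute only the actions the reviewer approves; log the rest as blocked."""
    log: List[str] = []
    for action in actions:
        verdict = reviewer(action)  # the review happens before execution
        if verdict.approved:
            log.append(executor(action))
        else:
            log.append(f"BLOCKED {action.tool}: {verdict.reason}")
    return log


proposed = [Action("shell", "ls /tmp"), Action("shell", "rm -rf /")]
results = run_with_review(proposed, keyword_reviewer, echo_executor)
```

The point of the design is that the executor is never reached unless the reviewer approves, so the reviewer can be swapped for a stronger check without changing the agent loop.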

Key quotes

I'm trying to predict what the first transformative AI will be, in enough mechanistic detail that we can predict likely failure modes of its alignment.

I think there's a neglected third option: carefully predicting properties of the first AIs in which it's really important to get alignment right.

I think nobody knows how hard alignment is, and that's a dangerous position to be in.