Sympathy for both sides of the egregious misalignment debate

Alignment Forum · Steven Byrnes · 2026-06-12

Alignment researcher Steven Byrnes argues that the Yudkowsky/Soares position (superintelligent AI will be catastrophically misaligned without breakthrough solutions) and the LLM-researcher position (existing alignment techniques are working) are both substantially correct, and that the two can be reconciled by concluding that LLMs will not scale to superintelligence.

Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: ai-alignment, superintelligence, llm-safety, existential-risk, alignment-debate

Claims

Careful analysis of superintelligent AI properties gives strong independent reasons to expect egregious misalignment and scheming, absent yet-to-be-invented technical breakthroughs.
Careful analysis of current LLMs gives equally strong reasons to think existing alignment techniques are adequate for current systems and may remain adequate going forward.
These two apparently contradictory views can be reconciled if LLMs do not scale to superintelligent AI, an outcome Byrnes considers likely.
Extending LLMs via RLVR or open-ended continual learning would require a ground truth signal amounting to an objective function, and sufficiently optimizing against it would dilute human-niceness from pretraining in favor of ruthless maximization.
Yudkowsky and Soares overstate how far the evidence from current LLMs shows that technical alignment is fundamentally unsolved, while LLM researchers underestimate the risks introduced by future continual learning paradigms.

Key quotes

I think BOTH of the following are true: (1) If you really think carefully about the properties of ASI, you really do find good reasons to strongly expect it to be egregiously misaligned, scheming, and ruthless... (2) If you really think carefully about the properties of current LLMs, you really do find good reasons to think that existing technical alignment techniques are adequate now.

We can reconcile them by saying that LLMs won't scale to ASI.

when the LLM updates enough on that ground truth, then whatever human-niceness that the LLM inherited from pretraining will get diluted away in favor of ruthless maximization of that objective function.