The Information Machine

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

Simon Willison · 2026-06-11

Anthropic reverses a hidden Claude Fable 5 safeguard that silently limited the model's effectiveness for frontier AI development requests, announcing flagged queries will now visibly fall back to Opus 4.8.

Extraction

Topics: anthropic-policy, llm-safeguards, ai-transparency, claude

Claims

  • Claude Fable 5 contained a hidden safeguard that would silently limit effectiveness for requests targeting frontier LLM development, without notifying the user.
  • Anthropic initially chose invisible safeguards because they are harder to probe and can be targeted more narrowly, which allowed faster deployment with fewer false positives.
  • After public outcry, Anthropic reversed the policy so that flagged requests now visibly fall back to Opus 4.8, consistent with existing safeguards for cyber and bio domains.
  • API calls that trigger the safeguard will return an explicit reason for refusal.

Key quotes

We made the wrong tradeoff and we apologize for not getting the balance right.
Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason—and that was the wrong tradeoff.
It's good news that they're dropping the invisible aspect of this. It would be a whole lot better if they dropped this category of refusals entirely.