More details on Fable 5’s cyber safeguards and our jailbreak framework

Anthropic News · 2026-07-02

Anthropic publishes a detailed four-category cybersecurity classifier taxonomy for Claude Fable 5 and proposes an industry Cyber Jailbreak Severity (CJS) framework scored on four axes, developed with Glasswing partners including Amazon, Microsoft, and Google.

Appears in

Claude Fable 5: Model Update, Safety Profile, Benchmarks, and Subscriber Trial Rollout

Extraction

Topics: ai-safety, cybersecurity-classifiers, jailbreak-framework, ai-policy, dual-use-ai

Claims

Anthropic classifies cybersecurity uses into four tiers (prohibited, high-risk dual use, low-risk dual use, and benign), with classifiers intended to block the first two.
The proposed CJS framework scores jailbreaks on four axes (capability gain, breadth, ease of weaponization, discoverability); the axis scores are summed to a 0–10 total, which maps to five severity levels on a logarithmic scale.
Fable 5 blocks all high-risk dual-use cybersecurity activities, including penetration testing, exploit development, and credential attacks, until better access controls exist to limit them to authorized users.
Anthropic employs a deliberate 'safety margin' that blocks some low-risk and benign uses to reduce the risk that high-risk prompts pass through classifiers undetected.
A HackerOne bug-bounty program has been launched for security researchers to submit potential Fable 5 cyber jailbreaks for review.
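
The tiering and scoring described in the claims above can be sketched roughly as follows. This is an illustrative sketch only, not Anthropic's implementation: the names `CyberTier`, `is_blocked`, `CjsAxes`, and `severity_level` are hypothetical, the per-axis ranges are not given in this summary (only the 0–10 total is checked), and the equal-width level cut-offs are a placeholder assumption that does not model the logarithmic scale.

```python
from dataclasses import dataclass
from enum import Enum


class CyberTier(Enum):
    """The four tiers named in the claims (identifiers are hypothetical)."""
    PROHIBITED = "prohibited"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign"


# Per the claims, classifiers are intended to block the first two tiers.
BLOCKED_TIERS = frozenset({CyberTier.PROHIBITED, CyberTier.HIGH_RISK_DUAL_USE})


def is_blocked(tier: CyberTier, in_safety_margin: bool = False) -> bool:
    """Return True if a request in this tier would be blocked.

    ASSUMPTION: `in_safety_margin` is a hypothetical flag marking low-risk
    or benign requests that fall inside the deliberate safety margin.
    """
    return tier in BLOCKED_TIERS or in_safety_margin


@dataclass(frozen=True)
class CjsAxes:
    """Scores on the four CJS axes; per-axis ranges are not given in the source."""
    capability_gain: float
    breadth: float
    ease_of_weaponization: float
    discoverability: float

    def total(self) -> float:
        """Sum the four axes and check the result lies in the 0-10 range."""
        score = (self.capability_gain + self.breadth
                 + self.ease_of_weaponization + self.discoverability)
        if not 0 <= score <= 10:
            raise ValueError(f"CJS total {score} is outside 0-10")
        return score


def severity_level(score: float) -> int:
    """Map a 0-10 total to a severity level from 1 to 5.

    ASSUMPTION: equal-width bands of 2 points each. The real cut-offs and
    the logarithmic interpretation of the levels are not specified here.
    """
    if not 0 <= score <= 10:
        raise ValueError(f"CJS total {score} is outside 0-10")
    return min(5, int(score // 2) + 1)


axes = CjsAxes(capability_gain=2.0, breadth=1.5,
               ease_of_weaponization=2.5, discoverability=1.5)
print(axes.total(), severity_level(axes.total()))
```

Keeping the tier decision separate from the jailbreak score reflects the summary's framing: the taxonomy governs what the classifiers block, while CJS rates how serious a bypass of those classifiers is.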

Key quotes

AI jailbreaks are unusual ways of prompting an AI model to bypass its safeguards, thus unblocking the behaviors (like dangerous or potentially dangerous cybersecurity tasks) we seek to prevent.

The safety margin includes many benign uses which we would prefer to allow, but which we block out of an abundance of caution.

We believe that by working together, we can establish a standard that enables the defensive uses of this technology while preventing its misuse.