Prompt Injection as Role Confusion

Simon Willison · Simon Willison · 2026-06-22

Simon Willison covers research by Ye, Cui, and Hadfield-Menell showing that LLMs classify text by stylistic similarity to role tags rather than tag content, enabling jailbreaks that drop attack success from 61% to 10% when text is "destyled."

Open original ↗

Extraction

Topics: prompt-injectionllm-securityjailbreakingrole-confusionai-safety

Claims

LLMs treat text as privileged or untrusted based on its style rather than the actual role tags wrapping it.
Text that mimics the format of a model's internal thinking blocks can override safety training, even for harmful requests.
Rewriting injected text to not match the expected role-tag style (destyling) drops average attack success from 61% to 10%.
Genuine role perception by LLMs is not currently achieved, making injection defense a perpetual whack-a-mole problem.
Subtle stylistic injections could shift model behavior at scale without triggering obvious detection.

Key quotes

To a human reader, these two versions say the same thing. But to the LLM, the difference is enormous: destyling causes average attack success in our dataset to plunge from 61% to 10%. A change nearly invisible to humans completely changes the LLM's role perception.

Unless LLMs achieve genuine role perception, we think injection defense will remain a perpetual whack-a-mole game.

The continuous nature of role boundaries opens the threat of injections designed to subtly shift LLM states through seemingly innocuous text, legally and at scale.