Prompt Injection as Role Confusion
Simon Willison · Simon Willison · 2026-06-22
Simon Willison covers research by Ye, Cui, and Hadfield-Menell showing that LLMs classify text by stylistic similarity to role tags rather than tag content, enabling jailbreaks that drop attack success from 61% to 10% when text is "destyled."
Extraction
Topics: prompt-injectionllm-securityjailbreakingrole-confusionai-safety
Claims
- LLMs treat text as privileged or untrusted based on its style rather than the actual role tags wrapping it.
- Text that mimics the format of a model's internal thinking blocks can override safety training, even for harmful requests.
- Rewriting injected text to not match the expected role-tag style (destyling) drops average attack success from 61% to 10%.
- Genuine role perception by LLMs is not currently achieved, making injection defense a perpetual whack-a-mole problem.
- Subtle stylistic injections could shift model behavior at scale without triggering obvious detection.
Key quotes
To a human reader, these two versions say the same thing. But to the LLM, the difference is enormous: destyling causes average attack success in our dataset to plunge from 61% to 10%. A change nearly invisible to humans completely changes the LLM's role perception.
Unless LLMs achieve genuine role perception, we think injection defense will remain a perpetual whack-a-mole game.
The continuous nature of role boundaries opens the threat of injections designed to subtly shift LLM states through seemingly innocuous text, legally and at scale.