The Information Machine

Prompt Injection as Role Confusion

Simon Willison · Simon Willison · 2026-06-22

Simon Willison covers research by Ye, Cui, and Hadfield-Menell showing that LLMs classify text by stylistic similarity to role tags rather than tag content, enabling jailbreaks that drop attack success from 61% to 10% when text is "destyled."

Open original ↗

Extraction

Topics: prompt-injectionllm-securityjailbreakingrole-confusionai-safety

Claims

  • LLMs treat text as privileged or untrusted based on its style rather than the actual role tags wrapping it.
  • Text that mimics the format of a model's internal thinking blocks can override safety training, even for harmful requests.
  • Rewriting injected text to not match the expected role-tag style (destyling) drops average attack success from 61% to 10%.
  • Genuine role perception by LLMs is not currently achieved, making injection defense a perpetual whack-a-mole problem.
  • Subtle stylistic injections could shift model behavior at scale without triggering obvious detection.

Key quotes

To a human reader, these two versions say the same thing. But to the LLM, the difference is enormous: destyling causes average attack success in our dataset to plunge from 61% to 10%. A change nearly invisible to humans completely changes the LLM's role perception.
Unless LLMs achieve genuine role perception, we think injection defense will remain a perpetual whack-a-mole game.
The continuous nature of role boundaries opens the threat of injections designed to subtly shift LLM states through seemingly innocuous text, legally and at scale.