LLMs often cannot tell when an attack made them say something unsafe.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-24

Research highlighted by Rohan Paul finds that LLMs cannot reliably detect when adversarial prefill attacks have caused them to generate unsafe content, rendering LLM self-evaluation an ineffective safety mechanism.

Open original ↗

Extraction

Topics: llm-safetyadversarial-attacksai-securityjailbreaking

Claims

LLMs often cannot detect when an adversarial attack caused them to produce unsafe output.
Asking an LLM to evaluate whether its own previous response was compromised is not a reliable safety mechanism.
Adversarial prefill attacks work by supplying a harmful opening line that the model then continues from, bypassing its safety training.

Key quotes

LLMs often cannot tell when an attack made them say something unsafe.

Asking an LLM whether its own previous answer was compromised is not a dependable safety check.

An adversarial prefill happens when the model is given a harmful opening line, then continues from that line.