LLMs often cannot tell when an attack made them say something unsafe.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-24
Research highlighted by Rohan Paul finds that LLMs cannot reliably detect when adversarial prefill attacks have caused them to generate unsafe content, rendering LLM self-evaluation an ineffective safety mechanism.
Extraction
Topics: llm-safetyadversarial-attacksai-securityjailbreaking
Claims
- LLMs often cannot detect when an adversarial attack caused them to produce unsafe output.
- Asking an LLM to evaluate whether its own previous response was compromised is not a reliable safety mechanism.
- Adversarial prefill attacks work by supplying a harmful opening line that the model then continues from, bypassing its safety training.
Key quotes
LLMs often cannot tell when an attack made them say something unsafe.
Asking an LLM whether its own previous answer was compromised is not a dependable safety check.
An adversarial prefill happens when the model is given a harmful opening line, then continues from that line.