LLM judges can change their safety verdict when the same answer is translated or rewritten.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-11
Research reveals that LLM-based safety judges can flip their verdicts on the same content when it is translated or paraphrased, exposing fragility in widely adopted AI safety evaluation pipelines.
Extraction
Topics: llm-evaluation, ai-safety-evaluation, safety-judges, robustness
Claims
- LLM judges can change their safety verdict when the same answer is translated or rewritten in a different form.
- Many AI teams currently rely on LLMs to judge whether another model's output is safe.
- LLM safety judges are most unreliable on edge cases where safety classification is ambiguous rather than clear-cut.
- Surface-level linguistic variation, rather than underlying content, can determine safety outcomes in these systems.
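
One way to make the claims above concrete is to measure how often a judge disagrees with itself across meaning-preserving variants of the same answer. The following is a minimal sketch, not the method used in the research: `flip_rate`, `toy_keyword_judge`, and the sample texts are hypothetical names and data introduced for illustration, and the keyword judge is only a stand-in for a real LLM call.

```python
from typing import Callable, List

Verdict = str  # assumed label set: "safe" or "unsafe"


def flip_rate(judge: Callable[[str], Verdict],
              variant_groups: List[List[str]]) -> float:
    """Return the fraction of groups where the judge gives more than one verdict.

    Each group holds translations or paraphrases of one answer, so every
    text in a group is assumed to carry the same underlying content.
    """
    if not variant_groups:
        return 0.0
    flipped = 0
    for variants in variant_groups:
        verdicts = {judge(text) for text in variants}
        if len(verdicts) > 1:  # the verdict changed with surface form alone
            flipped += 1
    return flipped / len(variant_groups)


def toy_keyword_judge(text: str) -> Verdict:
    """Hypothetical stand-in for an LLM judge: flags one English keyword."""
    return "unsafe" if "bypass" in text.lower() else "safe"


# Illustrative data only: each inner list is one answer in several surface forms.
groups = [
    [
        "Here is how to bypass the lock.",
        "Here is how to get around the lock.",
        "Voici comment contourner la serrure.",
    ],
    [
        "Drink water and rest.",
        "Rest and stay hydrated.",
    ],
]

rate = flip_rate(toy_keyword_judge, groups)
print(f"flip rate: {rate:.2f}")
```

In this toy setup the first group flips because only one phrasing contains the flagged keyword, while the second group stays consistent. Swapping `toy_keyword_judge` for a function that calls a real judge model would give a rough consistency check, assuming the variants really do preserve meaning.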
Key quotes
LLM judges can change their safety verdict when the same answer is translated or rewritten.
Those judges can be shaky exactly [at the edge cases]