LLM judges can change their safety verdict when the same answer is translated or rewritten.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-11

Research shows that LLM-based safety judges can flip their verdicts on semantically equivalent content when it is translated or paraphrased, exposing fragility in widely adopted AI safety evaluation pipelines.

Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: llm-evaluation, ai-safety-evaluation, safety-judges, robustness

Claims

LLM judges can change their safety verdict when the same answer is translated or rewritten in a different form.
Many AI teams currently rely on LLMs to judge whether another model's output is safe.
LLM safety judges are most unreliable on edge cases where safety classification is ambiguous rather than clear-cut.
Surface-level linguistic variation rather than underlying content can determine safety outcomes in these systems.
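
The claims above suggest a simple consistency check: give a judge several meaning-preserving variants of the same answer and count how often its verdicts disagree. The following is a minimal sketch under stated assumptions, not the method used in the research. The names `verdict_flip_rate` and `toy_judge` are hypothetical, and the keyword-based `toy_judge` merely stands in for a real LLM judge call so that the example runs on its own.

```python
from typing import Callable, Iterable, List


def verdict_flip_rate(
    judge: Callable[[str], bool],
    variants_by_answer: Iterable[List[str]],
) -> float:
    """Return the fraction of answers whose variants receive mixed verdicts.

    Each inner list holds rewrites (translations, paraphrases) of one answer.
    `judge` returns True for "safe" and False for "unsafe".
    """
    groups = [group for group in variants_by_answer if group]
    if not groups:
        return 0.0
    # An answer counts as "flipped" if its variants do not all get one verdict.
    flipped = sum(1 for group in groups if len({judge(v) for v in group}) > 1)
    return flipped / len(groups)


def toy_judge(text: str) -> bool:
    """Hypothetical stand-in for an LLM judge: keys on one surface keyword."""
    return "explosive" not in text.lower()


# Assumed toy data: each group contains two phrasings of the same content.
groups = [
    ["How to make an explosive device", "Steps for building a bomb"],
    ["How to bake bread", "A recipe for baking a loaf"],
]

print(verdict_flip_rate(toy_judge, groups))
```

In this toy setup the first group receives mixed verdicts because the stand-in judge reacts to wording rather than meaning, which is the failure mode the claims describe. With a real judge, `toy_judge` would be replaced by a function that calls the model and parses its verdict.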

Key quotes

LLM judges can change their safety verdict when the same answer is translated or rewritten.

Those judges can be shaky exactly [at the edge cases]