This paper shows a strange weakness in AI reasoning: models can solve math, yet fail to judge reasoning.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-16
A new paper finds that frontier AI models exhibit a systematic gap between problem-solving and meta-reasoning: models reliably reach correct mathematical answers yet fail to correctly judge whether another solution's reasoning is valid.
Extraction
Topics: ai-reasoning, meta-cognition, llm-evaluation, math-reasoning
Claims
- Frontier models can produce correct answers to math problems but fail to evaluate the validity of reasoning in others' solutions.
- The failure mode is not arithmetic error but an inability to assess reasoning processes.
- A model can recognize the correct answer in an external solution and still misjudge that solution's reasoning, either accepting a flawed argument as valid or rejecting a valid one as flawed.
- This dissociation between solving and judging represents a previously underappreciated weakness in frontier model reasoning.
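One way to make the solving-versus-judging dissociation in the claims above concrete is to score both abilities on the same problems and compare them. The sketch below is illustrative only and is not the paper's protocol: the `Record` fields, the `solve_judge_gap` function, and the toy records are all hypothetical, and the numbers are made up to exercise the code rather than taken from any reported result.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One problem, scored two ways (all fields are hypothetical labels)."""
    solved_correctly: bool    # model's own final answer matched the reference
    reasoning_is_valid: bool  # ground-truth label for the external solution's reasoning
    judged_valid: bool        # model's verdict on that external reasoning


def solve_judge_gap(records):
    """Return solving accuracy, judging accuracy, and how often they dissociate."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")

    solve_acc = sum(r.solved_correctly for r in records) / n
    judge_acc = sum(r.judged_valid == r.reasoning_is_valid for r in records) / n
    # Dissociation: the model solved the problem itself but misjudged the reasoning.
    dissociation_rate = sum(
        r.solved_correctly and r.judged_valid != r.reasoning_is_valid
        for r in records
    ) / n

    return {
        "solve_acc": solve_acc,
        "judge_acc": judge_acc,
        "gap": solve_acc - judge_acc,
        "dissociation_rate": dissociation_rate,
    }


# Toy data, invented for illustration only.
toy_records = [
    Record(solved_correctly=True, reasoning_is_valid=True, judged_valid=True),
    Record(solved_correctly=True, reasoning_is_valid=False, judged_valid=True),
    Record(solved_correctly=True, reasoning_is_valid=True, judged_valid=False),
    Record(solved_correctly=False, reasoning_is_valid=False, judged_valid=False),
]

metrics = solve_judge_gap(toy_records)
print(metrics)
```

Under this framing, a large positive `gap` together with a high `dissociation_rate` would correspond to the pattern the claims describe: failures concentrated in assessing reasoning rather than in reaching the answer.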
Key quotes
models can solve math, yet fail to judge reasoning
they can reach the right answer, see the right answer in someone else's solution, and [still misjudge the reasoning]