This paper shows a strange weakness in AI reasoning: models can solve math, yet fail to judge reasoning.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-16

A new paper finds that frontier AI models exhibit a systematic gap between problem-solving and meta-reasoning: models reliably reach correct mathematical answers yet fail to correctly judge whether another solution's reasoning is valid.

Open original ↗

Appears in

AI Agents Underperform Real-World Tasks: CAPTCHAs, Expert Benchmarks, and Memory Quality Failures

Extraction

Topics: ai-reasoningmeta-cognitionllm-evaluationmath-reasoning

Claims

Frontier models can produce correct answers to math problems but fail to evaluate the validity of reasoning in others' solutions.
The failure mode is not arithmetic error but an inability to assess reasoning processes.
A model can simultaneously recognize the correct answer in an external solution and still misjudge that solution's reasoning as flawed or correct.
This dissociation between solving and judging represents a previously underappreciated weakness in frontier model reasoning.

Key quotes

models can solve math, yet fail to judge reasoning

they can reach the right answer, see the right answer in someone else's solution, and [still misjudge the reasoning]