AI Systems Achieve Verifiable Mathematical Reasoning

closed · v7 · 2026-06-08 · 86 items

What's new in v7

The main addition is concrete year-over-year competition data: AI models that failed USAMO 2025 are now performing strongly on USAMO 2026 [14][15], which gives empirical support to Rohan Paul's benchmark-saturation prediction [16]. A Manifold prediction market now tracks whether a perfect IMO 2026 score will be achieved [17]. DeepMind's CEO has also explicitly ruled out AGI framing for the math achievements [21]. This new organizational position replaces the prior sixth tension (bullish timelines vs. cautious published research) with a sharper version: the producing organization itself now stands against the most expansive outside predictions.

What

Two AI systems have produced expert-validated results in open research mathematics. OpenAI's reasoning model disproved the Erdős unit distance conjecture (open since 1946) [3], confirmed as a genuine milestone by Fields Medalist Tim Gowers and University of Toronto professor Daniel Litt [6]. Google DeepMind's AlphaProof Nexus solved 9 open Erdős problems and 44 OEIS conjectures [7][8][9]. At competition level, systems from both Google DeepMind and OpenAI achieved gold-medal performance at IMO 2025 [1][2], and AI models that failed USAMO 2025 are now performing strongly on USAMO 2026 [14][15], a year-over-year reversal that supports predictions of rapid benchmark saturation [16]. A Manifold prediction market now tracks whether a perfect IMO 2026 score will be achieved [17].

Why it matters

Expert mathematician validation now attaches to specific AI-produced results, not just benchmark scores. The USAMO year-over-year reversal adds concrete data to the claim that competition math is nearing its ceiling as an AI evaluation framework, sharpening the question of what replaces it.

Open questions

AlphaProof Nexus solved 9 Erdős problems and 44 OEIS conjectures [7][8] — what are the specific problems, how do they compare in difficulty to the OpenAI unit distance disproof, and how has independent verification proceeded?
AI models went from failing USAMO 2025 to strong performance on USAMO 2026 [14][15]; if competition benchmarks near saturation within a year as Rohan Paul predicts [16], what evaluation frameworks can distinguish degrees of difficulty in research-level AI mathematics?
Tao's concern about 'mass-produced mathematics at scale' [20] — has the mathematical community or any institution developed a systematic response to review-capacity pressure as systems resolve dozens of conjectures simultaneously?
DeepMind's CEO explicitly rules out AGI framing for these math achievements [21] — how do researchers distinguish progress in mathematical reasoning from broader general intelligence claims, and does that distinction hold as systems tackle more open-ended research problems?

Narrative

In 2025 and the first half of 2026, AI systems reached levels of mathematical performance that had previously seemed distant. Google DeepMind and OpenAI both achieved gold-medal-level performance at the 2025 International Mathematical Olympiad [1][2], the first AI systems to clear that threshold. Competition mathematics, where problems are pre-selected and success criteria are fixed, had been the primary measure of AI mathematical capability. The more contested question has been whether AI can produce genuine research mathematics: open problems, with no curated problem set and no guarantee that a solution exists.

On that question, expert mathematician endorsement has accumulated around OpenAI's Erdős unit distance disproof. The result, a counterexample to a discrete-geometry conjecture open since 1946 that was produced by a general-purpose reasoning model with no formal-system scaffolding [3], appeared in an arXiv preprint [4] and was acknowledged by combinatorialist Gil Kalai [5]. Fields Medalist Tim Gowers stated 'there is no doubt that the solution to the unit-distance problem is a milestone in AI mathematics' [6]; University of Toronto professor Daniel Litt called it 'the first example of a result produced autonomously by an AI that I find exciting in itself, as opposed to as a leading indicator' [6]. Google DeepMind's AlphaProof Nexus produced results at greater volume: 9 open Erdős problems and 44 OEIS conjectures [7][8][9]. Zvi Mowshowitz considers that a landmark result but notes that it received almost no mainstream media coverage [8].

The methodological gap between the two approaches remains substantive. DeepMind's architecture pairs large language models with the Lean theorem prover, formally verifying each reasoning step [10][11]. OpenAI's Erdős counterexample came from a general-purpose model with no such scaffolding — it survived external mathematician review but carries no machine-checkable audit trail. A Google research paper found that breaking proof-writing into planning and step-by-step checking raised general LLM performance on formal math benchmarks from under 10% to 70% [12], suggesting structured decomposition may be the operative mechanism even without full Lean verification. The University of Washington Math AI Lab's dataset of erroneous Lean proofs [13] adds pressure on the formal-verification side: if the verification layer itself produces false proofs, formal-system grounding may be less reliable than the architecture implies.
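As a minimal illustration of what a machine-checkable audit trail means in this context, the Lean 4 sketch below shows two tiny theorems whose proofs Lean accepts only if every step checks. It is an illustrative example written for this summary, not taken from AlphaProof Nexus or any cited source; the theorem names are hypothetical, while `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- Hypothetical example, not drawn from any cited system.
-- Lean accepts this only because the proof term has exactly the stated type.
theorem sum_commutes (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A step-by-step variant: each rewrite is checked by Lean as it is applied.
theorem sum_commutes_three (a b c : Nat) : a + b + c = c + (b + a) := by
  rw [Nat.add_comm a b]        -- goal becomes: b + a + c = c + (b + a)
  rw [Nat.add_comm (b + a) c]  -- both sides now match, so the goal closes
```

If either proof contained an invalid step, Lean would reject the file. A natural-language argument from a general-purpose model has no comparable automatic check and relies on human review instead.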

The trajectory of competition math performance is now backed by concrete year-over-year data: AI models that failed USAMO 2025 are performing strongly on USAMO 2026 [14][15], consistent with Rohan Paul's prediction that competition benchmarks are approaching obsolescence and that a perfect IMO score will be achieved within a year [16]. A Manifold prediction market is now tracking that threshold [17]. Terence Tao, who actively curates AI contributions on his Erdős problems GitHub wiki [18] and has personally verified several AI-assisted results [19], has raised a counterpoint: he sees 'more and more mass-produced mathematics at scale' [20], a quality concern whose weight grows as systems announce dozens of conjecture resolutions simultaneously. DeepMind's CEO has explicitly ruled out AGI framing for these math achievements [21]. That is a notable hedge from the organization whose system is resolving open Erdős problems at scale, and it sits directly against the more expansive timelines proposed by outside commentators.

Timeline

1946: Paul Erdős poses the unit distance conjecture in discrete geometry [3]
2024-07-01: DeepMind's AlphaProof and AlphaGeometry 2 earn silver medal at IMO 2024 [33]
2025-04-01: AI models fail USAMO 2025; performance described as miserable relative to competition standard [14]
2025-07-21: NYT reports Google AI wins gold at IMO 2025; OpenAI also claims gold — first AI systems to clear this threshold [1][2][23]
2025-08-03: Xena Project publishes mathematical community round-up of AI performance at IMO 2025 [31]
2026-02-01: The Atlantic publishes longform profile on how Terence Tao uses AI in his research [24]
2026-05-20: OpenAI announces unreleased reasoning model disproved the Erdős unit distance conjecture [3][27]
2026-05-21: Gil Kalai acknowledges disproof as 'Amazing'; widespread media amplification including The Guardian [5][34][35]
2026-05-22: arXiv preprint on the disproof appears; formal verification debate surfaces in coverage [4][10]
2026-05-23: Three Erdős problems fall in seven days with Tao verifying each; Gary Marcus publishes critical review of AI math headlines [19][26]
2026-05-24: Tao's GitHub wiki tracking AI contributions confirmed; Tao quoted on 'mass-produced mathematics at scale'; AlphaProof Nexus first mentioned [18][20][36]
2026-05-28: Zvi Mowshowitz reports AlphaProof Nexus solved 9 Erdős problems and 44 OEIS conjectures; notes landmark result with almost no mainstream media coverage [8][7]
2026-06-01: Ars Technica: Fields Medalist Tim Gowers and Daniel Litt explicitly validate OpenAI's Erdős result as substantively exciting [6]
2026-06-04: Google research paper: structured proof-writing raises general LLM formal math performance from under 10% to 70% [12]
2026-06-07: Rohan Paul predicts competition math benchmarks near obsolescence; expects a model with a perfect IMO score within one year [16]
2026-06-08: Reports confirm AI models that failed USAMO 2025 now perform strongly on USAMO 2026; prediction market opens on perfect IMO 2026 score; DeepMind CEO rules out AGI framing for math achievements [14][15][17][21]

Perspectives

OpenAI

A general-purpose reasoning model with no mathematical specialization disproved an 80-year-old open conjecture and achieved gold-medal performance at IMO 2025, demonstrating discovery capability without formal-system scaffolding.

Evolution: Stronger: Fields Medalist Tim Gowers and Daniel Litt have explicitly validated the Erdős result as substantively exciting, moving expert endorsement from acknowledgment to enthusiasm.

[3][4][22][2][23][6]

Google DeepMind

Achieved gold at IMO 2025, maintains a Lean-grounded theorem-proving architecture where every step is formally verified, and AlphaProof Nexus has solved 9 open Erdős problems and 44 OEIS conjectures — while the CEO explicitly frames these achievements as separate from AGI.

Evolution: Added: the CEO now publicly distances the math achievements from AGI claims, providing an organizational hedge against the most expansive interpretations of AlphaProof Nexus's results.

[10][1][23][7][8][9][21]

Terence Tao

Actively curates AI contributions on his Erdős problems GitHub wiki and has personally verified several AI-assisted results, while publicly expressing concern about 'more and more mass-produced mathematics at scale.'

Evolution: Consistent dual position as engaged participant and quality skeptic; the quality concern gains weight as systems resolve dozens of conjectures simultaneously.

[18][24][19][20][25]

Tim Gowers and Daniel Litt (mathematician validators)

Both explicitly confirm the OpenAI Erdős disproof as substantively significant — Gowers calls it 'a milestone in AI mathematics'; Litt calls it the first AI result he finds 'exciting in itself, as opposed to as a leading indicator.'

Evolution: Consistent since entering the thread; these remain the most senior mathematician endorsements of any specific AI math result to date.

[6]

Rohan Paul (AI commentator)

Competition math benchmarks are approaching obsolescence; expects a model capable of a perfect IMO score within one year, and frames structured proof-decomposition as the key architectural mechanism behind AI's formal math progress.

Evolution: The USAMO 2025-to-2026 reversal provides concrete year-over-year data supporting his timeline; a prediction market now tracks the IMO threshold he named.

[10][11][12][16][14][17]

Gary Marcus

Explicitly auditing whether AI math headlines match underlying results; skeptical of headline claims about OpenAI and Anthropic mathematical achievements.

Evolution: Consistent; now positioned against explicit Fields Medalist endorsement of the OpenAI result, making the skeptic-vs-validator contrast sharper.

[26]

Harmonic (Tudor Achim)

Formal verification is the key epistemological shift enabling trustworthy AI mathematics; AI could prove the Riemann Hypothesis by 2028.

Evolution: Consistent and promotional; no new substantive claims beyond the founding thesis.

[27][28][29][30]

Mathematical community (Zvi, Xena Project, LessWrong, Hacker News)

Fragmented: Zvi notes that landmark results like AlphaProof Nexus receive almost no mainstream coverage despite their magnitude; Xena Project provides formal-verification-community context; LessWrong and Hacker News probe what verification actually guarantees.

Evolution: Consistent; the coverage-gap observation remains a notable feature of the story.

[25][31][32][13][8]

Tensions

OpenAI's Erdős disproof came from a general-purpose model with no formal-system grounding; DeepMind's architecture requires every reasoning step to be Lean-verified — both have produced expert-validated results, but they represent different claims about what makes AI mathematical output trustworthy. [3][10][11][4][2][23][6]
Gary Marcus explicitly audits AI math headlines against underlying results [26], while Fields Medalist Tim Gowers and Daniel Litt have directly endorsed the OpenAI Erdős result as substantively exciting [6] — the most direct clash between the skeptical and endorsing positions. [26][6][3]
Tao observes 'more and more mass-produced mathematics at scale' [20] as a quality concern; Rohan Paul frames high-volume proof-search as precisely the mechanism for mathematical progress and expects benchmark saturation within a year [16] — the same phenomenon read as a problem versus a feature. [20][16][12]
AlphaProof Nexus solved 9 Erdős problems and 44 OEIS conjectures — results Zvi calls a landmark milestone — but received almost no mainstream coverage [8], while OpenAI's single Erdős disproof generated sustained press attention; the coverage gap does not track the volume of results. [8][7][3][22]
DeepMind's CEO explicitly rules out AGI framing for the math achievements [21], while Harmonic's Tudor Achim predicts AI could prove the Riemann Hypothesis by 2028 [27] and Rohan Paul expects a perfect IMO score within a year [16] — the producing organization and outside commentators draw opposite conclusions from the same results. [21][27][16]
The UW Math AI Lab's dataset of erroneous Lean proofs challenges the 'every step formally verified' claim central to DeepMind's and Harmonic's architectures — if the verification layer itself produces false proofs, formal-system grounding may be less reliable than claimed. [13][10][27][11]

Status: active and growing

Sources

[1] Google A.I. System Wins Gold Medal in International Math Olympiad — reactive:ai-formal-math-breakthroughs
[2] OpenAI claims gold-medal performance at IMO 2025 | Hacker News — reactive:ai-formal-math-breakthroughs
[3] 😸 OpenAI solved an 80-year math problem by... disproving it — The Neuron (2026-05-22)
[4] Remarks on the disproof of the unit distance conjecture - arXiv — reactive:openai-erdos-math-breakthrough
[5] Amazing: Erdős' Unit Distance Problem was Disproved! It was ... — reactive:openai-erdos-math-breakthrough
[6] An OpenAI model solved a famous math problem that stumped humans for 80 years — Ars Technica AI (2026-06-01)
[7] Google DeepMind's AlphaProof Nexus Solves 9 Erdős Problems and 44 OEIS Conjectures | KuCoin — reactive:ai-formal-math-breakthroughs
[8] AI #170: Lack of Executive Order — Zvi's AI Roundups (2026-05-28)
[9] DeepMind AlphaProof Nexus solves 9 Erdős problems - AI Weekly — reactive:ai-formal-math-breakthroughs
[10] Google DeepMind's new paper. — Rohan Paul Twitter (2026-05-22)
[11] @tomflex @prz_chojecki Sure! DeepMind built AI agents that pair LLMs (for generating ideas) with the Lean theorem prover... — reactive:ai-formal-math-breakthroughs (2026-05-24)
[12] Another great paper from Google. — Rohan Paul Twitter (2026-06-04)
[13] UW Math AI Lab Releases Erroneous Lean Proofs Dataset - LinkedIn — reactive:ai-formal-math-breakthroughs
[14] Last year, models miserably failed on USAMO 2025. In 2026, GPT ... — reactive:ai-formal-math-breakthroughs
[15] USAMO 2026 — reactive:ai-formal-math-breakthroughs
[16] "Pretty soon, competition math, competition coding, is not going to be interesting anymore." — Rohan Paul Twitter (2026-06-07)
[17] Perfect score achieved by an AI model in the International Math Olympiad (IMO) 2026? | Manifold — reactive:ai-formal-math-breakthroughs
[18] AI contributions to Erdős problems · teorth/erdosproblems Wiki — reactive:ai-formal-math-breakthroughs
[19] Three Erdős Problems Fell in Seven Days, and Terence Tao Verified ... — reactive:ai-formal-math-breakthroughs
[20] "I do see more and more mass-produced mathematics at scale." — Rohan Paul Twitter (2026-05-24)
[21] DeepMind AI solves decades-old math problems but CEO rules out ... — reactive:ai-formal-math-breakthroughs
[22] OpenAI announces AI's biggest math breakthrough yet — reactive:openai-erdos-math-breakthrough
[23] Google and OpenAI Get 2025 IMO Gold - LessWrong — reactive:ai-formal-math-breakthroughs
[24] The Edge of Mathematics - The Atlantic — reactive:ai-formal-math-breakthroughs
[25] Now that it's 2026, how is Terence Tao's prediction holding up? : r/math — reactive:openai-erdos-math-breakthrough
[26] Checking the math behind OpenAI and Anthropic's latest headlines — reactive:ai-formal-math-breakthroughs
[27] 😺 🎙️ PODCAST: Can AI Solve Math's Biggest Mystery? — The Neuron (2026-05-20)
[28] [PDF] Aristotle: IMO-level Automated Theorem Proving - arXiv — reactive:ai-formal-math-breakthroughs
[29] Aristotle from Harmonic just proved Erdos Problem #124 in Lean all ... — reactive:ai-formal-math-breakthroughs
[30] Harmonic — reactive:ai-formal-math-breakthroughs
[31] AI at IMO 2025: a round-up - Xena Project - WordPress.com — reactive:ai-formal-math-breakthroughs
[32] Gemini with Deep Think achieves gold-medal standard at the IMO | Hacker News — reactive:ai-formal-math-breakthroughs
[33] Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2 ... — reactive:ai-formal-math-breakthroughs
[34] OpenAI makes breakthrough on 80-year-old maths problem — reactive:openai-erdos-math-breakthrough
[35] OpenAI's internal model disproves Unit Distance Conjecture of Erdos — reactive:openai-erdos-math-breakthrough
[36] @tobiamure @Polymarket This is today's big news from Google DeepMind: their new AI agent (AlphaProof Nexus) autonomously... — reactive:ai-formal-math-breakthroughs (2026-05-24)