AI Systems Achieve Verifiable Mathematical Reasoning · history

Version 4

2026-05-26 08:25 UTC · 69 items

Changes since v3

Two significant additions this pass. First, LessWrong and Hacker News coverage indicates that both Google and OpenAI achieved gold-medal performance at IMO 2025 [7][6]. This complicates the prior framing that OpenAI was 'under fire' as DeepMind claimed the gold: both now sit at the same competitive threshold, and the Xena Project's mathematical-community round-up [8] adds expert-community credibility to those results. Second, Tao is quoted as saying he sees 'more and more mass-produced mathematics at scale' [12], which introduces a mathematician-level quality concern that sits in tension with the capability gains celebrated elsewhere. Rohan Paul frames the observation optimistically as AI proof search at scale, but the concern about overwhelming peer review is new to this pass. A UW dataset of erroneous Lean proofs [15] also adds empirical evidence bearing on the reliability of formal-verification systems.

What

In late May 2026, AI-assisted mathematics reached a visible inflection point across multiple fronts. OpenAI's general-purpose reasoning model disproved the Erdős unit distance conjecture (open since 1946), verified by external mathematicians and documented in a formal arXiv preprint [2][4]. Both Google DeepMind and OpenAI are confirmed to have achieved gold-medal-level performance at the 2025 International Mathematical Olympiad [7][6][5]—the first AI systems to reach this threshold. Terence Tao is actively curating AI contributions to Erdős problems [9] while also sounding a quality alarm: he observes 'more and more mass-produced mathematics at scale' [12], a warning that capability growth may be outpacing mathematical quality control.

Why it matters

Two independent organizations have now produced results that survive expert scrutiny in distinct domains (open-conjecture resolution and competition mathematics) within the same compressed window, and one of the world's most prominent active mathematicians is simultaneously curating AI's contributions and flagging quality concerns. If AI-generated proofs flood the literature faster than the mathematical community can review them, the reliability of formal verification becomes a first-order question rather than a selling point.

Open questions

Both Google and OpenAI are reported to have achieved gold at IMO 2025 [7][6]—but do their approaches differ in method, scoring breakdown, or problem coverage, and is one result more formally verified than the other?
Tao observes 'more and more mass-produced mathematics at scale' [12]—is this a quality concern about AI-generated proofs overwhelming peer review, and what institutional responses are emerging from the mathematical community?
The UW Math AI Lab released a dataset of erroneous Lean proofs [15]—what does this imply about reliability rates in formal-verification systems, and does it undercut the 'every step verified' claim central to architectures like AlphaProof?
A new Google DeepMind system called AlphaProof Nexus has been mentioned [19]—what capabilities does it add beyond AlphaProof, and is it associated with the IMO 2025 gold result?

Narrative

In late May 2026, a cluster of AI-assisted mathematical results drew sustained engagement from mathematicians, the scientific press, AI researchers, and a broader public. The anchoring event was OpenAI's announcement that an unreleased, general-purpose reasoning model—without mathematical specialization or formal-system scaffolding—had produced a counterexample disproving the Erdős unit distance conjecture, a discrete-geometry problem posed in 1946 [1]. Princeton mathematician Will Sawin subsequently sharpened the result, external mathematicians co-signed verification, and a formal arXiv preprint appeared [2]; prominent combinatorialist Gil Kalai, who had worked on closely related problems, acknowledged the result on his mathematics blog as 'Amazing' [3]. Scientific American framed it as 'AI's biggest math breakthrough yet' [4]. Set alongside this: both Google DeepMind and OpenAI had achieved gold-medal-level performance at the 2025 International Mathematical Olympiad [5][6][7]—confirmed by the New York Times, LessWrong, and Hacker News—a threshold no AI had previously reached. The Xena Project, embedded in the formal-verification community, published a detailed mathematical round-up of AI performance at IMO 2025 [8], lending expert-community credibility to results assessed mainly through media coverage.

The mathematician who has come most visibly into focus is Terence Tao. He maintains a public GitHub wiki on his erdosproblems repository tracking AI contributions [9], was profiled in The Atlantic on his AI workflow [10], and is reported to have personally verified several AI-assisted Erdős proofs [11]. But Tao has also introduced a note of caution: Rohan Paul quoted him observing 'I do see more and more mass-produced mathematics at scale' [12]. Paul interprets this optimistically—AI turns proof-writing into a search problem, generating thousands of candidate lemmas filtered by cheap automated checkers [12]—but the same framing raises a quality question: if the volume of AI-generated mathematics outpaces expert review, who validates the checkers?
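
The generate-and-filter pattern Paul describes, and the question of who validates the checkers, can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the candidate claims, the names `CANDIDATES` and `cheap_filter`, and the use of bounded testing as the "cheap checker" are hypothetical and are not taken from any system discussed here. Bounded testing is not formal verification; it stands in for any inexpensive check that can pass a false statement.

```python
from typing import Callable, Dict, List


def is_prime(k: int) -> bool:
    """Trial-division primality test; adequate for the small values used here."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


# Hypothetical candidate "lemmas": each claims a property of every integer n >= 0.
CANDIDATES: Dict[str, Callable[[int], bool]] = {
    "n^2 + n is even": lambda n: (n * n + n) % 2 == 0,        # true for all n
    "n^2 + n + 41 is prime": lambda n: is_prime(n * n + n + 41),  # first fails at n = 40
    "2^n > n^2": lambda n: 2 ** n > n * n,                    # first fails at n = 2
}


def cheap_filter(candidates: Dict[str, Callable[[int], bool]], budget: int) -> List[str]:
    """Keep claims that hold for n = 0 .. budget - 1 (a bounded test, not a proof)."""
    return [
        name
        for name, claim in candidates.items()
        if all(claim(n) for n in range(budget))
    ]


survivors_small = cheap_filter(CANDIDATES, budget=2)    # very cheap check
survivors_medium = cheap_filter(CANDIDATES, budget=40)  # more thorough, still bounded
survivors_large = cheap_filter(CANDIDATES, budget=41)   # reaches the counterexample

print(survivors_small)
print(survivors_medium)
print(survivors_large)
```

With a budget of 40 the false claim about n^2 + n + 41 still survives, because its first counterexample is n = 40 (1681 = 41 × 41). The output of a filter of this kind is only as trustworthy as the checker behind it, which is the quality question raised above.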

The architectural debate about what these results demonstrate forms the central intellectual fault line. DeepMind's approach pairs large language models with the Lean theorem prover, with each reasoning step formally verified before the system proceeds [13][14]. Rohan Paul argues that this constraint confines success to 'carefully constrained worlds' rather than open-ended reasoning. A University of Washington Math AI Lab dataset of erroneous Lean proofs [15] adds new evidence bearing on the reliability of the verification layer itself. Harmonic's Tudor Achim maintains that formal verification is the key epistemological shift and that AI could reach the Riemann Hypothesis by 2028 [16]. Gary Marcus has explicitly checked whether AI math headlines match underlying results [17], occupying the most visible skeptical position in an otherwise enthusiastic coverage landscape.
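
For readers unfamiliar with what a formally verified step looks like, a minimal Lean 4 sketch follows. It is not drawn from DeepMind's or Harmonic's systems, and the theorem names are illustrative. It shows only the basic contract: Lean accepts a statement when the supplied proof term type-checks against it, and rejects the file otherwise.

```lean
-- Accepted: `n + 0` reduces to `n` by the definition of addition on Nat.
theorem n_add_zero (n : Nat) : n + 0 = n := rfl

-- Accepted: justified by the core-library lemma `Nat.zero_add`.
theorem zero_add_n (n : Nat) : 0 + n = n := Nat.zero_add n
```

In an LLM-plus-Lean pipeline of the kind described above, the language model would propose statements and proof terms, and a check of this sort would decide which ones are kept.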

A competitive dimension complicates the story. Reports that OpenAI was 'under fire' as DeepMind claimed the IMO gold [18] have been overtaken by evidence that both organizations achieved gold-medal performance at IMO 2025 [7][6]. A new Google DeepMind system, AlphaProof Nexus, has also been mentioned [19], suggesting ongoing capability development. The r/math community is actively auditing Tao's earlier AI predictions against 2026 reality [20], while r/singularity and Hacker News have fragmented along enthusiasm-versus-interrogation lines [21][22].

Timeline

1946-01-01: Paul Erdős poses the unit distance conjecture in discrete geometry [1]
2024-07-01: DeepMind's AlphaProof and AlphaGeometry 2 earn silver medal at IMO 2024 [31]
2025-07-21: NYT reports Google AI wins gold at IMO 2025; OpenAI also claims gold-medal performance—first AI systems to clear this threshold [5][6][7]
2025-08-03: Xena Project publishes mathematical community round-up of AI performance at IMO 2025 [8]
2026-02-01: The Atlantic publishes longform profile on how Terence Tao uses AI in his research [10]
2026-05-20: OpenAI announces unreleased reasoning model disproved Erdős unit distance conjecture; Harmonic podcast on Aristotle and formal verification published [1][32][16]
2026-05-21: Gil Kalai acknowledges disproof as 'Amazing' on his blog; widespread media amplification including The Guardian [3][33][34][35]
2026-05-22: arXiv preprint 'Remarks on the disproof' appears; The Neuron analysis published; physics and science aggregators amplify [2][1][13][36]
2026-05-23: Reports that three Erdős problems fell within seven days with Tao verifying each; Gary Marcus publishes critical review of AI math headlines [11][17]
2026-05-24: Tao's GitHub wiki tracking AI contributions confirmed; Tao quoted on 'mass-produced mathematics at scale'; Scientific American covers Erdős breakthrough; AlphaProof Nexus mentioned [9][12][4][19]

Perspectives

OpenAI

A general-purpose reasoning model with no mathematical specialization disproved an 80-year-old open conjecture, and OpenAI is confirmed to have achieved gold-medal-level performance at IMO 2025—demonstrating mathematical discovery capability across both open-problem and competition domains.

Evolution: The narrative of OpenAI being 'under fire' as DeepMind claimed IMO gold has been complicated by evidence that OpenAI also achieved IMO 2025 gold [7][6], substantially strengthening its competitive position relative to the prior framing.

[1][2][4][6][7][18]

Google DeepMind

Achieved gold-medal performance at IMO 2025 with Gemini Deep Think, maintains a Lean-grounded theorem-proving architecture where every reasoning step is formally verified, and has introduced AlphaProof Nexus as a further development.

Evolution: IMO gold now confirmed across multiple outlets including NYT and the Xena Project mathematical-community round-up; AlphaProof Nexus is a newly mentioned system suggesting continued capability development.

[13][5][7][8][19][14][23]

Terence Tao

Actively curates AI contributions on his Erdős problems GitHub wiki and has been profiled on his AI workflow, but has also publicly expressed concern about 'more and more mass-produced mathematics at scale'—a quality warning alongside capability recognition.

Evolution: The 'mass-produced mathematics' quote [12] adds a critical dimension absent from earlier framing; Tao now occupies a nuanced dual position as engaged participant and quality skeptic rather than purely a validator.

[9][10][11][12][20]

Harmonic (Tudor Achim)

Formal verification—machine-checkable proofs in Lean—is the key epistemological shift; AI could reach the Riemann Hypothesis by 2028; near-term software and hardware applications are already within reach.

Evolution: Consistent and promotional; no new substantive claims beyond the founding thesis.

[16][24][25][26][27]

Gary Marcus (AI critic)

Explicitly checking whether AI math headlines match underlying results; skeptical of headline claims about OpenAI and Anthropic mathematical achievements.

Evolution: Consistent; remains the most prominent skeptical voice in an otherwise enthusiastic coverage landscape.

[17]

Rohan Paul (AI commentator)

AI formal math success operates in 'carefully constrained worlds'; now actively connects Tao's 'mass-produced mathematics' observation to an optimistic framing—AI turns proof-writing into a search problem over thousands of candidate lemmas filtered by cheap automated checkers.

Evolution: Now engages directly with Tao's quality concern [12], adding a constructive reframe to his previously constraint-only position.

[13][14][12]

Mathematical community (r/math, Hacker News, Xena Project, LessWrong)

Fragmented: r/math audits Tao's earlier predictions against 2026 reality; Xena Project provides formal-verification-community context for IMO results; LessWrong acknowledges both Google and OpenAI at gold; HN probes what Lean verification actually guarantees.

Evolution: Xena Project and LessWrong are newly prominent voices providing mathematical-community and rationalist-community framings; the UW erroneous Lean proofs dataset [15] has added technical evidence to HN's skeptical interrogation of formal verification.

[20][8][7][22][28][29][15]

Tensions

OpenAI's Erdős disproof came from a general-purpose model with no formal-system grounding, while DeepMind's architecture requires every reasoning step to be Lean-verified—these represent fundamentally different claims about what makes AI mathematical results trustworthy, and both organizations now hold IMO gold. [1][13][14][2][6][7]
Tao's observation of 'mass-produced mathematics at scale' [12] sits in tension with Rohan Paul's optimistic framing that AI proof search at scale is precisely the mechanism for mathematical progress: quality concern versus capability enablement, drawn from the same observation. [12][13]
The UW Math AI Lab's dataset of erroneous Lean proofs [15] challenges the 'every step formally verified' claim central to DeepMind's and Harmonic's architectures. If the verification layer itself produces false proofs, formal-system grounding may be less reliable than claimed. [15][13][16][14]
Gary Marcus is explicitly checking whether AI math headlines match underlying results [17], positioned against enthusiastic coverage in Scientific American, Ars Technica, and Simon Willison's blog, which have treated the OpenAI and DeepMind results as genuine breakthroughs. [17][1][4][23][30]
Tudor Achim predicts AI could prove the Riemann Hypothesis by 2028 [16], a timeline that sits in unresolved tension with the cautious, constraint-emphasizing framing of DeepMind's published research and the skeptical register of Gary Marcus's critique. [16][13][17]
Both OpenAI and DeepMind now claim IMO 2025 gold [6][7], raising the question of whether their results were comparably achieved (different approaches, different problems solved, or different scoring thresholds) and which represents the more epistemically meaningful demonstration. [6][7][1][13]

Sources

[1] 😸 OpenAI solved an 80-year math problem by... disproving it — The Neuron (2026-05-22)
[2] Remarks on the disproof of the unit distance conjecture - arXiv — reactive:openai-erdos-math-breakthrough
[3] Amazing: Erdős' Unit Distance Problem was Disproved! It was ... — reactive:openai-erdos-math-breakthrough
[4] OpenAI announces AI's biggest math breakthrough yet — reactive:openai-erdos-math-breakthrough
[5] Google A.I. System Wins Gold Medal in International Math Olympiad — reactive:ai-formal-math-breakthroughs
[6] OpenAI claims gold-medal performance at IMO 2025 | Hacker News — reactive:ai-formal-math-breakthroughs
[7] Google and OpenAI Get 2025 IMO Gold - LessWrong — reactive:ai-formal-math-breakthroughs
[8] AI at IMO 2025: a round-up - Xena Project - WordPress.com — reactive:ai-formal-math-breakthroughs
[9] AI contributions to Erdős problems · teorth/erdosproblems Wiki — reactive:ai-formal-math-breakthroughs
[10] The Edge of Mathematics - The Atlantic — reactive:ai-formal-math-breakthroughs
[11] Three Erdős Problems Fell in Seven Days, and Terence Tao Verified ... — reactive:ai-formal-math-breakthroughs
[12] “I do see more and more mass-produced mathematics at scale.” — Rohan Paul Twitter (2026-05-24)
[13] Google DeepMind's new paper. — Rohan Paul Twitter (2026-05-22)
[14] @tomflex @prz_chojecki Sure! DeepMind built AI agents that pair LLMs (for generating ideas) with the Lean theorem prover... — reactive:ai-formal-math-breakthroughs (2026-05-24)
[15] UW Math AI Lab Releases Erroneous Lean Proofs Dataset - LinkedIn — reactive:ai-formal-math-breakthroughs
[16] 😺 🎙️ PODCAST: Can AI Solve Math's Biggest Mystery? — The Neuron (2026-05-20)
[17] Checking the math behind OpenAI and Anthropic's latest headlines — reactive:ai-formal-math-breakthroughs
[18] Google Takes the Gold. OpenAI under fire. - YouTube — reactive:ai-formal-math-breakthroughs
[19] @tobiamure @Polymarket This is today's big news from Google DeepMind: their new AI agent (AlphaProof Nexus) autonomously... — reactive:ai-formal-math-breakthroughs (2026-05-24)
[20] Now that it's 2026, how is Terence Tao's prediction holding up? : r/math — reactive:openai-erdos-math-breakthrough
[21] Gemini Deep Think achieved Gold at IMO : r/singularity - Reddit — reactive:ai-formal-math-breakthroughs
[22] Gemini with Deep Think achieves gold-medal standard at the IMO | Hacker News — reactive:ai-formal-math-breakthroughs
[23] Gemini Deep Think learns math, wins gold medal at International Math Olympiad - Ars Technica — reactive:ai-formal-math-breakthroughs
[24] [PDF] Aristotle: IMO-level Automated Theorem Proving - arXiv — reactive:ai-formal-math-breakthroughs
[25] Aristotle from Harmonic just proved Erdos Problem #124 in Lean all ... — reactive:ai-formal-math-breakthroughs
[26] Harmonic — reactive:ai-formal-math-breakthroughs
[27] Harmonics Proves a Tough Mathematics Problem. — reactive:ai-formal-math-breakthroughs
[28] I would say that there is very little danger of a proof in Lean being ... — reactive:ai-formal-math-breakthroughs
[29] Thoughts on LEAN, the proof checker : r/math - Reddit — reactive:ai-formal-math-breakthroughs
[30] Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad — reactive:ai-formal-math-breakthroughs
[31] Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2 ... — reactive:ai-formal-math-breakthroughs
[32] OpenAI claims it solved an 80-year-old math problem - TechCrunch — reactive:ai-formal-math-breakthroughs
[33] OpenAI makes breakthrough on 80-year-old maths problem — reactive:openai-erdos-math-breakthrough
[34] 🚨 OPENAI MATH BREAKTHROUGH 🚨 — reactive:ai-formal-math-breakthroughs (2026-05-21)
[35] OpenAI's internal model disproves Unit Distance Conjecture of Erdos — reactive:openai-erdos-math-breakthrough
[36] AI makes a major breakthrough in a math problem that had stumped experts for decades — reactive:openai-erdos-math-breakthrough