Rapid AI Benchmark Improvement: Small Models and New Entrants Closing Capability Gaps
What
Two Chinese AI labs released models in mid-June 2026 that have drawn attention for closing parts of the gap to closed frontier systems.
GLM-5.2, from Zhipu AI, scores near Claude Opus 4.7 on traditional benchmarks and has been called the strongest available open-weights model [1]; on ARC-AGI-2 it reached 22.8% at $0.25 per task, compared to GPT-5.5's leading 85% [4].
VibeThinker-3B, from Weibo AI, reports 94.3 on AIME26 and 80.2 Pass@1 on LiveCodeBench v6 using only 3 billion parameters—numbers that approach Claude Opus 4.5 on reasoning benchmarks [5].
Both releases are contested. The strongest analytical case against GLM-5.2 is that it appears heavily distilled from Claude Opus and likely overperforms on benchmarks relative to how well it generalizes [1]. VibeThinker-3B's claim of frontier-level reasoning from a tiny model has not yet been independently stress-tested beyond verifiable math and coding tasks.
Why it matters
If post-training techniques such as supervised fine-tuning combined with group relative policy optimization (SFT+GRPO) can compress frontier-adjacent performance into a 3B model, the cost of capable AI on narrow tasks drops materially. The distillation question is the key caveat: models trained to mimic closed-source outputs tend to generalize less well to tasks outside typical benchmark distributions, which limits what the benchmark numbers actually promise.
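One way to make the cost argument concrete is a back-of-the-envelope estimate of how much memory the weights alone would need. The sketch below is illustrative only: it assumes 16-bit weights (2 bytes per parameter), ignores activations and KV cache, and uses the unverified "1T+" parameter figure that source [7] attributes to Opus 4.5. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights only, in decimal gigabytes.

    Assumes every parameter is stored at the same precision
    (2 bytes = fp16/bf16 by default). Activations, KV cache and
    runtime overhead are deliberately left out.
    """
    return num_params * bytes_per_param / 1e9


# 3B parameters, as reported for VibeThinker-3B
small_gb = weight_memory_gb(3e9)

# 1T parameters: the unverified lower bound claimed in source [7]
large_gb = weight_memory_gb(1e12)

print(f"3B model weights: ~{small_gb:.0f} GB")
print(f"1T model weights: ~{large_gb:.0f} GB")
print(f"ratio: ~{large_gb / small_gb:.0f}x")
```

Under these assumptions the 3B model's weights fit on a single consumer GPU, while a trillion-parameter model would need a multi-accelerator deployment. That gap is the sense in which cost "drops materially", provided the benchmark performance carries over to the task at hand.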
Open questions
Is GLM-5.2's benchmark performance a product of distillation from Claude Opus, and if so, how does it perform on tasks that fall outside typical training distributions? [1]
VibeThinker-3B's AIME26 and LeetCode numbers are striking for its size [5]—do they hold up on genuinely novel reasoning problems not covered by its training data?
ARC-AGI-2 went from a best verified score of 3.0% in May 2025 to GPT-5.5 at 85% by June 2026 [4]—is the benchmark still a meaningful discriminator between frontier and sub-frontier models?
Does GLM-5.2's commercial case hold if it is not cheap enough for bulk tasks and not strong enough for the hardest tasks, relative to closed models at comparable cost? [1]
Narrative
Two Chinese AI labs released notable models in mid-June 2026. GLM-5.2, from Zhipu AI, quickly circulated as the strongest available open-weights language model, scoring near Claude Opus 4.7 on traditional benchmarks [1]. On coding evaluations, some observers reported it outperforming GPT-5.5 [2], and the DeepSWE software engineering benchmark was updated to include it alongside refreshed results for other models [3]. On ARC-AGI-2, a benchmark of abstract visual reasoning, GLM-5.2 reached 22.8% at $0.25 per task [4]. The best verified score on ARC-AGI-2 was only 3.0% as recently as May 2025, and GPT-5.5 currently leads the leaderboard at 85% [4], so GLM-5.2 sits well below the frontier on that dimension even as it leads among open-weights models elsewhere.
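The ARC-AGI-2 figures can be put side by side with a little arithmetic. The sketch below uses only the numbers reported in [4]. The "cost per solved task" metric is an added assumption: it treats the $0.25 as a flat cost per attempted task and divides by the solve rate. No cost figure is given for GPT-5.5, so no cost comparison is attempted.

```python
# Figures as reported in source [4]
GLM_SCORE = 0.228          # GLM-5.2 solve rate on ARC-AGI-2
GLM_COST_PER_TASK = 0.25   # USD per attempted task
LEADER_SCORE = 0.85        # GPT-5.5, current leaderboard leader
MAY_2025_BEST = 0.030      # best verified score in May 2025


def cost_per_solved_task(cost_per_task, solve_rate):
    """Expected spend per solved task, assuming a flat per-attempt cost."""
    if solve_rate <= 0:
        raise ValueError("solve_rate must be positive")
    return cost_per_task / solve_rate


gap_points = (LEADER_SCORE - GLM_SCORE) * 100
glm_vs_2025 = GLM_SCORE / MAY_2025_BEST
leader_vs_2025 = LEADER_SCORE / MAY_2025_BEST

print(f"GLM-5.2 cost per solved task: ${cost_per_solved_task(GLM_COST_PER_TASK, GLM_SCORE):.2f}")
print(f"Gap to leader: {gap_points:.1f} percentage points")
print(f"GLM-5.2 vs May 2025 best: {glm_vs_2025:.1f}x")
print(f"GPT-5.5 vs May 2025 best: {leader_vs_2025:.1f}x")
```

On these numbers GLM-5.2 spends roughly $1.10 per solved task and trails the leader by about 62 percentage points, while still scoring several times the May 2025 best.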
VibeThinker-3B, released by Weibo AI (Sina Weibo's AI subsidiary), is a dense 3-billion-parameter model built on Qwen2.5-Coder-3B using a post-training pipeline the authors call Spectrum-to-Signal, which combines supervised fine-tuning (SFT) and group relative policy optimization (GRPO) [5][6]. Its reported benchmarks are 94.3 on AIME26, 80.2 Pass@1 on LiveCodeBench v6, and 96.1% acceptance on recent unseen LeetCode problems [5]. These numbers approach Claude Opus 4.5, a model that one post puts at more than 1 trillion parameters, over 300 times the size of VibeThinker-3B [7]. The model is MIT-licensed [8], and its paper appeared on arXiv alongside coverage from Marktechpost and others [9][10]. The central claim is that targeted post-training on verifiable reasoning tasks can embed performance normally associated with much larger models into a 3B footprint.
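For readers unfamiliar with the terms, the sketch below illustrates two generic ideas behind the paragraph above: the group-relative advantage at the core of GRPO, and the standard unbiased pass@k estimator, of which Pass@1 is the k=1 case. It is not VibeThinker-3B's implementation. The reward values are hypothetical, the helper names are made up for this example, and using the population standard deviation is an assumption (implementations differ on that detail).

```python
from math import comb
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer relative to its own group.

    GRPO samples several answers per prompt and normalizes each
    reward by the group's mean and standard deviation, so no
    separate value model is needed. eps avoids division by zero
    when every answer in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical verifier outcomes (1.0 = correct, 0.0 = incorrect)
# for 8 answers sampled for a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]

advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])

# With binary rewards, pass@1 is simply the fraction of correct samples.
print(pass_at_k(n=len(rewards), c=int(sum(rewards)), k=1))
```

Correct answers receive positive advantages and incorrect ones negative, which is what pushes the policy toward answers that a verifier accepts. This is also why the approach fits math and coding tasks, where correctness can be checked automatically.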
The most detailed published analysis of GLM-5.2 comes from Zvi Mowshowitz, who credits it as a genuine advance while arguing it is very likely heavily distilled from Claude Opus—pointing to its strong tendency to self-identify as Claude and its use of a Claude harness. He notes that distilled models typically overperform on benchmarks and common tasks while underperforming on less common ones, and concludes the model occupies an awkward commercial position: not cheap enough for bulk tasks and not strong enough for the hardest tasks relative to closed alternatives at comparable cost [1]. A separate observer who typically distrusts benchmark numbers reported finding GLM-5.2 genuinely impressive in direct use [11], and Julian Goldie SEO claimed it outperformed Opus 4.8 in a live build-off [12]—though he separately documented cases where top-benchmark models performed worst in his own real-world tests [13].
The broader debate running through both releases is whether benchmark numbers, especially from models that may be distilled, reflect the kind of capability that matters in practice. Thyago Liberalli warned against evaluating any model on a single benchmark [14], while a retweeted post from bdsqlsz said the DeepSWE benchmark results matched the author's own experience [15][16]. The tension between benchmark standings and real-world task performance is the consistent undercurrent across both discussions.
Timeline
- 2025-05: Best verified models on ARC-AGI-2 scored only 3.0%. [4]
- 2026-06-18: Weibo AI releases VibeThinker-3B, a 3B dense reasoning model built on Qwen2.5-Coder-3B using the Spectrum-to-Signal SFT+GRPO post-training pipeline. [19][20]
- 2026-06-18: Vals AI publishes full benchmark results for GLM-5.2, reported as showing strong coding and reasoning performance. [21][22]
- 2026-06-19: VibeThinker-3B paper posted to arXiv; Marktechpost covers benchmark claims. [9][10]
- 2026-06-19: Zixuan Li claims GLM-5.2 delivers a leap in app development and long-horizon tasks. [17]
- 2026-06-21: DeepSWE benchmark updated to include GLM-5.2 results alongside updated scores for other models. [3]
- 2026-06-22: Zvi Mowshowitz publishes analysis calling GLM-5.2 the new best open model but arguing it is heavily distilled from Claude Opus and commercially awkward. [1]
- 2026-06-22: Julian Goldie SEO claims GLM-5.2 outperformed Opus 4.8 in a live coding build-off. [12]
- 2026-06-23: A normally benchmark-skeptical observer (bendee983) reports finding GLM-5.2 genuinely impressive in hands-on use; comparison to R1's moment circulates. [11][23]
- 2026-06-24: GLM-5.2 measured at 22.8% on ARC-AGI-2 at $0.25 per task, versus GPT-5.5's leading 85% and a May 2025 frontier of 3.0%. [4]
- 2026-06-24: Rohan Paul reports VibeThinker-3B achieves 94.3 on AIME26 and 80.2 on LiveCodeBench v6, nearly matching Opus 4.5 on reasoning. [5]
Perspectives
Zvi Mowshowitz
GLM-5.2 is the best available open-weights model and a real achievement, but is very likely heavily distilled from Claude Opus, tends to overperform on benchmarks relative to generalization, and occupies a commercially awkward niche below closed frontier models.
Evolution: Consistent analytical skepticism; credits the release while systematically narrowing the scope of the claim.
Rohan Paul
Both GLM-5.2 and VibeThinker-3B represent meaningful advances; presents benchmark data without strong editorial endorsement or dismissal.
Evolution: Neutral-analytical across both releases.
Zixuan Li (Zhipu AI / GLM team)
GLM-5.2 delivers a meaningful improvement in app development and long-horizon task capabilities.
Evolution: Promotional stance consistent with lab affiliation.
bendee983
Normally skeptical of AI benchmarks but finds GLM-5.2 genuinely impressive, including in direct use beyond the numbers.
Evolution: A skeptic who updated positively after hands-on testing.
Julian Goldie SEO
GLM-5.2 outperformed Opus 4.8 in a live build-off, but separately argues top-benchmark models often come last in real-world tests.
Evolution: Enthusiastic on GLM-5.2 in head-to-head comparison while skeptical of benchmark rankings as a general guide.
Thyago Liberalli
Evaluating models on a single benchmark is a significant error; performance must be assessed across multiple dimensions.
Evolution: Consistent skepticism of single-metric evaluation.
Tensions
- Zvi Mowshowitz argues GLM-5.2 is heavily distilled from Claude Opus and likely generalizes poorly outside benchmark-like tasks; bendee983 and Julian Goldie treat its performance as reflecting genuine capability after direct testing. [1][11][12]
- Mowshowitz calls GLM-5.2 the best open-weights model but simultaneously argues it is not cheap enough for bulk tasks and not strong enough for hard tasks, making the 'best open model' label of limited practical value. [1]
- Julian Goldie SEO found GLM-5.2 superior to Opus 4.8 in a live build-off, but in a separate test found that the 'best' benchmark model came last in his own real-world evaluation—pointing to instability between leaderboard rank and task-specific performance. [12][13]
- VibeThinker-3B's benchmark numbers approach Claude Opus 4.5 on math and coding, but whether a 3B model can match a vastly larger one on tasks outside its training distribution, as opposed to benchmark-style problems, remains untested. [5][7]
- ARC-AGI-2's rapid progression from 3.0% (May 2025) to GPT-5.5 at 85% (June 2026) invites the question of whether the benchmark still distinguishes meaningfully at the top, while GLM-5.2's 22.8% shows a persistent large gap to the current leader. [4]
Status: active and growing
Sources
- [1] GLM-5.2 Is The New Best Open Model — Zvi's AI Roundups (2026-06-22)
- [2] 🚨 GLM 5.2 OUTPERFORMS GPT-5.5 IN CODING BENCHMARKS — reactive:ai-benchmark-race (2026-06-22)
- [3] DeepSWE Benchmark updated with GLM 5.2 and updated results for other models — reactive:ai-benchmark-race (2026-06-21)
- [4] GLM-5.2 got 22.8% on ARC-AGI-2, $0.25/task — Rohan Paul Twitter (2026-06-24)
- [5] VibeThinker is a 3B param model, with almost head to head benchmark result with Opus 4.5 on reasoning with novel SFT+GRP… — Rohan Paul Twitter (2026-06-24)
- [6] VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline ... — reactive:ai-benchmark-race (2026-06-21)
- [7] 1/ A 3 billion parameter model just beat Opus 4.5 (1T+ params) on math reasoning. — reactive:ai-benchmark-race (2026-06-23)
- [8] RT @Marktechpost: 🔥 VibeThinker-3B is a 3B open-source (MIT) reasoning model that reaches the band of systems hundreds o... — reactive:ai-benchmark-race (2026-06-19)
- [9] [2606.16140] VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models — reactive:ai-benchmark-race
- [10] VibeThinker-3B: A 3B Dense Reasoning Model Built on ... — reactive:ai-benchmark-race
- [11] RT @bendee983: I'm usually very skeptical of AI benchmarks, but GLM-5.2 is really impressive. Aside from the benchmark r... — reactive:ai-benchmark-race (2026-06-23)
- [12] GLM 5.2 JUST HUMILIATED OPUS 4.8 IN A LIVE BUILD-OFF — reactive:ai-benchmark-race (2026-06-19)
- [13] THE “BEST” AI MODEL CAME LAST IN MY REAL-WORLD TEST. — reactive:ai-benchmark-race (2026-06-24)
- [14] One of the biggest mistakes people make when evaluating LLMs is looking at a single benchmark and assuming it tells the ... — reactive:ai-benchmark-race (2026-06-23)
- [15] RT @bdsqlsz: Deepswe's benchmark results are my own experience. — reactive:ai-benchmark-race (2026-06-24)
- [16] RT @bdsqlsz: Deepswe's benchmark results are my own experience. — reactive:ai-benchmark-race (2026-06-24)
- [17] GLM-5.2 delivers a substantial leap in app development capabilities, which also represent demanding long-horizon tasks. — reactive:ai-benchmark-race (2026-06-19)
- [18] RT @bendee983: I'm usually very skeptical of AI benchmarks, but GLM-5.2 is really impressive. Aside from the benchmark r... — reactive:ai-benchmark-race (2026-06-23)
- [19] RT @WeiboLLM: ⭐ VibeThinker-3B is released — a dense 3B model for frontier-level verifiable reasoning. — reactive:ai-benchmark-race (2026-06-18)
- [20] RT @ModelScope2022: Meet VibeThinker-3B, a 3B reasoning model from Weibo AI focused on math, coding, and STEM reasoning.... — reactive:ai-benchmark-race (2026-06-18)
- [21] RT @ValsAI: Full results for GLM 5.2 are here! — reactive:ai-benchmark-race (2026-06-18)
- [22] RT @ValsAI: Full results for GLM 5.2 are here! — reactive:ai-benchmark-race (2026-06-18)
- [23] @AndrewCurran_ What GLM 5.2 did to the best of the best is as big of a breakthrough as the R1 improvement purely from th... — reactive:ai-benchmark-race (2026-06-23)