The Information Machine

Latest open artifacts (#21): Open model bonanza! Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others. On CAISI's V4 assessment.

Interconnects · Florian Brand · 2026-05-16

This issue of the Interconnects newsletter surveys the May 2026 wave of open-weight AI model releases, including DeepSeek V4, Gemma 4, Kimi K2.6, MiMo-V2.5-Pro, and GLM-5.1, and critically examines a CAISI report claiming that open models are falling further behind closed frontier models.

Extraction

Topics: open-source-llms, model-benchmarks, open-closed-capability-gap, model-releases

Claims

  • CAISI's evaluation concludes that open models lag closed frontier models and the capability gap is widening, based on an IRT-derived Elo score across nine benchmarks.
  • CAISI's methodology likely exaggerates the gap because benchmarks use simple bash-loop setups rather than the agentic harnesses—such as Claude Code—that frontier models are actually trained to use.
  • Epoch AI's ECI metric shows a more stable 3–7 month capability lag between open and closed models since R1, painting a less alarming picture than CAISI's Elo scores.
  • DeepSeek V4 Flash outperforms V4 Pro relative to their respective sizes, making Flash the more noteworthy of the two new DeepSeek releases.
  • Multiple new open models—including Gemma 4 (now Apache 2.0), Kimi K2.6, and MiMo-V2.5-Pro—are competitive with other flagship open models across benchmarks and real-world use.
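
For readers unfamiliar with how an "IRT-derived Elo score" can come out of benchmark results, here is a minimal sketch. It assumes the simplest item response theory model, the one-parameter (Rasch) model, fitted by plain gradient ascent. The summary does not say which IRT variant or fitting procedure CAISI used, so this should not be read as their method. The names `fit_rasch` and `to_elo`, the anchor of 1000, and the toy results matrix are all hypothetical. The only fixed fact is the scale conversion: one logit of ability difference corresponds to 400 / ln(10), roughly 173.7, Elo points.

```python
import math

# One logit of ability difference equals this many Elo points, because
# Elo uses a base-10 logistic with a 400-point scale.
ELO_PER_LOGIT = 400 / math.log(10)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(results, steps=2000, lr=0.05, l2=0.01):
    """Fit a Rasch model: P(model i solves item j) = sigmoid(theta[i] - b[j]).

    results[i][j] is 1 if model i solved item j, else 0.
    A small L2 penalty (an assumption of this sketch) keeps the estimates
    finite for items that every model or no model solved.
    """
    n_models, n_items = len(results), len(results[0])
    theta = [0.0] * n_models  # model abilities, in logits
    b = [0.0] * n_items       # item difficulties, in logits
    for _ in range(steps):
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for i in range(n_models):
            for j in range(n_items):
                err = results[i][j] - sigmoid(theta[i] - b[j])
                g_theta[i] += err  # solving more than predicted raises ability
                g_b[j] -= err      # being solved more than predicted lowers difficulty
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    return theta, b


def to_elo(theta, anchor=1000.0):
    """Map logit abilities to an Elo-like scale centred on `anchor`."""
    mean = sum(theta) / len(theta)
    return [anchor + ELO_PER_LOGIT * (t - mean) for t in theta]


# Toy data (made up): three models, four benchmark items.
results = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
theta, difficulty = fit_rasch(results)
elo = to_elo(theta)
```

Only differences between these scores are meaningful; the anchor is arbitrary. A score built this way also inherits whatever the underlying pass/fail results measure, so if a benchmark setup under-elicits some models, the resulting Elo gap reflects that.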

Key quotes

  • "Coding tasks are evaluated using access to bash and a for-loop with a fixed budget of tokens, not with a harness such as Claude Code or OpenCode, which models are trained in!"
  • "An interesting dynamic within Interconnects is that Florian believes more in the proximity of open frontier models to closed alternatives in true performance. Nathan thinks the benchmarks are imperfect as well, but thinks the closed models are ahead by more."
  • "A frontier comparison of open and closed models would also need to elicit the capabilities of all models better, which means the usage of the preferred harnesses, as well as model-specific prompting."