The Information Machine

GLM 5.2 just took the top spot on PostTrainBench by scoring 34.29%.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-26

GLM 5.2 achieves the highest agent score of 34.29% on PostTrainBench, a benchmark that measures whether an AI agent can autonomously improve four base language models using one H100 GPU within ten hours.

Extraction

Topics: llm-benchmarks, ai-agents, post-training, glm, automated-machine-learning

Claims

  • GLM 5.2 scored 34.29% on PostTrainBench, the highest score among agent models tested.
  • PostTrainBench evaluates an AI agent's ability to select training data, write training code, run fine-tuning experiments, and submit improved models within a 10-hour, single-GPU constraint.
  • Official human-built instruct models score 51.14% on PostTrainBench, 16.85 percentage points above GLM 5.2, which shows how far agent-controlled pipelines still trail human-built ones.
  • GLM 5.2 was evaluated as the orchestrating agent rather than the model being trained, so its score reflects training automation ability, not raw question-answering performance.
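
As a rough illustration of the gap between the two scores reported in the claims above, the figures can be compared directly. The function name and the choice to report both an absolute and a relative figure are assumptions made for this sketch; they are not part of PostTrainBench itself.

```python
def score_gap(agent_score: float, reference_score: float) -> tuple[float, float]:
    """Compare an agent score with a reference score (both in percent).

    Returns (gap in percentage points, agent score as a fraction of the reference).
    """
    gap_points = round(reference_score - agent_score, 2)
    fraction_of_reference = agent_score / reference_score
    return gap_points, fraction_of_reference


# Scores as reported in the source: GLM 5.2 as agent vs. official instruct models.
gap, fraction = score_gap(agent_score=34.29, reference_score=51.14)
print(f"Gap: {gap} percentage points")
print(f"Agent reaches {fraction:.1%} of the instruct-model score")
```

On these numbers the gap is 16.85 percentage points, so the agent-trained models reach roughly two thirds of the score of the official instruct models.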

Key quotes

PostTrainBench tests whether an AI agent can take a raw LLM and make it better by actually training it, not by answering the benchmark questions itself.
The gap to official instruct models, which score 51.14%, still shows how far agents are from mature post-training pipelines built with more data, compute, and human tuning.