GLM 5.2 just took the top spot on PostTrainBench by scoring 34.29%.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-26

GLM 5.2 achieves the highest agent score of 34.29% on PostTrainBench, a benchmark that measures whether an AI agent can autonomously improve four base language models using one H100 GPU within ten hours.

Extraction

Topics: llm-benchmarks, ai-agents, post-training, glm, automated-machine-learning

Claims

GLM 5.2 scored 34.29% on PostTrainBench, the highest score among agent models tested.
PostTrainBench evaluates an AI agent's ability to select training data, write training code, run fine-tuning experiments, and submit improved models within a 10-hour, single-GPU constraint.
Official human-built instruct models score 51.14% on PostTrainBench, well above the 34.29% reached by the best agent-run pipeline.
GLM 5.2 was evaluated as the orchestrating agent rather than as the model being trained, so its score reflects training automation ability, not raw question-answering performance.
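
The gap between the two scores quoted in these claims can be worked out directly. The sketch below is only illustrative arithmetic on the two reported figures; the constant and function names (`AGENT_SCORE`, `INSTRUCT_SCORE`, `gap_summary`) are hypothetical and are not part of PostTrainBench.

```python
# Scores quoted in the claims above, in percent.
AGENT_SCORE = 34.29      # GLM 5.2 acting as the orchestrating agent
INSTRUCT_SCORE = 51.14   # official human-built instruct models


def gap_summary(agent: float, reference: float) -> tuple[float, float]:
    """Return (gap in percentage points, agent score as a fraction of the reference)."""
    return round(reference - agent, 2), round(agent / reference, 3)


points, ratio = gap_summary(AGENT_SCORE, INSTRUCT_SCORE)
print(f"gap: {points} points; agent reaches {ratio:.1%} of the instruct score")
```

On these figures the agent trails by 16.85 percentage points, reaching roughly two thirds of the instruct-model score.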

Key quotes

PostTrainBench tests whether an AI agent can take a raw LLM and make it better by actually training it, not by answering the benchmark questions itself.

The gap to official instruct models, which score 51.14%, still shows how far agents are from mature post-training pipelines built with more data, compute, and human tuning.