GLM 5.2 just took the top spot on PostTrainBench by scoring 34.29%.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-26
GLM 5.2 achieves the highest agent score of 34.29% on PostTrainBench, a benchmark that measures whether an AI agent can autonomously improve four base language models using one H100 GPU within ten hours.
Extraction
Topics: llm-benchmarks, ai-agents, post-training, glm, automated-machine-learning
Claims
- GLM 5.2 scored 34.29% on PostTrainBench, the highest score among agent models tested.
- PostTrainBench evaluates an AI agent's ability to select training data, write training code, run fine-tuning experiments, and submit improved models within a 10-hour, single-GPU constraint.
- Official human-built instruct models score 51.14% on PostTrainBench, 16.85 percentage points above GLM 5.2, so agent-run pipelines still trail human-built ones by a wide margin.
- GLM 5.2 was evaluated as the orchestrating agent, not as the model being trained; its score reflects training automation ability, not raw question-answering performance.
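The size of the gap between the two reported scores can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two percentages quoted in the post; the constant names and the `score_gap` helper are made up for illustration and are not part of PostTrainBench.

```python
# Scores reported in the post (percent on PostTrainBench).
AGENT_SCORE = 34.29      # GLM 5.2 acting as the post-training agent
INSTRUCT_SCORE = 51.14   # official human-built instruct models


def score_gap(agent: float, reference: float) -> tuple[float, float]:
    """Return (gap in percentage points, agent score as a fraction of the reference).

    Hypothetical helper for illustration only.
    """
    return reference - agent, agent / reference


gap_points, fraction = score_gap(AGENT_SCORE, INSTRUCT_SCORE)
print(f"gap: {gap_points:.2f} points; agent reaches {fraction:.1%} of the instruct score")
```

On these figures the gap is about 16.85 percentage points, and the agent-trained models reach roughly two thirds of the score of the official instruct models.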
Key quotes
PostTrainBench tests whether an AI agent can take a raw LLM and make it better by actually training it, not by answering the benchmark questions itself.
The gap to official instruct models, which score 51.14%, still shows how far agents are from mature post-training pipelines built with more data, compute, and human tuning.