Introducing GeneBench-Pro

OpenAI Blog · 2026-06-30

OpenAI introduces GeneBench-Pro, a synthetic benchmark of 129 problems that tests AI agents on judgment-heavy computational biology tasks. GPT-5.6 Sol achieves a 31.5% pass rate, versus below 5% for GPT-5; human experts are estimated to need 20-40 hours per problem at ~$200/hour.

Appears in

OpenAI Launches GeneBench-Pro: Expert-Level Genomics Benchmark for Frontier AI

Extraction

Topics: ai-benchmarks, computational-biology, scientific-reasoning, llm-evaluation, test-time-compute

Claims

GeneBench-Pro contains 129 synthetically generated computational biology problems requiring higher-order judgment on data quality, analysis path selection, and iterative assumption revision.
GPT-5.6 Sol achieves a 31.5% pass rate with Pro mode enabled, compared to below 5% for GPT-5, indicating rapid frontier capability progress on scientific reasoning.
Scaling test-time compute yields large returns: GPT-5.6 Sol at its highest reasoning level solves nearly six times as many problems as GPT-5.2 while using approximately two-thirds the tokens.
Open-source models trail GPT models by a significantly wider margin on GeneBench-Pro than on coding benchmarks, suggesting that GPT models have general scientific reasoning ability that extends beyond coding specialization.
Human experts require approximately 20-40 hours per problem at ~$200/hour while AI inference costs only several dollars per problem, creating a large economic opportunity even at partial reliability.
Synthetic data generation with fully known causal structures enables deterministic grading and prevents analytic shortcuts or arbitrary answer-path ambiguities that plague other long-horizon biology benchmarks.
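
The cost claim above rests on simple arithmetic, which the sketch below works through. The hours, hourly rate, and problem count are the figures quoted in the claims. The AI inference cost is only given as "several dollars", so `ASSUMED_AI_COST_USD` is a hypothetical placeholder rather than a reported number, and the function names are illustrative.

```python
# Figures quoted in the claims above.
HOURS_PER_PROBLEM = (20, 40)  # low and high estimate of expert hours per problem
HOURLY_RATE_USD = 200         # approximate expert hourly rate
NUM_PROBLEMS = 129            # problems in the benchmark

# ASSUMPTION: the source says only "several dollars per problem".
# 5.0 is a hypothetical placeholder, not a reported figure.
ASSUMED_AI_COST_USD = 5.0


def human_cost_range(hours=HOURS_PER_PROBLEM, rate=HOURLY_RATE_USD):
    """Return the (low, high) expert cost in USD for one problem."""
    low_hours, high_hours = hours
    return low_hours * rate, high_hours * rate


def cost_ratio(ai_cost_usd, hours=HOURS_PER_PROBLEM, rate=HOURLY_RATE_USD):
    """Return the (low, high) ratio of human cost to AI cost per problem."""
    low, high = human_cost_range(hours, rate)
    return low / ai_cost_usd, high / ai_cost_usd


if __name__ == "__main__":
    low, high = human_cost_range()
    print(f"Per problem: ${low:,} to ${high:,}")
    # prints "Per problem: $4,000 to $8,000"
    print(f"Full benchmark: ${low * NUM_PROBLEMS:,} to ${high * NUM_PROBLEMS:,}")
    # prints "Full benchmark: $516,000 to $1,032,000"
    r_low, r_high = cost_ratio(ASSUMED_AI_COST_USD)
    print(f"Human/AI cost ratio at the assumed AI cost: {r_low:.0f}x to {r_high:.0f}x")
```

These are costs per attempt, not per solved problem: at a 31.5% pass rate most attempts fail, so the ratio overstates the saving unless failures are cheap to detect.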

Key quotes

The problems I reviewed would have been challenging for a graduate student to complete without iterated feedback from an experienced supervisor.

At the current pace, this benchmark may be saturated by the end of the year.

Models can make partial progress on challenging problems, but they struggle to close the inferential loop. This failure pattern mirrors the contrast between human experts and novices.