Introducing LifeSciBench

OpenAI Blog · 2026-06-17

OpenAI introduces LifeSciBench, an expert-written benchmark of 750 tasks spanning seven life science research workflows, built with 173 PhD-level scientists. The frontier model GPT-Rosalind achieves an overall pass rate of only 36.1% on its realistic pharmaceutical and biotech research tasks.

Appears in

OpenAI GPT-Rosalind: Specialized Biology Model with Biodefense Gating

Extraction

Topics: ai-benchmarks, life-sciences-ai, drug-discovery, scientific-ai, model-evaluation

Claims

Current AI benchmarks for life sciences fail to capture the complexity of real pharmaceutical and biotech research, focusing instead on narrow or isolated skills with clean reference answers.
GPT-Rosalind achieves a 36.1% overall pass rate on LifeSciBench, compared with 25.7% for GPT-5.5, a gain of 10.4 percentage points that indicates meaningful but still limited progress.
Frontier models perform substantially worse on artifact-heavy tasks, with pass rates dropping from 45.1% on text-only tasks to 28.1% on tasks involving figures, PDFs, or sequence files.
Models show the strongest relative improvement in scientific communication and bench-to-bedside translation tasks, suggesting frontier models are advancing fastest in structured synthesis.
Exact-answer tasks requiring numeric calculations, sequences, or molecular structures remain a major weakness: GPT-Rosalind achieves a pass rate of only 14.8% on numeric tasks.
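
The gaps implied by the reported pass rates can be checked with a little arithmetic. The sketch below is illustrative only: the helper name `gap` and the dictionary keys are made up here, and the only inputs are the percentages quoted in the claims above. It reports each difference both in percentage points and relative to a chosen baseline.

```python
# Pass rates (in percent) exactly as quoted in the claims above.
reported = {
    "GPT-Rosalind overall": 36.1,
    "GPT-5.5 overall": 25.7,
    "text-only tasks": 45.1,
    "artifact-heavy tasks": 28.1,
}


def gap(value: float, baseline: float) -> tuple[float, float]:
    """Return (percentage-point difference, relative change in percent) versus baseline."""
    points = value - baseline
    relative = points / baseline * 100
    return round(points, 1), round(relative, 1)


# GPT-Rosalind measured against GPT-5.5 as the baseline.
model_gap = gap(reported["GPT-Rosalind overall"], reported["GPT-5.5 overall"])

# Artifact-heavy tasks measured against text-only tasks as the baseline.
artifact_gap = gap(reported["artifact-heavy tasks"], reported["text-only tasks"])

print(model_gap)     # (10.4, 40.5)
print(artifact_gap)  # (-17.0, -37.7)
```

Read this way, GPT-Rosalind's overall pass rate is about 40% higher than GPT-5.5's in relative terms, while the pass rate on artifact-heavy tasks is a bit more than a third lower than on text-only tasks.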

Key quotes

Current benchmarks do not fully capture these capabilities. Many life science evaluations focus on narrow domains or isolated skills, resulting in questions with structured question formats and clean reference answers.

Strong performance on LifeSciBench should therefore be interpreted as evidence of realistic task-level capability, not as a direct measure of downstream research impact.

In roughly 14% of tasks, models earned substantial rubric credit despite failing the exact-pass threshold... models may identify relevant evidence or produce a plausible partial answer, but still fail because they miss a key constraint, use the wrong evidence, make an incomplete calculation, or do not connect their reasoning to a scientifically useful final decision.
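
The distinction in the last quote, between passing a task and earning substantial rubric credit without passing, can be sketched as follows. This is a hypothetical illustration, not LifeSciBench's scoring code: the source does not state the exact-pass threshold or what counts as "substantial" credit, so both cutoffs, the `TaskResult` type, and the toy scores below are assumptions chosen only to show the bookkeeping.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only for illustration; the source does not
# give the real exact-pass threshold or a definition of "substantial" credit.
PASS_THRESHOLD = 0.9
SUBSTANTIAL_CREDIT = 0.5


@dataclass
class TaskResult:
    task_id: str
    rubric_score: float  # fraction of rubric credit earned, in [0, 1]


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Split tasks into passes and near misses (substantial credit but no pass)."""
    if not results:
        raise ValueError("need at least one task result")
    total = len(results)
    passed = sum(r.rubric_score >= PASS_THRESHOLD for r in results)
    near_miss = sum(
        SUBSTANTIAL_CREDIT <= r.rubric_score < PASS_THRESHOLD for r in results
    )
    return {"pass_rate": passed / total, "near_miss_rate": near_miss / total}


# Toy scores invented for this example; they are not benchmark data.
toy_results = [
    TaskResult("t1", 1.0),
    TaskResult("t2", 0.7),
    TaskResult("t3", 0.2),
    TaskResult("t4", 0.95),
]
summary = summarize(toy_results)
print(summary)  # {'pass_rate': 0.5, 'near_miss_rate': 0.25}
```

Under this kind of scoring, a headline pass rate hides the near-miss bucket, which is why the quoted figure of roughly 14% is reported separately.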