Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

LessWrong (Curated) · Anders Cairns Woodruff · 2026-06-10

Redwood Research and collaborators measure frontier AI models' ability to complete tasks without chain-of-thought (CoT) reasoning. They find that GPT-5.5 handles ~3-minute human tasks at 50% reliability and that this capability has doubled roughly every year since 2019, raising concerns about the future effectiveness of CoT-based safety monitoring.

Appears in

Frontier AI Safety Evaluation: Scheming Research and Evaluation Standards

Extraction

Topics: chain-of-thought, ai-safety, model-evaluation, hidden-reasoning, capability-trends

Claims

GPT-5.5 can complete tasks requiring approximately 3 minutes of human effort at 50% success rate without any chain-of-thought reasoning.
No-CoT task completion capability has doubled approximately every 373 days since 2019, roughly half the rate of with-CoT capability growth since GPT-4.
Most capability gains since GPT-4 have come from externalized reasoning rather than single-forward-pass reasoning, causing with-CoT and no-CoT time horizons to diverge.
Extrapolating current trends, models could achieve the equivalent of 25 minutes of no-CoT reasoning by 2030, potentially enabling more complex subversion or scheming.
CoT monitoring as a safety technique may become substantially less effective in the next few years even without steganography.
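
The extrapolation in the claims above can be checked with a rough back-of-the-envelope calculation. The sketch below assumes pure exponential growth from a 3-minute baseline with a 373-day doubling time; the function names and the choice of baseline are illustrative assumptions, not the authors' fitting procedure.

```python
import math

DOUBLING_DAYS = 373.0     # doubling time quoted in the claims above
BASELINE_MINUTES = 3.0    # approximate GPT-5.5 no-CoT 50% time horizon


def horizon_after(days, baseline=BASELINE_MINUTES, doubling_days=DOUBLING_DAYS):
    """No-CoT time horizon (minutes) after `days`, assuming a constant doubling time."""
    return baseline * 2 ** (days / doubling_days)


def days_to_reach(target_minutes, baseline=BASELINE_MINUTES, doubling_days=DOUBLING_DAYS):
    """Days until the horizon reaches `target_minutes` under the same assumption."""
    return doubling_days * math.log2(target_minutes / baseline)


if __name__ == "__main__":
    days = days_to_reach(25.0)
    print(f"doublings needed: {days / DOUBLING_DAYS:.2f}")
    print(f"days needed:      {days:.0f}")
```

Under these assumptions, going from 3 to 25 minutes takes about 3.1 doublings, or roughly 1,140 days (a little over three years) from whenever the 3-minute baseline was measured. The summary does not state that measurement date, so the exact calendar year this implies is left open here.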

Key quotes

Three minutes of human-equivalent reasoning might be enough for some single-shot decisions: whether to comply with an instruction, whether a scenario looks like an eval, or which of several visible options to take.

Even without neuralese, CoT monitoring might become substantially less effective in the next few years.

We suggest that AI companies start to track no-CoT THs explicitly to find a lower bound on how much reasoning a model could do without revealing it to a CoT monitor or human.