Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest.

Rohan Paul (@rohanpaul_ai) · Twitter · 2026-06-11

A new benchmark called "Agents' Last Exam" tests AI agents on real expert-level work and finds that even the strongest frontier agents consistently fail, exposing a large gap between benchmark scores and practical automation readiness.

Appears in

AI Agents Underperform Real-World Tasks: CAPTCHAs, Expert Benchmarks, and Memory Quality Failures

Extraction

Topics: ai-agents, benchmarks, real-world-automation, frontier-models

Claims

Current frontier AI agents perform far worse on real expert tasks than their benchmark scores imply.
A new benchmark, "Agents' Last Exam," is designed to test agents on completing genuine expert-level work.
Even the strongest available agents fail most tasks on this benchmark.
Existing benchmarks systematically overstate agent readiness for real-world automation.

Key quotes

Today's frontier agents are far less ready for real-world automation than their benchmark scores suggest.

Even strong agents of today are nowhere [near completing real expert work].