Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-11
A new benchmark called "Agents' Last Exam" tests AI agents on real expert-level work and finds that even the strongest frontier agents consistently fail, exposing a large gap between benchmark scores and practical automation readiness.
Extraction
Topics: ai-agents, benchmarks, real-world-automation, frontier-models
Claims
- Current frontier AI agents perform far worse on real expert tasks than their benchmark scores imply.
- A new benchmark, "Agents' Last Exam," is designed to test agents on completing genuine expert-level work.
- Even the strongest available agents fail most tasks on this benchmark.
- Existing benchmarks systematically overstate agent readiness for real-world automation.
Key quotes
- "Today's frontier agents are far less ready for real-world automation than their benchmark scores suggest."
- "Even strong agents of today are nowhere [near completing real expert work]."