The Information Machine

Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-11

A new benchmark called "Agents' Last Exam" tests AI agents on real expert-level work and finds that even the strongest frontier agents consistently fail, exposing a large gap between benchmark scores and practical automation readiness.

Extraction

Topics: ai-agents, benchmarks, real-world-automation, frontier-models

Claims

  • Current frontier AI agents perform far worse on real expert tasks than their benchmark scores imply.
  • A new benchmark, "Agents' Last Exam," is designed to test agents on completing genuine expert-level work.
  • Even the strongest available agents fail most tasks on this benchmark.
  • Existing benchmarks systematically overstate agent readiness for real-world automation.

Key quotes

Today's frontier agents are far less ready for real-world automation than their benchmark scores suggest.
Even strong agents of today are nowhere [near completing real expert work].