New Agents' Last Exam benchmark finds frontier AI agents score under 10% on realistic expert tasks · Digg


