New Agents' Last Exam benchmark finds frontier AI agents score under 10% on realistic expert tasks · Digg