The Information Machine

AI Agents Underperform on Real-World Tasks: CAPTCHAs, Expert Benchmarks, and Memory Quality Failures

Synthesis history

2 versions, newest first.

  1. Version 2 2026-06-15 08:31 UTC · 17 items

    Three new items from June 14 extend the thread's failure-mode taxonomy. The most significant addition is a University of Texas study showing agents degrade in reliability post-deployment as context accumulates — a dynam…

  2. Version 1 2026-06-14 02:20 UTC · 12 items

    Three research benchmarks published in June 2026 show AI agents failing consistently at tasks where conventional benchmark scores imply competence. The Agents' Last Exam benchmark finds frontier agents score under 10% o…