AI Agents Underperform Real-World Tasks: CAPTCHAs, Expert Benchmarks, and Memory Quality Failures

Synthesis history

6 versions, newest first.

Version 6 2026-06-26 18:22 UTC · 43 items

Three substantive new items extend the failure-mode catalog. Rohan Paul added RAG hallucination (June 25) — LLMs invent answers even when constrained to supplied documents — as a seventh structural failure mode and a ne…
Version 5 2026-06-24 02:22 UTC · 38 items

Two substantive new items extend the memory and generalization failure story. Rohan Paul reported (June 19, 22) that generalist agents must remember hidden environment rules — not just observable states — and that LLMs …
Version 4 2026-06-19 18:23 UTC · 32 items

Two substantive new findings this pass. Rohan Paul reported a solve/judge dissociation: frontier models can solve math correctly but cannot evaluate the validity of reasoning in others' solutions, adding a meta-cognitiv…
Version 3 2026-06-15 18:21 UTC · 26 items

The new items this pass are all thin — mostly undated commercial blog posts and practitioner guides about 'agent drift' and 'context drift' with no claims or sourced findings, plus two content-free tweets. None introduc…
Version 2 2026-06-15 08:31 UTC · 17 items

Three new items from June 14 extend the thread's failure-mode taxonomy. The most significant addition is a University of Texas study showing agents degrade in reliability post-deployment as context accumulates — a dynam…
Version 1 2026-06-14 02:20 UTC · 12 items

Three research benchmarks published in June 2026 show AI agents failing consistently at tasks where conventional benchmark scores imply competence. The Agents' Last Exam benchmark finds frontier agents score under 10% o…