Research: Agent Performance Is a Systems Problem, Not a Prompts Problem · history

Version 3

2026-05-25 10:50 UTC · 58 items

Changes since v2

The most significant development this pass is the institutional formalization of 'context engineering': an arXiv paper[^19665] and multiple enterprise frameworks[^19666][^19668] now anchor the term beyond practitioner preference, giving it the kind of academic and industry documentation that typically precedes widespread adoption. Meta's ARE paper is receiving substantially broader amplification than previously tracked, with explainers, podcasts, and researcher commentary[^19669][^19670][^19672][^19673][^19674][^19675] signaling the evaluation-infrastructure argument is reaching practitioner audiences. Warp's publication of their evaluation harness design methodology[^19676] is new in kind — it moves from citing a score to explaining the harness engineering behind it, providing the clearest practitioner confirmation yet that top-ranked teams treat harness design as a distinct, disclosable competitive asset.

What

The systems-centric view of AI agent performance is now institutionally grounded across research, practitioner, and enterprise layers. 'Context engineering' has been formalized in an arXiv paper[3] and enterprise frameworks[4][5], cementing the vocabulary shift away from 'prompt engineering.' Meta's ARE (Meta Agents Research Environments) paper is receiving broad amplification through explainers, podcasts, and researcher commentary,[7][8][9][10][11][12][16] elevating infrastructure-level thinking about agent evaluation into mainstream AI discourse. Warp's engineers have publicly detailed how they designed the evaluation harness behind their SWE-bench Verified and Terminal-Bench scores,[13] offering a rare practitioner account of harness engineering as a competitive discipline.

Why it matters

The field is crossing a threshold from informal argument to formal infrastructure. Three developments carry that weight together: an arXiv paper on context engineering, a Meta research program on agent evaluation environments, and a top-ranked company publishing its harness design methodology. When leading teams publicly credit their harness design, rather than their model or prompts, for benchmark performance, that signal can propagate through hiring, tooling investment, and research agendas.

Open questions

Will the arXiv formalization of 'context engineering'[3] produce a canonical definition the industry converges on, or will the term fragment into vendor-specific interpretations as enterprise frameworks like Atlan's[4] adapt it to data catalog use cases?
Warp's public harness design post[13] mentions 71% on SWE-bench Verified while a separate Warp announcement cites 75.8%[17]. What accounts for this gap, and will other leading teams publish comparable harness methodology disclosures that enable cross-system comparison?
Meta's ARE paper[7] applies infrastructure-level thinking to evaluation environments — will this establish a new evaluation standard where harness configuration is a required disclosure alongside model version, or will benchmark leaderboards continue to obscure the harness contribution?
As context engineering frameworks mature for enterprise use[4][5], will they converge with agent memory architectures[18][19] into a unified discipline, or evolve as separate fields with separate benchmarks and tooling?

Narrative

The argument that AI agent performance is a systems problem, not a model problem and not a prompts problem, has moved from practitioner observation to formal research and institutional documentation. The central claim, from a paper summarized by researcher Rohan Paul, is that the 'harness' surrounding a model (its orchestration logic, tool wiring, memory management, and context construction) shapes an agent's capability more than prompts alone do.[1] A complementary finding from Meta showed that coding agents improve substantially when given compressed summaries of prior solution attempts rather than raw execution logs,[2] identifying information architecture as a performance bottleneck distinct from compute scaling.
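
The summarized-memory idea can be illustrated with a minimal Python sketch. This is not the method from the Meta paper: the `Attempt` record, the keyword-based `summarize_attempt` heuristic, and `build_retry_context` are all hypothetical names introduced here, and a real system would more plausibly use a model to write the summaries.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One prior solution attempt (hypothetical record, for illustration only)."""
    approach: str  # short description of what was tried
    passed: bool   # whether the test suite passed
    raw_log: str   # full execution log, potentially very long


def summarize_attempt(attempt: Attempt, max_error_chars: int = 120) -> str:
    """Compress an attempt to one line: approach, outcome, last error line."""
    if attempt.passed:
        return f"- {attempt.approach}: passed (all tests passed)"
    # Assumption: lines containing "Error" or "FAILED" carry the useful signal.
    error_lines = [
        line.strip()
        for line in attempt.raw_log.splitlines()
        if "Error" in line or "FAILED" in line
    ]
    detail = error_lines[-1][:max_error_chars] if error_lines else "no error line captured"
    return f"- {attempt.approach}: failed ({detail})"


def build_retry_context(task: str, attempts: list[Attempt]) -> str:
    """Assemble context for the next attempt from summaries, not raw logs."""
    lines = [f"Task: {task}", "Prior attempts:"]
    lines += [summarize_attempt(a) for a in attempts]
    return "\n".join(lines)


if __name__ == "__main__":
    attempts = [
        Attempt("patch parser regex", False,
                "collecting tests\n" + "noise\n" * 500 + "ValueError: bad token at 12\n"),
        Attempt("rewrite tokenizer loop", False,
                "collecting tests\n" + "noise\n" * 500 + "FAILED test_tokens - off by one\n"),
    ]
    context = build_retry_context("fix tokenizer bug", attempts)
    raw_size = sum(len(a.raw_log) for a in attempts)
    print(context)
    print(f"summary chars: {len(context)}, raw log chars: {raw_size}")
```

The design point is that the next attempt sees what was tried and why it failed in a few lines, while the bulk of each log never enters the context window.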

The vocabulary for this systems view has now crystallized into a formal term: 'context engineering.' An arXiv paper specifically addresses context engineering as a discipline,[3] joining an enterprise framework from Atlan[4] and a state-of-the-field assessment on Maven[5] that together give the term institutional grounding. The shift from 'prompt engineering' to 'context engineering' is substantive, not cosmetic: it reframes the engineering challenge from crafting text inputs to designing the information environment — what a model receives, when it receives it, how past context is compressed and retrieved. LinkedIn practitioner commentary reflects how widely this reframing has propagated in 2026 AI discussions.[6]
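
As a rough illustration of "designing the information environment," the sketch below selects which pieces of information reach the model under a fixed budget. It is an assumption-laden toy, not a definition taken from the cited paper or frameworks: `ContextItem`, `estimate_tokens` (a word count standing in for a real tokenizer), and the greedy `assemble_context` policy are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    """A candidate piece of context (hypothetical type, for illustration)."""
    label: str
    text: str
    priority: int  # lower number = more important


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def assemble_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Greedily keep the highest-priority items that still fit in the budget."""
    chosen = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen


if __name__ == "__main__":
    items = [
        ContextItem("raw_log", "log line " * 500, priority=2),
        ContextItem("instructions", "Fix the failing tokenizer test.", priority=0),
        ContextItem("attempt_summary", "Regex patch failed with a bad token error.", priority=1),
    ]
    for item in assemble_context(items, budget=50):
        print(item.label)
```

Under this policy the oversized raw log is dropped while the instructions and the compressed summary are kept, which is the kind of decision the 'context engineering' frame treats as the core engineering work.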

Meta's ARE (Meta Agents Research Environments) paper has become a focal point for the infrastructure dimension of this argument. The paper applies systems-level thinking not just to how agents operate but to how they are evaluated, treating evaluation environments as scalable infrastructure requiring their own engineering discipline.[7] The paper is receiving broad amplification: blog explainers,[8][9] a researcher video,[10] a practitioner podcast episode,[11] and LinkedIn commentary from researcher Thomas Scialom.[12] This amplification pattern suggests the ARE framing is resonating beyond the research community into the practitioner layer that actually builds agent systems.

The competitive stakes of harness design are illustrated most directly by Warp, whose engineers published a detailed account of how they designed the evaluation harness behind their SWE-bench Verified and Terminal-Bench results.[13] The post is notable for foregrounding harness methodology — not model selection or prompt design — as the differentiating factor in benchmark performance. This joins Martin Fowler's practitioner guide on harness engineering[14] and LangChain's anatomy-of-a-harness documentation[15] as a growing body of practitioner literature treating harness design as a distinct, teachable engineering discipline. The measurement challenge remains: cross-harness ablations on the same model are absent from public benchmarks, making it difficult to quantify how much of a given score is attributable to the harness versus the model — the empirical gap at the center of the systems-centric argument.
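
The missing measurement can be sketched as a small model-by-harness grid. The Python below is a hypothetical illustration only: the scores are synthetic placeholders, not real benchmark results, and the names `scores`, `marginal_means`, and `spread` are assumptions introduced here. Comparing marginal means is also a deliberately simple attribution method; it ignores interaction effects between model and harness.

```python
from statistics import mean

# Synthetic placeholder scores, NOT real benchmark results.
# Keys are (model, harness); values are a pass rate in [0, 1].
scores = {
    ("model_a", "harness_x"): 0.40,
    ("model_a", "harness_y"): 0.60,
    ("model_b", "harness_x"): 0.50,
    ("model_b", "harness_y"): 0.70,
}


def marginal_means(scores, axis):
    """Average score for each level of one factor (0 = model, 1 = harness)."""
    groups = {}
    for key, value in scores.items():
        groups.setdefault(key[axis], []).append(value)
    return {level: mean(values) for level, values in groups.items()}


def spread(means):
    """Gap between the best and worst level of a factor."""
    return max(means.values()) - min(means.values())


if __name__ == "__main__":
    model_means = marginal_means(scores, axis=0)
    harness_means = marginal_means(scores, axis=1)
    print("mean score by model:", model_means)
    print("mean score by harness:", harness_means)
    print("model spread:", round(spread(model_means), 3))
    print("harness spread:", round(spread(harness_means), 3))
```

With a grid like this filled in from real runs, the harness contribution could be compared directly against the model contribution, which is what public leaderboards currently do not allow.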

Timeline

2026-05-12: Hacker News discussion surfaces emotional and human-factors costs of AI-assisted coding [29]
2026-05-15: Industry analysis frames Salesforce's agentic AI bets as potentially its most valuable business unit [30]
2026-05-18: LangChain's strategic direction on agent harnesses draws practitioner endorsement [21]
2026-05-19: Practitioner commentary notes widespread model benchmarking without equivalent harness or systems evaluation [26]
2026-05-23: Rohan Paul synthesizes research arguing the agent harness is the primary determinant of agent performance, not the model or prompts [1]
2026-05-23: Rohan Paul highlights Meta paper showing coding agents improve substantially when given summarized rather than raw prior-attempt logs [2]
2026-05: Multiple independent publications shift vocabulary from 'prompt engineering' to 'context engineering' as the preferred frame for agent information architecture [22][23][27]
2026-05: Multiple 2026 state-of-AI-agent-memory surveys appear, broadening the memory architecture conversation into a recognized field-wide concern [18][25][19]
2026-05: Meta's ARE paper on scaling agent environments and evaluations enters this thread, applying infrastructure-level thinking to the evaluation problem; its arXiv identifier (2509.17158) indicates a September 2025 submission [20][7]
2026-05: ARE paper receives broad amplification through explainers, practitioner podcasts, and researcher commentary [8][9][11][12][10][16]
2026-05: arXiv paper on 'context engineering' formalizes the discipline; enterprise frameworks and state-of-field assessments follow [3][4][6][5]
2026-05: Warp engineers publish detailed account of evaluation harness design behind their SWE-bench Verified and Terminal-Bench results [13]
2026-05: Warp reports 75.8% on SWE-bench Verified, illustrating competitive stakes of harness-level performance optimization [17]

Perspectives

Rohan Paul (@rohanpaul_ai)

Strong advocate for a systems-centric view of agent performance; argues harness design and memory architecture are the primary levers, with model and prompt choices being secondary.

Evolution: Consistent across both posts; the two arguments are complementary — one systemic (harness), one specific (memory summarization).

[1][2]

Meta Research

Empirical and infrastructure-focused: coding agents benefit from summarized past-attempt memory over raw logs; the ARE paper extends this into a broader research program treating agent evaluation environments as scalable infrastructure requiring their own engineering discipline.

Evolution: The ARE paper's broad amplification this pass[8][9][11][12][10] signals the framing is resonating beyond the research community into practitioner circles, broadening Meta's influence on the systems-centric argument.

[2][20][7][8][9][11][12][10][16]

Martin Fowler

Practitioner-oriented: harness engineering is a distinct skill set that coding-agent users should develop, separate from prompt engineering or model evaluation.

Evolution: Consistent with his broader orientation toward software engineering discipline.

[14]

LangChain

Infrastructure provider actively defining and promoting the 'agent harness' framing, publishing anatomical documentation to establish vocabulary and architecture patterns.

Evolution: Consistent with LangChain's commercial interest in harness-layer tooling.

[15][21]

Context engineering advocates (arXiv, Atlan, Neo4j, DataHub, Maven)

Argue that 'context engineering' — the discipline of designing what information a model receives and when — is the correct successor frame to 'prompt engineering,' representing a shift from input crafting to information architecture design. An arXiv paper now formalizes the discipline.

Evolution: Significantly strengthened this pass: the vocabulary shift is now backed by a formal academic paper[3] and enterprise framework documentation[4][5], moving from practitioner preference to institutional grounding.

[22][23][3][4][6][5]

Mem0 / AI agent memory research community

Treats AI agent memory as a multi-faceted, field-level engineering problem with distinct architectures, benchmarks, and tradeoffs — not merely a context-window management issue.

Evolution: Consistent with prior contributions; the broader context engineering framing now provides an umbrella term that encompasses memory architecture.

[24][18][25][19]

Warp (Abhishek P. and team)

Competitive practitioner: publishes both performance claims (71% on SWE-bench Verified and first place on Terminal-Bench in the harness post,[13] 75.8% on SWE-bench Verified in a separate announcement[17]) and explicit harness design methodology, directly crediting harness engineering, not model selection, as the differentiating factor.

Evolution: Expanded from citing a benchmark score to publishing a practitioner account of evaluation harness design methodology,[13] making Warp the clearest example of a company treating harness engineering as a public-facing competitive asset.

[17][13]

Ethan (@DuoEthan)

Critical of the current state of agent evaluation: observes that model benchmarking is ubiquitous while systems-level evaluation is absent.

Evolution: Single data point; no evolution to track.

[26]

Tensions

Prompt-centric vs. systems-centric improvement: The dominant industry practice of tuning prompts and selecting models is implicitly challenged by the claim that the harness is the real determinant of agent behavior.[1] No named voice actively defends the prompt-centric view in this thread, but the tension is structural — much existing tooling and evaluation infrastructure is built around it.[26] [1][26]
'Prompt engineering' vs. 'context engineering' as vocabulary: An arXiv paper,[3] enterprise frameworks,[4][5] and practitioner commentary[6] argue that 'context engineering' better describes the actual challenge — curating the information environment, not crafting text — while the legacy 'prompt engineering' frame continues to dominate industry job titles and tooling. This is partly a vocabulary debate but also a substantive disagreement about what the core engineering problem is. [3][4][6][5][22][23][27]
More attempts vs. better memory: A naive approach to improving coding agents is to run more attempts; the Meta finding argues that the mechanism for retaining and reusing past attempts matters as much as attempt count.[2] This creates a tension between compute-scaling intuitions and information-architecture approaches to agent improvement. [2]
Benchmarks measure models, not systems: Existing agent benchmarks are typically reported against model versions and vendor systems, not harness configurations — making it difficult to isolate harness contributions to performance. Warp's benchmark results[13][17] are system-level outcomes, not harness-ablation studies, even as Warp explicitly credits harness design as the differentiating factor. The practitioner commentary[26] and Meta's ARE paper[7] both imply this is a measurement gap, but no authoritative benchmark framework for harness comparison yet exists. [26][7][13][17][28][1]

Sources

[1] This paper shows that agent performance depends less on prompts alone and more on the harness around them. — Rohan Paul Twitter (2026-05-23)
[2] Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw logs… — Rohan Paul Twitter (2026-05-23)
[3] [PDF] Context Engineering: - arXiv — reactive:agent-performance-architecture
[4] Context Engineering Framework for Enterprise AI in 2026 | Atlan — reactive:agent-performance-architecture
[5] State of Context Engineering in 2026 - Maven — reactive:agent-performance-architecture
[6] Paweł Huryn's Post - LinkedIn — reactive:agent-performance-architecture
[7] [2509.17158] ARE: Scaling Up Agent Environments and Evaluations — reactive:agent-performance-architecture
[8] Meta AI Researcher Explains ARE and Gaia2: Scaling Up Agent ... — reactive:agent-performance-architecture
[9] ARE: Scaling Up Agent Environments and Evaluations - Medium — reactive:agent-performance-architecture
[10] Meta AI Researcher explains ARE: scaling up agent environments ... — reactive:agent-performance-architecture
[11] Meta AI Researcher Explains ARE and Gaia2: Scaling Up Agent Environments and Evaluations — reactive:agent-performance-architecture
[12] 🚀 ARE: scaling up agent environments and evaluations Everyone talks… | Thomas Scialom, PhD | 11 comments — reactive:agent-performance-architecture
[13] Warp just reached first place on Terminal-Bench and scored 71% on SWE-Bench Verified. Here's how we designed the evaluation harness that these benchmarks ran on. Had a great time working with Roland… | Abhishek P. | 13 comments — reactive:agent-performance-architecture
[14] Harness engineering for coding agent users - Martin Fowler — reactive:agent-performance-architecture
[15] The Anatomy of an Agent Harness - LangChain — reactive:agent-performance-architecture
[16] ARE: scaling up agent environments and evaluations - arXiv — reactive:agent-performance-architecture
[17] Warp: Warp scores 75.8% on SWE-bench Verified! — reactive:agent-performance-architecture
[18] State of AI Agent Memory 2026: Benchmarks, Architectures ... - Mem0 — reactive:agent-performance-architecture
[19] The State of AI Agent Memory in 2026: What the Research Actually ... — reactive:agent-performance-architecture
[20] ARE: scaling up agent environments and evaluations - Meta AI — reactive:meta-surveillance-layoffs
[21] @LangChain_OSS @hwchase17 Really strong direction. — reactive:agent-performance-architecture (2026-05-18)
[22] Why AI Teams Are Moving From Prompt Engineering to Context ... — reactive:agent-performance-architecture
[23] Context Engineering vs Prompt Engineering | DataHub — reactive:agent-performance-architecture
[24] [PDF] Mem0: Building Production-Ready AI Agents with - arXiv — reactive:agent-performance-architecture
[25] The Brains Behind the Bots: A Comprehensive Guide to AI Agent ... — reactive:agent-performance-architecture
[26] Everyone is benchmarking models. — reactive:agent-performance-architecture (2026-05-19)
[27] [D] AI infra vs prompt engineering : r/MachineLearning - Reddit — reactive:agent-performance-architecture
[28] Architectural Design Decisions in AI Agent Harnesses - arXiv — reactive:ai-agent-architecture-limits
[29] The Emotional Cost of AI-Assisted Coding — reactive:agent-performance-architecture (2026-05-12)
[30] There's a $50B company hiding inside Salesforce — reactive:coding-agent-industry-pivot (2026-05-15)