Research: Agent Performance Is a Systems Problem, Not a Prompts Problem

Version 2

2026-05-25 04:46 UTC · 46 items

Changes since v1

The most significant development this pass is a vocabulary crystallization: 'context engineering' is emerging as the preferred successor to 'prompt engineering' across multiple independent publications, reframing what AI teams should be optimizing.[6][7][8] The AI agent memory conversation has also broadened from a single Meta finding into a recognized field-wide concern, with multiple 2026 state-of-the-field surveys.[3][5] Two additional developments sharpen the empirical grounding: Warp's 75.8% SWE-bench Verified score[10] adds a concrete competitive data point, and Meta's ARE paper[11] extends the infrastructure argument to the evaluation layer itself, applying systems-level thinking to how agents are measured, not just how they operate.

What

The systems-centric view of AI agent performance — that the surrounding infrastructure ('harness') shapes agent behavior more than model or prompt choices alone — is crystallizing into new vocabulary and tooling. A significant linguistic shift is underway: multiple independent sources now prefer 'context engineering' over 'prompt engineering' as the frame for what AI teams should be optimizing.[6][7] The AI agent memory conversation has broadened from a single Meta finding into a recognized field-wide concern, with multiple 2026 state-of-the-field surveys appearing.[3][5] Warp's 75.8% score on SWE-bench Verified[10] illustrates real-world competitive stakes, while SWE-bench's own documentation formalizes a 'harness' concept at the evaluation layer itself.[9]

Why it matters

The shift from 'prompt engineering' to 'context engineering' is significant not just as terminology but as a reframing of where engineering effort should go — from crafting text inputs to designing information architecture. If standardized evaluations now treat the harness as a first-class architectural component,[9] that creates the infrastructure for controlled ablations that would let teams rigorously measure harness versus model contributions to performance — the measurement gap that has made the systems-centric argument hard to prove empirically.
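A controlled ablation of this kind can be pictured as a harness-by-model grid in which the same task set is run under every pairing. The sketch below is illustrative only: the harness names, model names, and resolve rates are made-up placeholders, not results from any source cited here. It compares how much the average score moves when the harness changes versus when the model changes.

```python
from statistics import mean

# Hypothetical resolve rates (fraction of tasks solved) for each
# (harness, model) pairing. Placeholder numbers, not real benchmark data.
scores = {
    ("harness_a", "model_x"): 0.40,
    ("harness_a", "model_y"): 0.50,
    ("harness_b", "model_x"): 0.60,
    ("harness_b", "model_y"): 0.70,
}

def marginal_means(scores, axis):
    """Average score for each level of one factor (0 = harness, 1 = model)."""
    levels = sorted({key[axis] for key in scores})
    return {
        level: mean(v for k, v in scores.items() if k[axis] == level)
        for level in levels
    }

def spread(means):
    """Gap between the best and worst level of a factor."""
    return max(means.values()) - min(means.values())

harness_effect = spread(marginal_means(scores, 0))
model_effect = spread(marginal_means(scores, 1))
print(f"harness spread: {harness_effect:.2f}, model spread: {model_effect:.2f}")
```

With these placeholder numbers the harness spread is larger than the model spread, but that outcome is built into the invented data. The point is only the shape of the measurement: every cell of the grid has to be filled by running the same tasks, which is what public leaderboards currently do not report.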

Open questions

Will 'context engineering' displace 'prompt engineering' as the standard industry term, and will that vocabulary shift meaningfully accelerate investment in harness and memory infrastructure?[6][7]
How large is the actual performance delta attributable to harness design versus model capability in controlled ablations? Warp's 75.8% SWE-bench score[10] is a competitive data point, but cross-harness comparisons on the same model remain absent from public benchmarks.[9]
As agent memory becomes a recognized field with its own surveys and benchmarks,[3][5] will standardized memory evaluation metrics emerge alongside harness evaluation metrics — and how will the two interact?
What are the failure modes of summarizing past attempts? Summaries compress information, and it remains unclear when the detail lost to summarization costs more than the noise that raw logs introduce.[2]

Narrative

The central argument gaining traction in May 2026 is that AI agent performance should be understood as a systems problem — not a model problem and not a prompts problem. The 'harness' — the orchestration logic, tool wiring, memory management, and context construction surrounding the model — is where most emergent capability lives, according to researcher Rohan Paul, who synthesized research arguing that 'many AI agents look like 1 model, but their real behavior comes from surrounding code' and that 'agent intelligence is becoming partly a systems problem.'[1] This claim directly challenges the dominant industry practice of selecting models on benchmarks and tuning prompts, suggesting that both activities optimize the wrong variable.
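To make the four harness responsibilities named above concrete, here is a minimal sketch of an agent loop in which the model is an interchangeable callable and everything else is harness code. All names (`TOOLS`, `build_context`, `run_agent`, `stub_model`) and the "tool_name argument" reply format are assumptions invented for illustration; they do not come from any cited framework, and the model is a deterministic stub rather than a real LLM.

```python
from typing import Callable

# Tool wiring: the harness, not the model, decides which tools exist.
TOOLS: dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}

def build_context(task: str, memory: list[str], max_notes: int = 3) -> str:
    """Context construction: the task plus only the most recent notes."""
    notes = "\n".join(memory[-max_notes:])
    return f"TASK: {task}\nNOTES:\n{notes}"

def run_agent(model: Callable[[str], str], task: str, max_steps: int = 5) -> str:
    """Orchestration loop: call the model, route tool calls, record results."""
    memory: list[str] = []  # memory management lives in the harness
    for _ in range(max_steps):
        reply = model(build_context(task, memory))
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        # Assumed reply format for tool calls: "<tool_name> <argument>".
        name, _, arg = reply.partition(" ")
        tool = TOOLS.get(name)
        result = tool(arg) if tool else f"unknown tool {name!r}"
        memory.append(f"{reply} -> {result}")
    return "gave up"

def stub_model(context: str) -> str:
    """Stand-in for an LLM: calls one tool, then answers from its notes."""
    if "->" not in context:
        return "upper hello"
    return "FINAL: " + context.rsplit("-> ", 1)[1]

print(run_agent(stub_model, "shout hello"))
```

Changing `max_notes`, the tool registry, or the step limit changes what the agent can do without touching the model at all, which is the sense in which behavior "comes from surrounding code."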

A parallel and complementary finding from Meta reinforces the memory dimension of this argument: coding agents perform substantially better when given short summaries of prior solution attempts rather than raw execution logs.[2] The implication is that the bottleneck is not compute — running more attempts — but information architecture: how past experience is encoded, compressed, and made accessible to the model on subsequent attempts. This finding is supported by a broader body of 2026 work on AI agent memory, now mature enough to generate multiple state-of-the-field surveys and benchmark efforts tracking how different memory architectures affect agent behavior.[3][4][5]
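The difference between raw logs and summaries can be sketched in a few lines. This is not the method from the Meta paper, whose details are not given in the sources here; it is a deliberately crude heuristic (keep the approach and the last error line) chosen only to show where the compression step sits. The `Attempt` type, `summarize`, and `next_prompt` are hypothetical names, and a real system would more plausibly ask a model to write the summary.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    approach: str   # one-line description of what was tried
    raw_log: str    # full execution output, possibly very long
    passed: bool

def summarize(attempt: Attempt) -> str:
    """Reduce an attempt to one line: approach plus outcome or last error."""
    if attempt.passed:
        return f"{attempt.approach}: passed"
    errors = [ln.strip() for ln in attempt.raw_log.splitlines() if "Error" in ln]
    reason = errors[-1] if errors else "failed (no error line found)"
    return f"{attempt.approach}: {reason}"

def next_prompt(task: str, history: list[Attempt]) -> str:
    """Build the next attempt's context from summaries, not raw logs."""
    lines = [f"- {summarize(a)}" for a in history]
    return f"{task}\nPrevious attempts:\n" + "\n".join(lines)

# Invented example data: two failed attempts with long, noisy logs.
history = [
    Attempt("patch parser regex", "collecting...\n" * 200 + "ValueError: bad token", False),
    Attempt("rewrite tokenizer", "collecting...\n" * 200 + "AssertionError: 3 != 4", False),
]
prompt = next_prompt("Fix failing test_parse", history)
raw_size = sum(len(a.raw_log) for a in history)
print(len(prompt), "chars of context instead of", raw_size)
```

The heuristic also shows the risk: anything in the log that does not match the summarizer's notion of relevance is silently dropped before the model ever sees it.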

The vocabulary for this systems-centric view is now coalescing around a new term: 'context engineering.' Multiple sources have identified a shift away from 'prompt engineering' as the dominant frame, with AI teams and practitioners increasingly preferring 'context engineering' to describe the challenge of curating what information a model receives and when.[6][7] A Reddit thread in the MachineLearning community has similarly surfaced the AI infrastructure versus prompt engineering debate.[8] This framing is broader than harness engineering per se — it encompasses retrieval strategy, memory design, and context window management — but it points in the same direction: the real engineering challenge is building the information environment around the model, not crafting the text inputs to it.
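Context window management, one of the components listed above, often reduces to deciding what fits. The following sketch assumes a simple greedy policy: rank candidate snippets by a priority score and keep the highest-ranked ones that fit a fixed budget. The function name, the priority values, the sample snippets, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions for illustration, not a description of any cited tool.

```python
def pack_context(candidates, budget):
    """Greedily keep the highest-priority snippets that fit the budget.

    candidates: list of (priority, text); higher priority is considered first.
    budget: maximum total size, measured in whitespace-separated words
            as a crude stand-in for model tokens.
    """
    chosen, used = [], 0
    for priority, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = len(text.split())
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen, used

# Invented snippets: task statement, a prior-attempt summary, a raw log excerpt.
candidates = [
    (1, "Full CI log line one line two line three line four line five"),
    (3, "Task: fix the failing date parser test"),
    (2, "Summary of attempt 1: regex patch broke ISO week handling"),
]
chosen, used = pack_context(candidates, budget=18)
for text in chosen:
    print(text)
```

Here the raw log excerpt is the item that gets dropped, because the task statement and the summary already consume most of the budget. How priorities are assigned is the real design question, and this sketch leaves it open.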

The institutional infrastructure around these ideas is catching up. SWE-bench, the widely used coding agent benchmark, maintains formal documentation of its own 'harness' concept, treating evaluation infrastructure as a named architectural component.[9] Warp's 75.8% score on SWE-bench Verified[10] illustrates the competitive stakes: companies are now racing to push agent performance on standardized benchmarks, and harness design is a key variable in that race. Meta's ARE (Agents Research Environments) research[11] extends the infrastructure argument to the evaluation layer itself, applying systems-level thinking not just to how agents operate but to how they are measured. Practitioners formalizing this sub-discipline include Martin Fowler, who has published a practitioner-oriented guide on harness engineering for coding agent users,[12] and LangChain, which has published anatomy-of-a-harness documentation to establish shared vocabulary.[13]

Timeline

2026-05-12: Hacker News discussion surfaces emotional and human-factors costs of AI-assisted coding [19]
2026-05-15: Industry analysis frames Salesforce's agentic AI bets as potentially its most valuable business unit [20]
2026-05-18: LangChain's strategic direction on agent harnesses draws practitioner endorsement [15]
2026-05-19: Practitioner commentary notes widespread model benchmarking without equivalent harness or systems evaluation [17]
2026-05-23: Rohan Paul synthesizes research arguing the agent harness is the primary determinant of agent performance, not the model or prompts [1]
2026-05-23: Rohan Paul highlights Meta paper showing coding agents improve substantially when given summarized rather than raw prior-attempt logs [2]
2026-05: Warp reports 75.8% on SWE-bench Verified, illustrating competitive stakes of harness-level performance optimization [10]
2026-05: Multiple independent publications shift vocabulary from 'prompt engineering' to 'context engineering' as the preferred frame for agent information architecture [6][7][8]
2026-05: Multiple 2026 state-of-AI-agent-memory surveys appear, broadening the memory architecture conversation into a recognized field-wide concern [3][4][5]
2026-05: Meta publishes ARE paper on scaling agent environments and evaluations, applying infrastructure-level thinking to the evaluation problem [11]

Perspectives

Rohan Paul (@rohanpaul_ai)

Strong advocate for a systems-centric view of agent performance; argues harness design and memory architecture are the primary levers, with model and prompt choices being secondary. Frames this as a counterintuitive finding that should redirect engineering focus.

Evolution: Consistent across both posts; the two arguments are complementary — one systemic (harness), one specific (memory summarization).

[1][2]

Meta Research

Empirical: coding agents benefit significantly from summarized past-attempt memory over raw logs. The ARE paper extends this to infrastructure-level thinking about how agent evaluation environments should be scaled, not just how agents should be built.

Evolution: The ARE paper[11] broadens Meta's contribution from a single memory-summarization finding to a broader research program on agent evaluation infrastructure.

[2][14][11]

Martin Fowler

Practitioner-oriented: harness engineering is a distinct skill set that coding-agent users should develop, separate from prompt engineering or model evaluation.

Evolution: Consistent with his broader orientation toward software engineering discipline; this extends it to the AI agent domain.

[12]

LangChain

Infrastructure provider actively defining and promoting the 'agent harness' framing, publishing anatomical documentation to establish vocabulary and architecture patterns.

Evolution: Consistent with LangChain's commercial interest in harness-layer tooling; the direction is endorsed by practitioners in the thread.

[13][15]

Neo4j / DataHub / context engineering advocates

Argue that 'context engineering' — the discipline of designing what information a model receives and when — is the correct successor frame to 'prompt engineering,' representing a shift from input crafting to information architecture design.

Evolution: New voice in this pass; represents a vocabulary crystallization of arguments that were previously implicit in the harness-engineering framing.

[6][7]

Mem0 / AI agent memory research community

Treats AI agent memory as a multi-faceted, field-level engineering problem with distinct architectures, benchmarks, and tradeoffs — not merely a context-window management issue.

Evolution: Consistent with prior Mem0 contributions; expanded by the emergence of multiple independent 2026 state-of-memory surveys signaling field maturation.

[16][3][4][5]

Ethan (@DuoEthan)

Critical of the current state of agent evaluation: observes that model benchmarking is ubiquitous while systems-level evaluation is absent, implicitly supporting the idea that the field is measuring the wrong things.

Evolution: Single data point; no evolution to track.

[17]

Warp

Competitive practitioner: publishes SWE-bench Verified scores (75.8%) as a performance claim, implicitly treating harness optimization as a key differentiator in the coding agent market.

Evolution: New voice in this pass; adds a concrete commercial data point to what has been a largely research-and-practitioner-writing discussion.

[10]

Tensions

Prompt-centric vs. systems-centric improvement: The dominant industry practice of tuning prompts and selecting models is implicitly challenged by the claim that the harness is the real determinant of agent behavior.[1] No named voice actively defends the prompt-centric view in this thread, but the tension is structural: much existing tooling and evaluation infrastructure is built around it.[17]
'Prompt engineering' vs. 'context engineering' as vocabulary: Neo4j, DataHub, and a Reddit ML thread argue that 'context engineering' better describes the actual challenge (curating the information environment, not crafting text), while the legacy 'prompt engineering' frame continues to dominate industry job titles and tooling.[6][7][8] This is partly a vocabulary debate but also a substantive disagreement about what the problem statement is.
More attempts vs. better memory: A naive approach to improving coding agents is to run more attempts; the Meta finding argues that the mechanism for retaining and reusing past attempts matters as much as attempt count.[2] This creates a tension between compute-scaling intuitions and information-architecture approaches to agent improvement.
Benchmarks measure models, not systems: Existing agent benchmarks (e.g., SWE-bench) are typically reported against model versions and vendor systems, not harness configurations, which makes it difficult to isolate harness contributions to performance. Warp's 75.8% score[10] is a system-level result, not a harness-ablation study. The practitioner commentary[17] and harness-centric research[1][18] both imply this is a measurement gap, but no authoritative benchmark framework for harness comparison yet exists.[9]

Sources

[1] This paper shows that agent performance depends less on prompts alone and more on the harness around them. — Rohan Paul Twitter (2026-05-23)
[2] Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw logs… — Rohan Paul Twitter (2026-05-23)
[3] State of AI Agent Memory 2026: Benchmarks, Architectures ... - Mem0 — reactive:agent-performance-architecture
[4] The Brains Behind the Bots: A Comprehensive Guide to AI Agent ... — reactive:agent-performance-architecture
[5] The State of AI Agent Memory in 2026: What the Research Actually ... — reactive:agent-performance-architecture
[6] Why AI Teams Are Moving From Prompt Engineering to Context ... — reactive:agent-performance-architecture
[7] Context Engineering vs Prompt Engineering | DataHub — reactive:agent-performance-architecture
[8] [D] AI infra vs prompt engineering : r/MachineLearning - Reddit — reactive:agent-performance-architecture
[9] The Harness - SWE-bench — reactive:agent-performance-architecture
[10] Warp: Warp scores 75.8% on SWE-bench Verified! — reactive:agent-performance-architecture
[11] ARE: scaling up agent environments and evaluations - Meta AI — reactive:meta-surveillance-layoffs
[12] Harness engineering for coding agent users - Martin Fowler — reactive:agent-performance-architecture
[13] The Anatomy of an Agent Harness - LangChain — reactive:agent-performance-architecture
[14] Meta paper reveals improved coding agents through summary reuse — reactive:agent-performance-architecture
[15] @LangChain_OSS @hwchase17 Really strong direction. — reactive:agent-performance-architecture (2026-05-18)
[16] [PDF] Mem0: Building Production-Ready AI Agents with - arXiv — reactive:agent-performance-architecture
[17] Everyone is benchmarking models. — reactive:agent-performance-architecture (2026-05-19)
[18] Architectural Design Decisions in AI Agent Harnesses - arXiv — reactive:ai-agent-architecture-limits
[19] The Emotional Cost of AI-Assisted Coding — reactive:agent-performance-architecture (2026-05-12)
[20] There's a $50B company hiding inside Salesforce — reactive:coding-agent-industry-pivot (2026-05-15)