The Information Machine

Research: Agent Performance Is a Systems Problem, Not a Prompts Problem

Version 2

2026-05-25 04:46 UTC · 46 items

What

The systems-centric view of AI agent performance — that the surrounding infrastructure ('harness') shapes agent behavior more than model or prompt choices alone — is crystallizing into new vocabulary and tooling. A significant linguistic shift is underway: multiple independent sources now prefer 'context engineering' over 'prompt engineering' as the frame for what AI teams should be optimizing.[6][7] The AI agent memory conversation has broadened from a single Meta finding into a recognized field-wide concern, with multiple 2026 state-of-the-field surveys appearing.[3][5] Warp's 75.8% score on SWE-bench Verified[10] illustrates real-world competitive stakes, while SWE-bench's own documentation formalizes a 'harness' concept at the evaluation layer itself.[9]

Why it matters

The shift from 'prompt engineering' to 'context engineering' matters as more than terminology: it reframes where engineering effort should go, from crafting text inputs to designing information architecture. If standardized evaluations now treat the harness as a first-class architectural component,[9] they provide the infrastructure for controlled ablations that hold the model fixed and vary the harness. Such ablations would let teams measure harness and model contributions separately, closing the measurement gap that has made the systems-centric argument hard to prove empirically.
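As a rough illustration of what such an ablation would compute, the sketch below takes a small grid of pass rates keyed by (model, harness) and compares the average spread along each axis. The model and harness names and every number are invented for illustration; none of them are benchmark results, and the spread-based effect measure is just one simple choice among many.

```python
from statistics import mean

# Hypothetical pass rates (fraction of tasks resolved), keyed by (model, harness).
# All values are made up for illustration; they are not benchmark results.
scores = {
    ("model_a", "harness_x"): 0.50,
    ("model_a", "harness_y"): 0.62,
    ("model_b", "harness_x"): 0.58,
    ("model_b", "harness_y"): 0.70,
}

def main_effect(scores, axis):
    """Average score spread along one factor while the other is held fixed.

    axis=0 varies the model within each harness;
    axis=1 varies the harness within each model.
    """
    groups = {}
    for key, value in scores.items():
        fixed = key[1 - axis]  # the factor held constant
        groups.setdefault(fixed, []).append(value)
    return mean(max(values) - min(values) for values in groups.values())

model_effect = main_effect(scores, axis=0)    # swap models, same harness
harness_effect = main_effect(scores, axis=1)  # swap harnesses, same model
print(f"model effect:   {model_effect:.2f}")
print(f"harness effect: {harness_effect:.2f}")
```

In this made-up grid the harness swap moves the score more than the model swap does. A real study would also need repeated runs per cell to separate either effect from run-to-run noise.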

Open questions

  • Will 'context engineering' displace 'prompt engineering' as the standard industry term, and will that vocabulary shift meaningfully accelerate investment in harness and memory infrastructure?[6][7]

  • How large is the actual performance delta attributable to harness design versus model capability in controlled ablations? Warp's 75.8% SWE-bench score[10] is a competitive data point, but cross-harness comparisons on the same model remain absent from public benchmarks.[9]

  • As agent memory becomes a recognized field with its own surveys and benchmarks,[3][5] will standardized memory evaluation metrics emerge alongside harness evaluation metrics — and how will the two interact?

  • What are the failure modes of summarizing past attempts? Summaries are lossy, and it remains unclear when the detail they discard costs more than the noise they remove from raw logs.[2]

Narrative

The central argument gaining traction in May 2026 is that AI agent performance should be understood as a systems problem, not a model problem and not a prompts problem. On this view the 'harness' (the orchestration logic, tool wiring, memory management, and context construction surrounding the model) is where most emergent capability lives. Researcher Rohan Paul, synthesizing research on the topic, writes that 'many AI agents look like 1 model, but their real behavior comes from surrounding code' and that 'agent intelligence is becoming partly a systems problem.'[1] This claim directly challenges the dominant industry practice of selecting models on benchmarks and tuning prompts, suggesting that both activities optimize the wrong variable.

A parallel and complementary finding from Meta reinforces the memory dimension of this argument: coding agents perform substantially better when given short summaries of prior solution attempts rather than raw execution logs.[2] The implication is that the bottleneck is not compute — running more attempts — but information architecture: how past experience is encoded, compressed, and made accessible to the model on subsequent attempts. This finding is supported by a broader body of 2026 work on AI agent memory, now mature enough to generate multiple state-of-the-field surveys and benchmark efforts tracking how different memory architectures affect agent behavior.[3][4][5]
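The difference between the two memory formats can be sketched in a few lines. This is not the method from the Meta paper: the `Attempt` type, the `summarize` rule, and the sample logs are all hypothetical, and a real harness would more likely ask a model to write the summary than use a fixed rule.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    approach: str   # one-line description of what was tried
    raw_log: str    # full execution log, possibly thousands of lines
    passed: bool

def summarize(attempt: Attempt, max_chars: int = 200) -> str:
    """Placeholder summarizer: keeps the approach, the outcome, and the last log line.

    A rule-based stand-in only; a real harness would likely call a model here.
    """
    lines = [ln for ln in attempt.raw_log.splitlines() if ln.strip()]
    last_line = lines[-1] if lines else "(empty log)"
    status = "passed" if attempt.passed else "failed"
    return f"{attempt.approach} -> {status}: {last_line}"[:max_chars]

def build_memory(attempts: list[Attempt]) -> str:
    """Join per-attempt summaries into the block placed in the next prompt."""
    return "\n".join(
        f"Attempt {i}: {summarize(a)}" for i, a in enumerate(attempts, 1)
    )

# Invented example logs: long, repetitive output ending in the one useful line.
attempts = [
    Attempt("Patch parser to skip empty tokens",
            "collecting tests\n" * 500 + "AssertionError: expected 3 got 2", False),
    Attempt("Fix off-by-one in tokenizer loop",
            "collecting tests\n" * 500 + "4 passed", True),
]
memory = build_memory(attempts)
raw = "\n".join(a.raw_log for a in attempts)
print(memory)
print(len(memory), "chars summarized vs", len(raw), "chars raw")
```

The summary keeps what was tried and how it ended while dropping the repetitive log body. The same compression is also where the risk lies: anything the summarizer discards is unavailable to later attempts.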

The vocabulary for this systems-centric view is now coalescing around a new term: 'context engineering.' Multiple sources have identified a shift away from 'prompt engineering' as the dominant frame, with AI teams and practitioners increasingly preferring 'context engineering' to describe the challenge of curating what information a model receives and when.[6][7] A Reddit thread in the MachineLearning community has similarly surfaced the AI infrastructure versus prompt engineering debate.[8] This framing is broader than harness engineering per se — it encompasses retrieval strategy, memory design, and context window management — but it points in the same direction: the real engineering challenge is building the information environment around the model, not crafting the text inputs to it.
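One way to picture 'curating what information a model receives' is as a selection problem under a context-window budget. The sketch below is a minimal greedy version under stated assumptions: `ContextItem`, the priority scores, and the whitespace token estimate are all hypothetical, and none of the cited sources prescribes this particular strategy.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    source: str      # e.g. "memory", "retrieval", "tool_output"
    text: str
    priority: float  # higher means more relevant to the current step

def estimate_tokens(text: str) -> int:
    """Crude whitespace-based estimate; a real harness would use the model's tokenizer."""
    return len(text.split())

def assemble_context(items, budget):
    """Greedily keep the highest-priority items that fit within the token budget."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda it: it.priority, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen

# Invented candidates competing for a 15-token budget.
items = [
    ContextItem("tool_output", "full stack trace " * 20, 0.4),
    ContextItem("memory", "Previous fix failed on empty input", 0.9),
    ContextItem("retrieval", "tokenizer.py defines split_tokens and join_tokens", 0.7),
]
selected = assemble_context(items, budget=15)
print([item.source for item in selected])
```

The prompt text never appears in this code. The decisions that matter are which items are candidates, how they are scored, and what gets dropped when the budget runs out, which is the sense in which the work shifts from input crafting to information architecture.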

The institutional infrastructure around these ideas is catching up. SWE-bench, the widely used coding agent benchmark, maintains formal documentation of its own 'harness' concept, treating evaluation infrastructure as a named architectural component.[9] Warp's 75.8% score on SWE-bench Verified[10] illustrates the competitive stakes: companies are now racing to push agent performance on standardized benchmarks, and harness design is a key variable in that race. Meta's ARE (Meta Agents Research Environments) research[11] extends the infrastructure argument to the evaluation layer itself, applying systems-level thinking not just to how agents operate but to how they are measured. Practitioners formalizing this sub-discipline include Martin Fowler, who has published a practitioner-oriented guide on harness engineering for coding agent users,[12] and LangChain, which has published anatomy-of-a-harness documentation to establish shared vocabulary.[13]

Timeline

  • 2026-05-12: Hacker News discussion surfaces emotional and human-factors costs of AI-assisted coding [19]
  • 2026-05-15: Industry analysis frames Salesforce's agentic AI bets as potentially its most valuable business unit [20]
  • 2026-05-18: LangChain's strategic direction on agent harnesses draws practitioner endorsement [15]
  • 2026-05-19: Practitioner commentary notes widespread model benchmarking without equivalent harness or systems evaluation [17]
  • 2026-05-23: Rohan Paul synthesizes research arguing the agent harness is the primary determinant of agent performance, not the model or prompts [1]
  • 2026-05-23: Rohan Paul highlights Meta paper showing coding agents improve substantially when given summarized rather than raw prior-attempt logs [2]
  • 2026-05: Warp reports 75.8% on SWE-bench Verified, illustrating competitive stakes of harness-level performance optimization [10]
  • 2026-05: Multiple independent publications shift vocabulary from 'prompt engineering' to 'context engineering' as the preferred frame for agent information architecture [6][7][8]
  • 2026-05: Multiple 2026 state-of-AI-agent-memory surveys appear, broadening the memory architecture conversation into a recognized field-wide concern [3][4][5]
  • 2026-05: Meta publishes ARE paper on scaling agent environments and evaluations, applying infrastructure-level thinking to the evaluation problem [11]

Perspectives

Rohan Paul (@rohanpaul_ai)

Strong advocate for a systems-centric view of agent performance; argues harness design and memory architecture are the primary levers, with model and prompt choices being secondary. Frames this as a counterintuitive finding that should redirect engineering focus.

Evolution: Consistent across both posts; the two arguments are complementary — one systemic (harness), one specific (memory summarization).

Meta Research

Empirical: coding agents benefit significantly from summarized past-attempt memory over raw logs. The ARE paper extends this to infrastructure-level thinking about how agent evaluation environments should be scaled, not just how agents should be built.

Evolution: The ARE paper[11] broadens Meta's contribution from a single memory-summarization finding to a broader research program on agent evaluation infrastructure.

Martin Fowler

Practitioner-oriented: harness engineering is a distinct skill set that coding-agent users should develop, separate from prompt engineering or model evaluation.

Evolution: Consistent with his broader orientation toward software engineering discipline; this extends it to the AI agent domain.

LangChain

Infrastructure provider actively defining and promoting the 'agent harness' framing, publishing anatomical documentation to establish vocabulary and architecture patterns.

Evolution: Consistent with LangChain's commercial interest in harness-layer tooling; the direction is endorsed by practitioners in the thread.

Neo4j / DataHub / context engineering advocates

Argue that 'context engineering' — the discipline of designing what information a model receives and when — is the correct successor frame to 'prompt engineering,' representing a shift from input crafting to information architecture design.

Evolution: New voice in this pass; represents a vocabulary crystallization of arguments that were previously implicit in the harness-engineering framing.

Mem0 / AI agent memory research community

Treats AI agent memory as a multi-faceted, field-level engineering problem with distinct architectures, benchmarks, and tradeoffs — not merely a context-window management issue.

Evolution: Consistent with prior Mem0 contributions; expanded by the emergence of multiple independent 2026 state-of-memory surveys signaling field maturation.

Ethan (@DuoEthan)

Critical of the current state of agent evaluation: observes that model benchmarking is ubiquitous while systems-level evaluation is absent, implicitly supporting the idea that the field is measuring the wrong things.

Evolution: Single data point; no evolution to track.

Warp

Competitive practitioner: publishes SWE-bench Verified scores (75.8%) as a performance claim, implicitly treating harness optimization as a key differentiator in the coding agent market.

Evolution: New voice in this pass; adds a concrete commercial data point to what has been a largely research-and-practitioner-writing discussion.

Tensions

  • Prompt-centric vs. systems-centric improvement: The dominant industry practice of tuning prompts and selecting models is implicitly challenged by the claim that the harness is the real determinant of agent behavior.[1] No named voice actively defends the prompt-centric view in this thread, but the tension is structural, since much existing tooling and evaluation infrastructure is built around it.[17]
  • 'Prompt engineering' vs. 'context engineering' as vocabulary: Neo4j, DataHub, and a Reddit ML thread argue that 'context engineering' better describes the actual challenge (curating the information environment, not crafting text), while the legacy 'prompt engineering' frame continues to dominate industry job titles and tooling.[6][7][8] This is partly a vocabulary debate but also a substantive disagreement about what the problem statement is.
  • More attempts vs. better memory: A naive approach to improving coding agents is to run more attempts; the Meta finding argues that the mechanism for retaining and reusing past attempts matters as much as attempt count.[2] This creates a tension between compute-scaling intuitions and information-architecture approaches to agent improvement.
  • Benchmarks measure models, not systems: Existing agent benchmarks (e.g., SWE-bench) are typically reported against model versions and vendor systems, not harness configurations, which makes it difficult to isolate harness contributions to performance. Warp's 75.8% score[10] is a system-level result, not a harness-ablation study. The practitioner commentary[17] and harness-centric research[1][18] both imply this is a measurement gap, but no authoritative benchmark framework for harness comparison yet exists.[9]

Sources

  [1] This paper shows that agent performance depends less on prompts alone and more on the harness around them. (Rohan Paul Twitter, 2026-05-23)
  [2] Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw logs… (Rohan Paul Twitter, 2026-05-23)
  [3] State of AI Agent Memory 2026: Benchmarks, Architectures ... - Mem0 (reactive:agent-performance-architecture)
  [4] The Brains Behind the Bots: A Comprehensive Guide to AI Agent ... (reactive:agent-performance-architecture)
  [5] The State of AI Agent Memory in 2026: What the Research Actually ... (reactive:agent-performance-architecture)
  [6] Why AI Teams Are Moving From Prompt Engineering to Context ... (reactive:agent-performance-architecture)
  [7] Context Engineering vs Prompt Engineering | DataHub (reactive:agent-performance-architecture)
  [8] [D] AI infra vs prompt engineering : r/MachineLearning - Reddit (reactive:agent-performance-architecture)
  [9] The Harness - SWE-bench (reactive:agent-performance-architecture)
  [10] Warp: Warp scores 75.8% on SWE-bench Verified! (reactive:agent-performance-architecture)
  [11] ARE: scaling up agent environments and evaluations - Meta AI (reactive:meta-surveillance-layoffs)
  [12] Harness engineering for coding agent users - Martin Fowler (reactive:agent-performance-architecture)
  [13] The Anatomy of an Agent Harness - LangChain (reactive:agent-performance-architecture)
  [14] Meta paper reveals improved coding agents through summary reuse (reactive:agent-performance-architecture)
  [15] @LangChain_OSS @hwchase17 Really strong direction. (reactive:agent-performance-architecture, 2026-05-18)
  [16] [PDF] Mem0: Building Production-Ready AI Agents with - arXiv (reactive:agent-performance-architecture)
  [17] Everyone is benchmarking models. (reactive:agent-performance-architecture, 2026-05-19)
  [18] Architectural Design Decisions in AI Agent Harnesses - arXiv (reactive:ai-agent-architecture-limits)
  [19] The Emotional Cost of AI-Assisted Coding (reactive:agent-performance-architecture, 2026-05-12)
  [20] There's a $50B company hiding inside Salesforce (reactive:coding-agent-industry-pivot, 2026-05-15)