Better Models: Worse Tools

Simon Willison · 2026-07-04

Simon Willison relays Armin Ronacher's finding that Anthropic's newer flagship models (Opus 4.8, Sonnet 5) are more likely than older models to invent extra fields when calling custom tool schemas. The suspected cause is RL training that over-fitted them to Claude Code's own built-in edit tool.

Appears in

SemiAnalysis Demystifies Agentic Coding Harness Architecture: Model vs. Orchestration

Extraction

Topics: llm-tool-use, model-regression, coding-agents, reinforcement-learning, anthropic

Claims

Newer Anthropic models Opus 4.8 and Sonnet 5 are worse at conforming to custom edit tool schemas than their older predecessors, inventing extra fields not present in the schema.
The regression is theorized to result from RL training that specifically optimised these models for Claude Code's built-in search-and-replace edit tool, causing interference with differently shaped tool schemas.
Third-party coding harnesses using custom edit tools may need to implement multiple tool variants to maintain performance across different underlying models.
OpenAI's Codex uses a different edit mechanism (apply_patch) and has been explicitly trained on it, illustrating how tool-specific training creates model-to-harness coupling.
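
To make the first claim concrete, here is a minimal sketch of how a harness might detect a tool call that carries fields the schema never declared. The schema, its field names (path, old_text, new_text), the made_up_flag argument, and the check_tool_call helper are all hypothetical illustrations. They are not taken from Pi, Claude Code, or any other real harness, and the source does not say which fields the models actually invent.

```python
from typing import Any

# Hypothetical schema for a custom edit tool. The field names are
# illustrative assumptions, not the schema of any real harness.
EDIT_TOOL_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "old_text": {"type": "string"},
        "new_text": {"type": "string"},
    },
    "required": ["path", "old_text", "new_text"],
    "additionalProperties": False,
}


def check_tool_call(
    schema: dict[str, Any], arguments: dict[str, Any]
) -> dict[str, list[str]]:
    """Report fields the model invented and required fields it omitted.

    Only top-level keys are compared; nested objects and value types
    are deliberately out of scope for this sketch.
    """
    allowed = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    supplied = set(arguments)
    return {
        "unexpected": sorted(supplied - allowed),
        "missing": sorted(required - supplied),
    }


# A call that matches the schema except for one invented field.
report = check_tool_call(
    EDIT_TOOL_SCHEMA,
    {"path": "app.py", "old_text": "x = 1", "new_text": "x = 2", "made_up_flag": True},
)
print(report)
```

A harness could log the "unexpected" list per model to measure how often each model drifts from the schema, or return it to the model as an error so the call can be retried.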

Key quotes

What surprised me is that this is getting worse with newer Anthropic models as both Opus 4.8 and Sonnet 5 show it but none of the older models. In other words, the SOTA models of the family are worse at this specific tool schema than their older siblings.

Does this mean third-party coding harnesses like Pi should implement multiple edit tools just so they can use the one with the best performance for the underlying model the user has selected?
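
One way a harness could act on that question is sketched below: keep several edit tool variants and pick one based on the selected model. This is only an illustration under stated assumptions. The EditToolVariant type, the prefix table, and the model identifier prefixes are hypothetical. The only facts taken from the source are that Claude Code uses a search-and-replace edit tool and that Codex uses apply_patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditToolVariant:
    """A named edit tool shape that the harness knows how to execute."""
    name: str
    description: str


SEARCH_REPLACE = EditToolVariant(
    "search_replace", "Replace an exact text span in a file with new text."
)
APPLY_PATCH = EditToolVariant(
    "apply_patch", "Apply a patch document to one or more files."
)

# Fallback when the model is unknown to the harness (an arbitrary choice here).
DEFAULT_VARIANT = SEARCH_REPLACE

# Hypothetical mapping from model identifier prefix to preferred variant.
# Real model identifiers and the best pairing would need to be measured.
VARIANT_BY_MODEL_PREFIX: dict[str, EditToolVariant] = {
    "anthropic/": SEARCH_REPLACE,
    "openai/": APPLY_PATCH,
}


def pick_edit_tool(model_id: str) -> EditToolVariant:
    """Return the edit tool variant registered for the model's prefix."""
    normalized = model_id.lower()
    for prefix, variant in VARIANT_BY_MODEL_PREFIX.items():
        if normalized.startswith(prefix):
            return variant
    return DEFAULT_VARIANT


print(pick_edit_tool("openai/example-model").name)
print(pick_edit_tool("anthropic/example-model").name)
print(pick_edit_tool("local/example-model").name)
```

The cost of this design is that every variant has to be implemented, tested, and kept in sync, which is the maintenance burden the quote is asking about.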