Incredible! This is just the benchmark we needed.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-09

Cognition launches FrontierCode, a new coding benchmark that scores Claude Opus 4.8 at 13.4%, GPT-5.5 at 6.3%, and Gemini 3.1 Pro at 4.7%, exposing major capability gaps in frontier models on hard coding tasks.

Appears in

Anthropic Launches Claude Fable 5 and Mythos 5: Agentic Capability Leap and Tiered Access

Extraction

Topics: coding-benchmarks, ai-evaluation, large-language-models, frontier-ai

Claims

Cognition has introduced FrontierCode, a coding benchmark specifically designed to stress-test frontier AI models.
Claude Opus 4.8 scores only 13.4% on FrontierCode, the highest score among the models mentioned.
GPT-5.5 scores 6.3% and Gemini 3.1 Pro scores 4.7% on FrontierCode.
Current frontier models perform poorly on FrontierCode, suggesting substantial headroom remains in coding capability.

Key quotes

Claude Opus 4.8, achieves a score of only 13.4%. Other models score even lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less.

Incredible! This is just the benchmark we needed.