Incredible! This is just the benchmark we needed.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-09
Cognition launches FrontierCode, a new coding benchmark that scores Claude Opus 4.8 at 13.4%, GPT-5.5 at 6.3%, and Gemini 3.1 Pro at 4.7%, exposing major capability gaps in frontier models on hard coding tasks.
Extraction
Topics: coding-benchmarks, ai-evaluation, large-language-models, frontier-ai
Claims
- Cognition has introduced FrontierCode, a coding benchmark specifically designed to stress-test frontier AI models.
- Claude Opus 4.8 achieves only 13.4% on FrontierCode, the highest score among models mentioned.
- GPT-5.5 scores 6.3% and Gemini 3.1 Pro scores 4.7% on FrontierCode.
- Current frontier models perform poorly on FrontierCode, suggesting substantial headroom remains in coding capability.
Key quotes
Claude Opus 4.8, achieves a score of only 13.4%. Other models score even lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less.
Incredible! This is just the benchmark we needed.