ByteDance Seed delivered again.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-07-03

ByteDance Seed released EdgeBench, a benchmark of 134 real-world tasks running 12–72 hours, finding that AI agent performance follows a log-sigmoid learning curve over environment interaction time, with top models doubling their learning speed every three months.

Appears in

Rapid AI Benchmark Improvement: Small Models and New Entrants Closing Capability Gaps

Extraction

Topics: ai-benchmarks, ai-agents, long-horizon-tasks, agent-learning, evaluation-methodology

Claims

EdgeBench evaluates AI agents on 134 real-world tasks requiring 12–72 hours of interaction, shifting the measurement focus from static knowledge retrieval to in-context experiential learning.
Across 38,000 total agent-hours of runs, agent performance averaged over tasks, as a function of environment interaction time, is closely fit by a log-sigmoid curve.
Top frontier models are roughly doubling their 2-hour learning speed every three months, suggesting a scaling law for environment-based learning.
Most existing benchmarks are too short to measure experiential improvement and instead test one-shot reasoning, memory, or coding skill.
Each EdgeBench task includes a local workspace for rapid trial-and-error plus a hidden judge providing stronger feedback on submitted work, approximating expert evaluation conditions.
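
The two quantitative claims above (the log-sigmoid learning curve and the three-month doubling of learning speed) can be made concrete with a small sketch. The source does not give the fitted equation or its parameters, so this assumes "log-sigmoid" means a logistic function of the logarithm of interaction time. The function names and the values for ceiling, midpoint_hours, and steepness are hypothetical placeholders chosen for illustration, not EdgeBench results.

```python
import math


def log_sigmoid_score(hours, ceiling=1.0, midpoint_hours=8.0, steepness=1.5):
    """Assumed curve shape: a logistic function of log(interaction time).

    ceiling, midpoint_hours, and steepness are illustrative placeholders,
    not values reported by EdgeBench.
    """
    z = steepness * (math.log(hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))


def projected_speed(baseline_speed, months_elapsed, doubling_months=3.0):
    """Exponential growth implied by 'doubling every three months'."""
    return baseline_speed * 2.0 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Slow start, faster middle, then leveling off as hours grow.
    for h in (0.5, 2, 8, 24, 72):
        print(f"{h:>5} h -> score {log_sigmoid_score(h):.3f}")

    # If the doubling trend held for a year, speed would grow 2**4 = 16x.
    print(projected_speed(1.0, 12))
```

Under these assumptions the score reaches half of its ceiling at midpoint_hours, and a three-month doubling time compounds to a 16x increase over twelve months. Whether the trend actually persists that long is not something the source establishes.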

Key quotes

The big deal is that it shifts AI evaluation from 'what does the model already know?' to 'can the model learn while doing real work?'

When scores are averaged across many tasks, learning follows a very clean log-sigmoid curve, meaning progress is slow, then faster, then starts to level off.

Newer agents seem to learn from environments much faster, with the top models roughly doubling their 2-hour learning speed every 3 months.