Most video models look better than they understand, and video quality is only the easiest thing to notice.

Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-02

LongCat releases WBench, a benchmark that evaluates video world models on control, multi-turn memory, and instruction-following rather than visual quality, exposing a systematic gap between how good models look and how well they understand the world.

Appears in

World Models: Theory, Infrastructure, and Evaluation Converge

Extraction

Topics: video-models, benchmarks, world-models, ai-evaluation

Claims

Most video generation models score well on visual quality while lacking genuine world understanding.
Visual quality is the easiest and most commonly measured metric for video models, making it a misleading proxy for capability.
LongCat has released WBench, a new benchmark specifically designed to stress-test video world models.
WBench evaluates control, multi-turn memory, and instruction-following rather than perceptual quality alone.

Key quotes

Most video models look better than they understand and Video quality is only the easiest thing to notice.

LongCat just released WBench, it turned video world model testing from a beauty contest into a stress test for control, multi-turn memory, instruction-following