Great piece from Dr. Fei-Fei Li (@drfeifei)

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-04

Dr. Fei-Fei Li argues that because the physical world is not made of words, AI models capable of simulation—not just language—are needed to ground understanding in pixels and action predictions for embodied agents.

Appears in

World Models: Theory, Infrastructure, and Evaluation Converge

Extraction

Topics: embodied-ai, world-models, llm-limitations, simulation

Claims

LLMs learn patterns in text, which constrains their understanding to linguistic representations of the world.
A model that masters simulation can project understanding into pixels for human consumption and into action predictions for embodied agents.
The physical world is not made of words, pointing to a fundamental ceiling for language-only AI.

Key quotes

The world is not made of words.... A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents.

LLMs learn patterns in text, so they can explain a [world they haven't directly perceived].