Why Robots Are Ditching Language-First Models for World Action Models
"Our generation was born too late to explore the Earth and too early to explore the stars, but we are born just in time to solve robotics."
That quote, from Jim Fan at Sequoia's AI Ascent 2026, captures the current moment in robotics perfectly. Fan leads Nvidia's embodied autonomous research group—essentially Nvidia Robotics—and he delivered a talk that laid out a bold thesis: robotics is entering its endgame, and the playbook is already written because it mirrors the one large language models (LLMs) followed. Here's a deep dive into his 20-minute presentation and what it means for the future of embodied AI.
The Great Parallel: Copying the LLM Playbook
Fan's core argument is what he calls "the great parallel"—robotics will walk the exact same path as LLMs. With characteristic frankness, he stated, "As any self-respecting scientist would do, I copy homework and I give it a new name." The LLM trajectory unfolded in four stages over six years:

- Pre-training (GPT-3) — Learning the shape of language through next-token prediction.
- Supervised fine-tuning (InstructGPT) — Aligning the model to perform useful tasks.
- Reasoning (o1) — Using reinforcement learning to surpass imitation learning.
- Auto research — Accelerating the improvement loop beyond human capability.
For robotics, the parallel is straightforward: instead of predicting the next token in a string, predict the next physical world state. Then align the model through action fine-tuning, narrowing its learned simulation to the slice that matters for real robots. Finally, let reinforcement learning carry the last mile. As Fan explains, it is the same recipe applied to a different domain; the toy sketch below makes the analogy between the two objectives concrete.
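Here is a minimal sketch of those two objectives in PyTorch. Both models and all the sizes are invented for illustration only; they stand in for an LLM-style next-token predictor and a robotics-style next-state predictor, not for any system Fan described.

```python
import torch
import torch.nn as nn


class NextTokenModel(nn.Module):
    """LLM-style objective: predict the next token given the tokens so far."""

    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.backbone = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, time) int64
        hidden, _ = self.backbone(self.embed(tokens))
        return self.head(hidden)            # (batch, time, vocab) logits


class NextStateModel(nn.Module):
    """Robotics-style objective: predict the next world state given state and action."""

    def __init__(self, state_dim=64, action_dim=8, dim=128):
        super().__init__()
        self.backbone = nn.GRU(state_dim + action_dim, dim, batch_first=True)
        self.head = nn.Linear(dim, state_dim)

    def forward(self, states, actions):     # (batch, time, state_dim), (batch, time, action_dim)
        hidden, _ = self.backbone(torch.cat([states, actions], dim=-1))
        return self.head(hidden)            # (batch, time, state_dim) predicted next states


# Same recipe, different target: classify the next token vs. regress the next state.
tokens = torch.randint(0, 1000, (2, 16))
token_logits = NextTokenModel()(tokens)
token_loss = nn.functional.cross_entropy(
    token_logits[:, :-1].reshape(-1, 1000), tokens[:, 1:].reshape(-1))

states, actions = torch.randn(2, 16, 64), torch.randn(2, 16, 8)
state_preds = NextStateModel()(states, actions)
state_loss = nn.functional.mse_loss(state_preds[:, :-1], states[:, 1:])
```

The only real difference between the two losses is the target: a categorical distribution over tokens in one case, a continuous world state in the other.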
Why Vision-Language-Action Models Fall Short
For the past three years, vision-language-action models (VLAs) have dominated robotics. Models like pi0 and Nvidia's own GR00T fall into this category. The approach: take a vision-language model and graft an action head on top. But Fan offers a pointed critique.
He argues these models are really "LVAs," because the bulk of the parameters are devoted to language. Language is the first-class citizen, followed by vision, with action a distant third. The result: VLAs are strong at encoding knowledge and nouns, but weak at physics and verbs; the capacity is concentrated in the wrong places. Fan's example is the original VLA paper, which showed a robot moving a Coke can to a picture of Taylor Swift. That is impressive generalization to an unseen concept, but not the kind of pre-training ability robotics actually needs. Robots need to understand physical dynamics, not just semantic labels. The toy parameter count below shows what that imbalance looks like in practice.
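As a rough illustration, here is a toy model in the VLA mold: a language trunk, a small vision projection, and a thin action head. Every size here is invented for the sketch; it does not describe pi0, GR00T, or any real architecture, and it assumes image features have already been extracted by some upstream encoder.

```python
import torch
import torch.nn as nn


class ToyVLA(nn.Module):
    """Toy vision-language-action model: the language trunk dwarfs everything else."""

    def __init__(self, vocab=32000, lang_dim=1024, feat_dim=768, action_dim=7, depth=24):
        super().__init__()
        # Language trunk: a large embedding table plus a deep stack of layers.
        self.lang_embed = nn.Embedding(vocab, lang_dim)
        self.lang_trunk = nn.Sequential(
            *[nn.Linear(lang_dim, lang_dim) for _ in range(depth)])
        # Vision side: a small projection of pre-extracted image features.
        self.vision_proj = nn.Linear(feat_dim, lang_dim)
        # Action head: a thin readout grafted on top.
        self.action_head = nn.Linear(lang_dim, action_dim)

    def forward(self, image_feats, tokens):
        text = self.lang_trunk(self.lang_embed(tokens)).mean(dim=1)
        return self.action_head(text + self.vision_proj(image_feats))


def n_params(module):
    return sum(p.numel() for p in module.parameters())


model = ToyVLA()
print("language params:", n_params(model.lang_embed) + n_params(model.lang_trunk))
print("vision params:  ", n_params(model.vision_proj))
print("action params:  ", n_params(model.action_head))
```

Even in this toy setting the language trunk carries tens of millions of parameters while the action head carries a few thousand, which is exactly the head-heavy split Fan is criticizing.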

As outlined in The Great Parallel, the solution lies in shifting the focus from language prediction to world prediction.
World Action Models: Learning Physics from Video
The replacement for VLAs comes from an unexpected source: AI-generated video. Fan acknowledges the irony: nobody takes AI video slop of cats playing the banjo seriously. But something important is happening under the hood. Video models like Veo 3 are learning to simulate physics internally. They pick up gravity, buoyancy, lighting, reflection, and refraction on their own, without being explicitly programmed to. As Fan puts it: "Physics emerge by predicting the next blob of pixels at scale."
Even visual planning emerges: Veo 3 can solve mazes by running simulation forward in its latent space. This is exactly what robotics needs: a model that understands cause and effect in the physical world, not just static knowledge. The next step is to transform these video world models into World Action Models by fine-tuning them on action sequences, then applying reinforcement learning to internalize motor control.
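To make that pipeline concrete, here is a minimal sketch assuming a pre-trained video model is available as a latent dynamics backbone. Every class and function here (LatentDynamics, ActionAdapter, plan) is hypothetical; the RL stage is omitted, and the planner is a simple random-shooting search over rollouts rather than anything Fan specified.

```python
import torch
import torch.nn as nn


class LatentDynamics(nn.Module):
    """Stand-in for a pre-trained video world model that predicts the next latent frame."""

    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.GELU(), nn.Linear(256, latent_dim))

    def forward(self, z):
        return self.net(z)


class ActionAdapter(nn.Module):
    """Action fine-tuning: condition the world model's next-frame prediction on an action."""

    def __init__(self, latent_dim=128, action_dim=8):
        super().__init__()
        self.world = LatentDynamics(latent_dim)     # in practice: frozen pre-trained weights
        self.cond = nn.Linear(action_dim, latent_dim)

    def forward(self, z, a):
        return self.world(z + self.cond(a))         # next latent given current latent + action


def plan(model, z0, goal, n_candidates=64, horizon=5, action_dim=8):
    """Latent-space planning: sample candidate action sequences, roll the model
    forward, and keep the sequence whose final latent lands closest to the goal."""
    actions = torch.randn(n_candidates, horizon, action_dim)
    z = z0.expand(n_candidates, -1)
    with torch.no_grad():
        for t in range(horizon):
            z = model(z, actions[:, t])
        best = torch.argmin(((z - goal) ** 2).sum(dim=-1))
    return actions[best]                            # (horizon, action_dim)


model = ActionAdapter()
z0, goal = torch.randn(1, 128), torch.randn(1, 128)
best_actions = plan(model, z0, goal)
```

In the approach Fan describes, the action conditioning would be trained on real robot trajectories and the brute-force search would give way to a reinforcement-learned policy; the sketch only shows how action conditioning and forward rollouts in latent space fit together.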
Fan's vision is clear: the endgame for robotics is a unified model that learns world dynamics from video and then adapts to action through the same pipeline that made LLMs successful. The era of language-first robotics is ending; the era of physics-first, action-centric models has begun.
Conclusion
Jim Fan's talk at AI Ascent 2026 offers a roadmap that feels both inevitable and exciting. By copying the proven LLM playbook—pre-training on world dynamics, supervised fine-tuning on actions, and RL-powered reasoning—robotics can finally achieve the generalization and capability that has long eluded it. The death of VLAs isn't a loss; it's a necessary evolution toward models that truly understand and act in the physical world.