★ SERIES 101

Physical AI 101: A Beginner’s Field Guide to World Models

A beginner-friendly field guide to AI world models: what they are, how they evolved, who is building them, where funding is going, and why they matter for robotics, autonomous vehicles, gaming, 3D worlds, and physical AI.

1P · JUDY DUONG·JUNE 29, 2026·14 MIN READ

World models sound like sci-fi nonsense, but the idea is simple.

A chatbot predicts the next word. A video model predicts the next frame. A world model tries to predict what happens next after something acts.

That is why people care. If AI is going to control robots, cars, games, factories, or virtual worlds, it needs more than language. It needs a rough sense of reality.

Or at least, that is the dream. Reality, as usual, is being dramatic.

Part 1: What Is a World Model? From First Principles

The core idea: an internal simulator of reality

Imagine knocking a glass of water off a table. Before it happens, you can already picture the glass tipping, falling, breaking, and spilling water. You are not physically running a full physics engine in your brain, but you have a compressed internal model of how the world works.

A world model is the machine version of that idea. It builds an internal representation of an environment and uses it to predict how that environment will change, especially in response to actions.

The important phrase is action-conditioned simulation. The model does not only predict what comes next. It predicts what comes next if an action is taken.
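A minimal sketch of that distinction, assuming a hypothetical one-dimensional toy state (a position and a velocity) and hand-written dynamics rather than a learned model. The names `State`, `predict_passive`, and `predict_with_action` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    position: float
    velocity: float


def predict_passive(state: State, dt: float = 0.1) -> State:
    """Passive prediction: the scene simply continues on its current course."""
    return State(state.position + state.velocity * dt, state.velocity)


def predict_with_action(state: State, push: float, dt: float = 0.1) -> State:
    """Action-conditioned prediction: the next state depends on the chosen push."""
    velocity = state.velocity + push * dt
    return State(state.position + velocity * dt, velocity)


start = State(position=0.0, velocity=1.0)
for push in (-5.0, 0.0, 5.0):
    # Same starting state, three different actions, three different futures.
    print(push, predict_with_action(start, push))
```

The passive function has one future per state. The action-conditioned one has one future per state and action pair, which is what lets an agent compare options before committing.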

What makes it different from an LLM

A large language model predicts the next token. That makes it extremely good at text, summarising, writing, explaining, and following instructions. But language is a compressed representation of reality. The physical world is continuous, noisy, spatial, and full of consequences.

An LLM can explain what may happen if a glass falls. But it does not naturally simulate the trajectory, collision, spill pattern, or where the pieces land. It can reason about physics linguistically, but it is not built as a physical simulator.

This is why world models matter for robotics and autonomous vehicles. A robot does not only need to understand the sentence “pick up the mug.” It needs to know how the mug sits in space, what happens if it grips too loosely, and which movement is safest.

What makes it different from a video generator

This is the more subtle distinction. A video model predicts frames. It can generate realistic-looking motion, lighting, and objects. But if the user cannot act inside the scene, the model is still closer to a video than to a world.

A world model must respond to actions. If you turn left, the world should reveal what is on the left. If a car brakes, the simulated scene should change accordingly. If a robot pushes an object, the model should predict the object’s response.

Model type	What it predicts	Example
Language model	Next token	“The car turned...”
Video model	Next frame	A generated clip continues
World model	Next state after an action	A robot pushes a box and predicts where it moves

A video model gives you something to watch. A world model gives you something to act inside.

The two philosophies: understanding vs. prediction

A useful fault line runs through the field.

Prediction models try to draw what comes next. Systems like Sora, Genie, Oasis, and other generative video/world systems are visually impressive because they generate future frames or scenes. This makes them easier to productise in creative tools, gaming, and interactive media.

Understanding models try to learn the underlying structure of the world. JEPA-style models and Dreamer-style systems make their predictions in a compact latent space rather than directly in pixel space. The argument is that raw pixels include too much detail that is irrelevant or unpredictable.

My practical read: prediction wins attention first, but understanding is what makes the category valuable for robotics and planning.

The World Labs taxonomy: renderers, simulators, planners

“World model” is becoming an overloaded term. A useful way to separate the field is by asking what layer the model serves.

Layer	Output	Main question	Example direction
Renderer	Pixels / views	What does the world look like?	Video and interactive visual generation
Simulator	3D state / physics	How does the world behave?	Geometry, materials, physical consequences
Planner	Action	What should the agent do next?	Robotics, VLA models, model-based RL

Renderers are judged on visual fidelity. Simulators produce structure: geometry, materials, physics, and consequences. Planners decide action.

The planner layer is the least solved because it requires the model not only to represent the world, but to choose useful interventions inside it.
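To make the planner layer concrete, here is a rough sketch of one common pattern: try candidate action sequences inside a world model, score each imagined outcome, and execute only the first action of the best sequence. The `world_model` function is a hypothetical hand-written stand-in for a learned model, and the exhaustive search only works because the toy action set is tiny.

```python
import itertools


def world_model(position: float, action: float) -> float:
    """Hypothetical stand-in for a learned model: predicts the next position."""
    return position + action


def plan_first_action(position, goal, horizon=3, actions=(-1.0, 0.0, 1.0)):
    """Score every action sequence in imagination and return the best first move."""
    best_sequence, best_cost = None, float("inf")
    for sequence in itertools.product(actions, repeat=horizon):
        simulated, cost = position, 0.0
        for action in sequence:
            simulated = world_model(simulated, action)
            cost += abs(simulated - goal)  # distance to goal at every imagined step
        if cost < best_cost:
            best_sequence, best_cost = sequence, cost
    return best_sequence[0]


print(plan_first_action(position=0.0, goal=3.0))
```

Real planners cannot enumerate every sequence, so they sample, optimise, or learn a policy instead. The structure stays the same: the quality of the plan is bounded by the quality of the model it is imagined in.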

What capabilities world models need

The valuable world models share four capabilities: predictive simulation, action-conditioning, spatial and physical intelligence, and persistent memory.

Predictive simulation means the model can roll the world forward. Action-conditioning means the world changes based on what the agent does. Spatial and physical intelligence means the model understands objects, space, motion, gravity, materials, and causality. Persistent memory means the world does not fall apart as soon as the camera turns away.

This is still hard. Interactive worlds degrade because small errors accumulate over time. A model may look convincing for a short session but lose consistency over longer horizons.
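The accumulation effect is easy to reproduce in a toy setting. The sketch below assumes a made-up scalar environment and a "learned" step function that is only slightly wrong on each step; both functions and their constants are hypothetical.

```python
def rollout(step, x0, horizon):
    """Roll a step function forward, feeding each output back in as the next input."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(step(xs[-1]))
    return xs


def true_step(x):
    return 0.99 * x + 1.0


def learned_step(x):
    # Hypothetical learned model with a small bias on every single step.
    return 0.99 * x + 1.02


true_traj = rollout(true_step, 0.0, 100)
model_traj = rollout(learned_step, 0.0, 100)
errors = [abs(a - b) for a, b in zip(true_traj, model_traj)]

for t in (1, 10, 100):
    print(t, round(errors[t], 3))
```

The one-step error is small, but because each prediction becomes the next input, the gap between the imagined and the real trajectory keeps widening as the horizon grows.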

Part 2: How to Build One — Step by Step

The canonical architecture: Ha & Schmidhuber 2018

The foundational architecture is the 2018 World Models paper by David Ha and Jürgen Schmidhuber. It split an agent into three modules: Vision, Memory, and Controller.

Component	Technical version	Job
Vision	VAE	Compresses raw images into a compact latent representation
Memory	MDN-RNN	Predicts the next latent state based on current state, action, and history
Controller	Linear policy	Chooses the action from the latent state and the memory's hidden state

The Vision module compresses raw frames into a smaller latent vector. The Memory module predicts what the next latent state may look like, conditioned on the current latent state, action, and hidden history. The Controller maps the current latent state, together with the Memory module's hidden state, into an action.
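The data flow between the three modules can be sketched in a few lines. This is a shape-only illustration under loud assumptions: the functions below are hand-written stand-ins, not a VAE or an MDN-RNN, the weights are untrained placeholders, and the `tanh` squashing of the action is a choice made here for the toy.

```python
import math


def vision_encode(frame):
    """Stand-in for the VAE encoder: squeeze a frame into a 2-number latent z."""
    half = len(frame) // 2
    return [sum(frame[:half]) / half, sum(frame[half:]) / (len(frame) - half)]


def memory_step(z, action, h):
    """Stand-in for the MDN-RNN: update the hidden state and guess the next z.
    The real module outputs a mixture distribution; this returns a single point."""
    h_next = [math.tanh(0.5 * hi + 0.5 * zi + 0.1 * action) for hi, zi in zip(h, z)]
    z_guess = list(h_next)
    return z_guess, h_next


def controller_act(z, h, weights, bias):
    """Linear policy over the concatenated [z, h]; tanh keeps the action in [-1, 1]."""
    features = z + h
    return math.tanh(sum(w * f for w, f in zip(weights, features)) + bias)


frames = [[0.0, 0.2, 0.8, 1.0], [0.1, 0.3, 0.7, 0.9], [0.2, 0.4, 0.6, 0.8]]
weights, bias = [0.5, -0.5, 0.2, -0.2], 0.0  # hypothetical, untrained
h = [0.0, 0.0]
actions = []
for frame in frames:
    z = vision_encode(frame)                     # Vision: frame -> latent
    action = controller_act(z, h, weights, bias)  # Controller: (z, h) -> action
    z_guess, h = memory_step(z, action, h)        # Memory: roll the state forward
    actions.append(action)
print(actions)
```

The controller never touches raw pixels. It only sees the compressed latent and the memory state, which is why it can be kept very small.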

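The training procedure is: collect rollouts from the real environment, train the vision model on the recorded frames, train the memory model on the encoded sequences, then optimise the controller, either in the real environment or inside the learned one. The paper optimised the controller with CMA-ES, an evolution strategy, which is practical because the controller has so few parameters.

The famous result, shown in the paper's VizDoom experiment, was that the agent could train inside its own imagined version of the environment and transfer the learned policy back to the real environment.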

The key insight is still the core of the field:

real-world interaction is expensive, but imagination is cheaper.
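The train-in-imagination loop can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the environment is a one-dimensional point, the "world model" is a single fitted coefficient, and a plain grid search stands in for the evolution strategy used in the paper.

```python
import random


def real_step(x, a):
    """The 'real' environment: actions are only 80% effective (unknown to the agent)."""
    return x + 0.8 * a


# 1. Collect rollouts in the real environment with random actions.
rng = random.Random(0)
data, x = [], 1.0
for _ in range(200):
    a = rng.uniform(-1.0, 1.0)
    x_next = real_step(x, a)
    data.append((x, a, x_next))
    x = x_next

# 2. Fit the learned model x' = x + k * a by least squares on k.
k = sum(a * (xn - x0) for x0, a, xn in data) / sum(a * a for _, a, _ in data)


def imagined_step(x, a):
    return x + k * a


def rollout_cost(step, w, x0=1.0, horizon=20):
    """Cost of the linear controller a = -w * x when driving x toward zero."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        x = step(x, -w * x)
        cost += x * x
    return cost


# 3. Optimise the controller purely inside the learned model.
candidates = [i / 20 for i in range(41)]  # w in [0, 2]
best_w = min(candidates, key=lambda w: rollout_cost(imagined_step, w))

# 4. Transfer: evaluate the imagined-trained controller in the real environment.
print(best_w, rollout_cost(real_step, best_w), rollout_cost(real_step, 0.0))
```

Only step 1 touches the real environment. Every candidate controller in step 3 is evaluated for free inside the model, which is the economic argument for world models in a nutshell. The toy transfers cleanly because the model is almost perfect; with a worse model, the controller can learn to exploit the model's mistakes.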

The modern lineage: how each generation improved

Figure: The Technology Lineage (level map)

Each generation did not fully replace the one before it. It patched the weakest joint: error drift, then real-time speed, then pixel fidelity, then geometry, then grounding in action.

Level	Technique	What it added
LV01	VAE + RNN	2018 baseline
LV02	RSSM	Fixes drift
LV03	Transformers	Live and scalable
LV04	Diffusion	Sharper frames
LV05	Gaussian Splats	Editable 3D geometry
LV06	JEPA	Predicts latent representations
LV07	VLA	Robot action bridge

Before going into each branch, this is the easiest way to see the evolution.

This lineage matters because world models did not evolve in one straight line. Each technical generation solved a different weakness.

The original VAE + RNN setup proved the concept: compress the world, predict the next state, and train a controller inside the learned environment. RSSM improved the memory problem by reducing drift and handling uncertainty better. Transformers made longer, more scalable sequence modelling possible. Diffusion improved visual fidelity. Gaussian splats made generated worlds more editable and spatially useful. JEPA shifted the focus from predicting pixels to predicting meaningful latent representations. VLA models then connected world understanding to robot action.

That is why the category now spans video, 3D, robotics, simulation, and planning.

Dreamer and RSSM

The Dreamer series improved the memory architecture through Recurrent State-Space Models. These split the latent state into deterministic memory and stochastic uncertainty.

Dreamer agents learn behaviours by imagining trajectories inside the model, reducing the need for real-world interactions. This matters most in robotics and autonomous vehicles, where real data collection is slow, expensive, and sometimes unsafe.
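A rough sketch of the deterministic and stochastic split, assuming scalar states and fixed, hand-picked coefficients in place of learned networks. The function names and constants are hypothetical and are not taken from the Dreamer codebase.

```python
import math
import random


def rssm_step(h, z, action, rng):
    """One imagined step: a deterministic memory update, then a sampled latent."""
    # Deterministic path (stand-in for a recurrent cell): carries history forward.
    h_next = math.tanh(0.9 * h + 0.5 * z + 0.3 * action)
    # Stochastic path: a distribution over the next latent, conditioned on h_next.
    mean = 0.8 * h_next
    std = 0.1 + 0.1 * abs(h_next)
    z_next = rng.gauss(mean, std)
    return h_next, z_next


def imagine(h, z, actions, seed):
    """Roll the model forward over a sequence of actions without touching an environment."""
    rng = random.Random(seed)
    trajectory = []
    for action in actions:
        h, z = rssm_step(h, z, action, rng)
        trajectory.append((h, z))
    return trajectory


plan = [1.0, 1.0, -1.0]
for seed in (1, 2):
    print(seed, imagine(0.0, 0.0, plan, seed))
```

Running the same plan with different seeds gives different imagined futures. That spread is the point of the stochastic part: the model represents what it is unsure about instead of committing to one blurry average.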

Genie and autoregressive transformers

Google DeepMind’s Genie line treats world generation more like language generation. It uses video tokenisation, latent actions, and autoregressive prediction to create interactive environments.

Genie 3 is important because it points toward real-time navigable worlds, though consistency over longer sessions remains limited.

Oasis and real-time generative worlds

Decart and Etched’s Oasis is one of the clearest public examples of a real-time AI-generated game-like world. It responds to keyboard input and generates the next frames based on user action.

The point is not only that it looks like a game. The important point is that the world changes in response to user actions.

Marble and Gaussian splats

World Labs’ Marble outputs persistent, editable 3D worlds using Gaussian splats. Instead of only generating video, it creates 3D scene representations that can be exported into tools such as Unreal, Unity, Blender, and Houdini.

This makes Marble relevant for 3D design, spatial computing, games, VFX, and VR.
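For readers who have not met Gaussian splats before, a simplified sketch of the data structure may help. The `Splat` fields below follow the usual description of the technique (position, scale, rotation, colour, opacity), but the layout is an assumption and is not Marble's file format. The compositing function is also simplified: it assumes a ray that passes through the centre of every splat and ignores each splat's Gaussian falloff.

```python
from dataclasses import dataclass


@dataclass
class Splat:
    position: tuple[float, float, float]         # centre in 3D space
    scale: tuple[float, float, float]            # extent along each local axis
    rotation: tuple[float, float, float, float]  # orientation as a quaternion
    color: tuple[float, float, float]            # RGB in [0, 1]
    opacity: float                               # 0 = invisible, 1 = fully opaque


def composite(splats):
    """Front-to-back alpha blending along one ray looking down the +z axis."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # how much light still passes through to splats behind
    for splat in sorted(splats, key=lambda s: s.position[2]):
        for channel in range(3):
            color[channel] += transmittance * splat.opacity * splat.color[channel]
        transmittance *= 1.0 - splat.opacity
    return tuple(color)


unit, no_rotation = (1.0, 1.0, 1.0), (1.0, 0.0, 0.0, 0.0)
red = Splat((0.0, 0.0, 1.0), unit, no_rotation, (1.0, 0.0, 0.0), 0.5)
blue = Splat((0.0, 0.0, 2.0), unit, no_rotation, (0.0, 0.0, 1.0), 1.0)
print(composite([blue, red]))
```

Because the scene is an explicit list of objects in 3D space, it can be moved, deleted, relit, or exported. That is the practical difference from a video model, where the scene exists only as pixels.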

JEPA and V-JEPA

Meta’s V-JEPA follows the understanding-first philosophy. Instead of reconstructing pixels, it predicts abstract future representations.

This matters because physical AI needs useful representations of depth, motion, objects, and consequences, not just beautiful frames.
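The difference between a pixel-space loss and a latent-space loss can be shown with a toy example. Assumptions are heavy here: real JEPA models learn both the encoder and the predictor, whereas `encode` and `predict_latent` below are hand-written, and the "video" is a 32-pixel strip with a bright object that moves one pixel per frame.

```python
import random


def encode(frame):
    """Hypothetical fixed encoder: keeps where the object is and how wide it is."""
    bright = [i for i, value in enumerate(frame) if value > 0.5]
    if not bright:
        return [0.0, 0.0]
    centre = sum(bright) / len(bright)
    return [centre / len(frame), len(bright) / len(frame)]


def predict_latent(latent, shift_pixels, frame_len):
    """Hypothetical predictor: the object moves, its width stays the same."""
    return [latent[0] + shift_pixels / frame_len, latent[1]]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def make_frame(start, rng, n=32, width=4, noise=0.2):
    """A bright object on a dark strip, plus unpredictable per-pixel texture."""
    return [(1.0 if start <= i < start + width else 0.0) + rng.uniform(-noise, noise)
            for i in range(n)]


rng = random.Random(0)
context = make_frame(8, rng)
target = make_frame(9, rng)  # object moved one pixel; the texture noise is new

# Pixel-space prediction: shift the context frame right by one pixel.
pixel_prediction = [context[-1]] + context[:-1]
pixel_loss = mse(pixel_prediction, target)

# Latent-space prediction: move the object in representation space instead.
latent_loss = mse(predict_latent(encode(context), 1, len(context)), encode(target))

print(pixel_loss, latent_loss)
```

Both predictions get the motion right. The pixel loss is still penalised for noise that nobody could have predicted, while the latent loss only measures what the representation chose to keep.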

VLA models

VLA means Vision-Language-Action. These models connect what a robot sees, what it is told, and what it should do.

They are the practical bridge from today’s vision-language models to robot foundation models. If world models give AI a sense of consequence, VLA models help turn that understanding into action.

Infrastructure and compute

World models are expensive in a different way from language models. LLM inference is largely about moving huge weights efficiently. World-model inference has to maintain spatial and temporal consistency across frames, often in real time.

This is why NVIDIA is strategically important. The category needs GPUs, simulation frameworks, tokenizers, data pipelines, guardrails, physics engines, and deployment infrastructure. Platforms like Cosmos, Isaac Sim, Isaac Lab, and Omniverse make NVIDIA more than a hardware supplier; they make it part of the world-model development stack.

The data problem

World models need video, spatial data, interaction data, robot trajectories, and vehicle trajectories.

This is the binding constraint, especially in robotics. Real-world robot data is slow to collect, expensive, and sometimes unsafe. A robot cannot make millions of mistakes in a real warehouse as easily as a language model can process text.

The data strategy usually has three parts: collect real-world data, simulate additional scenarios, and use synthetic data to cover edge cases.

The best world-model companies are not only model companies. They are also data companies.

Part 3: How to Use and Apply Them

Getting started today

Most people will not train a world model from scratch. The practical entry point is to use platforms and products that expose world-model capabilities through APIs, tools, or interfaces.

Tool / platform	Best for	Why it matters
NVIDIA Cosmos	Robotics and AV teams	Open world foundation model platform and data pipeline
Isaac Sim / Isaac Lab	Robot simulation and policy validation	Lets teams test robots before real deployment
Omniverse	Industrial digital twins	Simulates factories, warehouses, and physical systems
World Labs Marble	3D design and spatial worlds	Generates persistent editable 3D scenes
Decart Oasis / Lucy	Interactive worlds and live video transformation	Shows real-time action-conditioned generation
Runway GWM	Creative worlds, avatars, and video	Extends video generation toward world models
MuJoCo	Robotics research	Open physics simulation for RL and embodied AI

NVIDIA Cosmos (open-source)

Isaac Sim (open-source)

Omniverse (open-source)

World Labs Marble

Decart Lucy 2.0

Runway GWM 1.0

MuJoCo (open-source)

For creators, world models may first feel like better 3D or video tools. For robotics teams, they are training and validation infrastructure. For autonomous-driving teams, they are a way to generate rare scenarios.

Practical evaluation questions

The most important practical question is not “is this a world model?” but “what role does it play?”

Question	Why it matters
Is it action-conditioned?	Without action, it is closer to video generation
Does it stay consistent over time?	Long-horizon memory is still hard
Does it understand 3D structure?	Physical AI needs geometry, not only pixels
Can it transfer to real environments?	Sim2real is the bottleneck
Is it usable in a workflow?	Export, API, latency, and cost matter

A tool can be impressive and still not be useful. Workflow matters.

Concrete use patterns

The clearest use cases are synthetic training data, safe simulation, off-policy evaluation, edge-case generation, and counterfactual testing.

Synthetic training data helps robots and vehicles see more scenarios than real-world collection can provide. Safe simulation lets agents practise before deployment. Off-policy evaluation estimates how a different policy would have performed, using data already logged under the current one. Edge-case generation creates rare, dangerous, or unusual cases at scale. Counterfactual testing asks how the world changes if lighting, object position, grip angle, camera angle, or vehicle behaviour changes.

This is where world models become valuable. They are not only content generators. They are tools for testing decisions before real-world action.
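Counterfactual testing in particular has a simple shape: hold a baseline scenario fixed, change one factor at a time, and ask the model what happens. The sketch below assumes a hypothetical `predicts_slip` function with a deliberately crude friction rule standing in for a real world model; the scenario fields and numbers are made up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    grip_force_n: float  # how hard the gripper squeezes, in newtons
    friction: float      # friction coefficient between gripper and object
    mass_kg: float       # object mass


def predicts_slip(scenario: Scenario) -> bool:
    """Toy stand-in for a world model: the object slips if friction cannot carry its weight."""
    return scenario.friction * scenario.grip_force_n < scenario.mass_kg * 9.81


baseline = Scenario(grip_force_n=20.0, friction=0.6, mass_kg=0.3)

# Counterfactuals: same grasp, but the surface gets progressively more slippery.
counterfactuals = {mu: replace(baseline, friction=mu) for mu in (0.6, 0.3, 0.1)}
results = {mu: predicts_slip(s) for mu, s in counterfactuals.items()}

for mu, slipped in results.items():
    print(f"friction={mu}: {'slips' if slipped else 'holds'}")
```

The same loop covers edge-case generation if the sweep is widened to rare combinations. The value comes from finding the failing condition in simulation rather than on a real object.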

Part 4: The Landscape — Who Is Building and Applying World Models

Figure: World Model Landscape

Supply-side builders and demand-side applications across frontier labs, simulation, robotics, autonomous vehicles, gaming, video, spatial AI, and digital twins.

Categories shown: Frontier & Foundation Labs, Infrastructure & Simulation, Robotics & Embodied AI, Autonomous Vehicles, Gaming & Interactive, Creative & Video, Spatial & 3D, Industrial & Digital Twins.

The world-model landscape has two sides: supply and demand.

The supply side includes labs, infrastructure providers, simulation companies, robotics model builders, video/world-generation startups, and autonomous-driving specialists.

The demand side includes robotics, autonomous vehicles, gaming, creative tools, 3D design, industrial automation, healthcare, science, AR/VR, and embodied agents.

Funding context

Figure: Latest disclosed funding rounds in world models. Selected disclosed rounds for world-model and physical-AI companies. Values shown in US$ billions.

The funding signal is unusually strong for such an early category. This is not just a research theme anymore; it is becoming a company-formation theme.

Capital is flowing into both sides of the category. Wayve represents autonomous-driving demand. World Labs represents spatial intelligence. AMI Labs represents the understanding-first JEPA thesis. Physical Intelligence represents robot foundation models. Decart and General Intuition represent real-time or gameplay-based world generation. Luma AI sits closer to the creative and video-native branch.

The important point is not only the headline numbers. Strategic corporates keep appearing: NVIDIA, Autodesk, Toyota, Uber, Adobe, Microsoft, and others. They are not only financial investors. They are potential customers, distribution partners, infrastructure suppliers, or ecosystem owners.

This is why the category matters. World models are not a narrow product category. They are a horizontal technology layer that can touch many physical and digital markets.

Challenges and Limitations

Temporal consistency / world degradation

Autoregressive systems generate one frame at a time and feed their output back in, so small errors can accumulate. A world may look good at first, then drift, distort, or forget what should remain stable.

The reality gap

Simulated behaviour does not automatically transfer to the real world. Real environments are messier, and closing the sim2real gap is still one of the hardest robotics problems.

Causal control vs. correlation

A model may learn that events co-occur without understanding what causes what. That is dangerous if the system is used for planning or intervention.

Compute cost

Real-time spatial consistency is expensive. World models need to handle time, space, interaction, and high-dimensional visual data, often at many frames per second.

Data provenance and licensing

Training on gameplay, video, spatial data, or internet media raises copyright and licensing questions. This is especially sensitive in gaming and creative industries.

Benchmarking and generality

There is no single agreed benchmark for how “general” a world model is. Claims about generality should be treated carefully until evaluation methods mature.

Action grounding

JEPA-style latent prediction is elegant, but converting abstract understanding into precise physical action remains hard. This is the planner bottleneck.

Conclusion: What to Watch Next

The most useful mental model for this field is the render → simulate → plan ladder.

Renderers are visible and commercial today. Simulators are becoming the strategic battleground. Planners are the unsolved prize.

Watch for four things.

First, longer stable interactive sessions. If generated worlds can remain coherent for much longer, the category becomes more useful.

Second, open models and benchmarks. Cosmos, V-JEPA, HunyuanWorld, and other open efforts matter because they let the field test and build.

Third, robotics transfer. The strongest proof will be robots learning from world models and performing new tasks in real environments.

Fourth, governance and provenance. As world simulators improve, watermarking, copyright, data rights, and misuse risks become more important.

The strategic punchline is that world models are complements to LLMs, not replacements. The likely architecture pairs an LLM for language and high-level reasoning with a world model for spatial and physical prediction, then connects both to a planner or VLA model for action.

Language models gave machines command of concepts. World models are how AI may come to grips with reality.

For now, the disciplined view is this:

renderers are real products, simulators are infrastructure, and planners are where the next decade of defensibility may be won or lost.

Glossary

Term	Meaning
World model	An AI system that learns an internal simulation of an environment and predicts how it changes, especially in response to actions
Action-conditioning	Predicting the next state given an action; the defining difference between a world model and a passive video generator
VAE	Variational Autoencoder; compresses raw images into a compact latent vector and can reconstruct them
Latent space	A compressed internal representation of information
MDN-RNN	A recurrent memory model that predicts the next latent state as a probability distribution
RSSM	Recurrent State-Space Model, used in Dreamer to combine deterministic memory and stochastic uncertainty
JEPA	Joint Embedding Predictive Architecture; predicts abstract future representations rather than reconstructing pixels
VLA model	Vision-Language-Action model; maps what a robot sees and is told into motor actions
Gaussian splats	A 3D scene representation using many semi-transparent particles instead of traditional polygon meshes
Sim2real	The challenge of transferring a policy trained in simulation to the real world
Digital twin	A simulated version of a real physical system, such as a factory, warehouse, machine, or robot cell
Planner	The layer that decides what action an agent should take
